(This page is part of my Open Data, Open Society report. Please follow that link to reach the introduction and Table of Contents, but don’t forget to check the notes to readers!)

The transparency in government that is achievable by opening PSI data can reduce fraud and curb unnecessary spending: “(in Canada) a $3.2 billion tax evasion fraud was exposed when financial data was made publicly available”. Other data can let each voter measure objectively the distance between citizens and their representatives, by generating easy summaries of how those representatives voted on the issues that he or she considers important. Any service of this kind, however, is possible and useful only as long as full access is granted not just to the final actions and decisions of those representatives (e.g. public budgets), but also to all the raw data and methods they used to come to those decisions.

As the Economist “Big Data” report puts it, “in a world of big data, correlations surface almost by themselves. Access to data creates a culture of accountability” (maybe even more than laws punishing corruption). Transparency can also save lives: “If the inspection notes (of a mine in the United States) had been available, someone may have brought some much-needed attention to (failures and omissions in mining safety procedures) and maybe a disaster would have been averted.” In spite of all this, many citizens still ask for, or have available, far less information about their representatives than they would about somebody they employ in their business or hire for any service.

Now, just as in healthcare, prevention in government monitoring is much better and cheaper than therapy. Investigations and trials to uncover and remedy bribery, or any other wrongdoing that may have happened many years earlier, cost much more than putting every public process under thorough and truly public scrutiny from its beginning.

Therefore, if real transparency is to be possible, data about public procedures, tenders and so on must become public as soon as they are generated, in formats suitable for immediate mash-ups in one table or diagram (cf. the examples in the following chapters) that can summarize complex issues in the smallest possible space. In fact, to have concrete beneficial effects on public activities and services, transparency, or the lack thereof, must be both very quick and easy to visualize.

In this context, a particularly interesting application of transparency, only possible through Open, Linked raw data, would be the automatic detection of corruption almost in real time, by anybody interested in doing so, with obvious beneficial effects for society. The same would hold for any other anomaly, including positive ones like cases of excellence or innovative best practices that everybody could follow. A UK citizen, for example, already proposed to subject raw PSI data to Benford's Law, which states that “in any list of numbers drawn from real life, the recurrence of digits from 0 to 9 follows a predictable pattern. Any deviation from this pattern would suggest that… some anomaly is occurring”. Also from the UK comes a similar proposal, namely the online publication of overdue tax payments from businesses, on the basis that “as a business owner I would like to know for free which businesses are late with those payments, by how much and how long… as this is the first indicator of their ability to pay me. I could choose whether to extend credit to them with much more knowledge than currently… to avoid me losing money and threatening the jobs of my staff when these businesses fail”.
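To give an idea of how little code such a check needs once the raw numbers are public, here is a minimal sketch in Python. It relies on the most common formulation of Benford's Law, which concerns only the first significant digit d (from 1 to 9) and expects it to occur with probability log10(1 + 1/d). The function names and the use of a chi-square statistic to measure deviation are my own assumptions for illustration, not part of the proposal quoted above, and the input is assumed to be a plain list of finite amounts.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """First significant digit of a finite number, or None for zero."""
    value = abs(value)
    if value == 0:
        return None
    # Scientific notation puts the first significant digit first,
    # e.g. 3200 -> '3.200000000000000e+03'
    return int(f"{value:.15e}"[0])


def benford_deviation(amounts):
    """Return (observed shares, chi-square statistic) for a list of amounts.

    Hypothetical helper: 'amounts' stands for any published list of
    figures, such as payments or tender values.
    """
    digits = [d for d in map(leading_digit, amounts) if d is not None]
    if not digits:
        raise ValueError("need at least one non-zero amount")
    n = len(digits)
    counts = Counter(digits)
    expected = benford_expected()
    observed = {d: counts.get(d, 0) / n for d in range(1, 10)}
    # Sum of (observed count - expected count)^2 / expected count
    chi_square = sum(
        (counts.get(d, 0) - n * expected[d]) ** 2 / (n * expected[d])
        for d in range(1, 10)
    )
    return observed, chi_square
```

With nine possible digits there are eight degrees of freedom, so a chi-square value above roughly 15.5 would be unusual at the conventional 5% level. Such a result is only a hint that a dataset deserves a closer look, not proof of wrongdoing: many perfectly legitimate lists of numbers (for example amounts capped by a fixed limit) do not follow Benford's Law at all.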

Another interesting trend or possibility in this space is crowdsourcing, that is the delegation of basic tasks, from rough data analysis to data entry and/or digitization, to the crowds, that is to large numbers of casual volunteers without particular skills, but willing to contribute in any way they can to some specific cause or project. In December 2009, following the release of MP expenses documents in the UK, Simon Willison and others built a web application for the Guardian newspaper that asked readers to help the newspaper dig through and categorize an enormous stack of documents - around 30,000 pages of claim forms, scanned receipts and hand-written letters, all scanned and published as PDFs, that is in an absolutely non-raw and non-linked format, and therefore of very little use.

The important thing in all the cases above, regardless of their feasibility, wording or the particular algorithms that should be used, is that they are not demands for the Public Administrations involved to do lots of extra work, that is to add further items to already very tight budgets. The real request is to give all citizens (who could also analyze the data collaboratively on their own, with schemes similar to the SETI@home project) what they need to do the job by themselves, that is data.