The need to better define what is Public Data

(This page is part of my 2011 report on “Open Data: Emerging trends, issues and best practices”. Please follow that link to reach the Introduction and Table of Contents, but don’t forget to also check the notes for readers of the initial report of the same project, “Open Data, Open Society”.)

Together with citizen education, there is a huge challenge that Governments and the Open Data movement will have to face (hopefully together) in 2011 and beyond. This challenge is to update and expand the definition of Public Data and to have it accepted by lawmakers and public administrators.

What, exactly, is Public Data? A definition that is accepted almost implicitly is “data that is of public interest, that belongs to the whole community, data that every citizen is surely entitled to know and use”. This definition is so generic that accepting it, together with the assumption that all such data should be open as preached by the Open Data movement (online, as soon as possible, in machine-readable format, with an open license, etc.), doesn’t create any particular problem or conflict.

Real problems start, however, whenever we assume, more or less consciously, that “Public Data” in the sense defined above and the data directly produced by Governments and Public Administrations, normally called PSI (Public Sector Information), are the same thing. This has happened all too often so far.

There is no doubt that Governments and Public Administrations produce huge quantities of Public Data. But this is an age of privatization of many public services, from transportation to healthcare, energy and water management. This is an age in which many activities with potentially very serious impacts on whole communities, like processing of hazardous substances or toxic waste, happen outside Public Administrations. The paradox is that, as Sasaki put it, this increased privatization is happening in the very same period in which "we are observing a worldwide diffusion of access to information laws that empower citizens to hold government agencies accountable."

In such a context, “Public Data” is critical precisely because it is a much bigger set of data than what constitutes traditional, official PSI. “Public Data” includes all that information, plus the much larger amount of data describing and measuring the activities of private companies that have a direct impact on the health and rights of all citizens of the affected communities: from bus timetables to packaged food ingredients, aqueduct performance and the composition of fumes released into the atmosphere.

Are such data “Public” today, in the sense defined at the beginning of this paragraph, that is, something every citizen has the right to know without intermediaries or delegates? Should they be public? If yes, shouldn’t the law mandate that all such data be Open (that is, published online as soon as possible, in machine-readable format, with an open license, etc.), just like, for example, the budget of some Ministry? Answering these questions may be one of the biggest challenges for the Open Data community, and for society as a whole, in the coming years.

To facilitate reflection on this issue, here are a few recent, real-world examples of “Public Data” that are not PSI, and of the impacts of their lack of openness.

In April 2011, John Farrell wrote:

solar and other renewable energy developers must find the best places to plug in to the grid, e.g. where demand is high or infrastructure is stressed. The cost to connect distributed generation may also be lower in these areas. Unfortunately, data about a utility's grid system is rarely public.
California utilities are changing the game. Southern California Edison (SCE) rolled out a map of its grid system, highlighting (in red) areas that "could potentially minimize your costs of interconnection to the SCE system." Since as much as a third of the cost of PV can be recaptured via its benefits to the electric grid when properly placed in the distribution system, having this information is crucial for solar developers. _Public data also levels the playing field between independent power producers and the utilities, since the latter can use federal tax credits and their proprietary knowledge of the electric grid to build their own distributed renewable energy at the most attractive locations._
Having public data on distribution grid hot spots can make renewable energy development more cost effective and more democratic. Tell your utility to publish its map.

This, instead, is an excerpt from “This Data isn’t dull. It improves lives” (March 2011, New York Times), which looks at public transportation and consumer safety:

The USA Department of Transportation is considering a new rule requiring airlines to make all of their prices public and immediately available online. The postings would include both ticket prices and the fees for "extras" like baggage, movies, food and beverages. The data would then be accessible to travel Web sites, and thus to all shoppers.
The airlines would retain the right to decide how and where to sell their products and services. But many of them are insisting that they should be able to decide where and how to display these extra fees. The issue is likely to grow in importance as airlines expand their lists of possible extras, from seats with more legroom to business-class meals served in coach.
Electronic disclosure of all fees can make it much easier for consumers to figure out what a trip really costs, and thus make markets more efficient, without requiring new rules and regulations.
Another initiative has been proposed by the Consumer Product Safety Commission. In 2008, Congress overwhelmingly passed and President George W. Bush signed legislation mandating an online database of reported safety issues in products, at saferproducts.gov. The Web site ran for a few months in a "soft launch" and went into full operation on Friday.
Thirteen years ago, two parents were told that their 18-month-old son had died in an accident in a model of crib in which other children had died, yet there was no easy way for any parent or child-care provider to know that.

What about food? Here is what Christian Kreutz said in January 2011:

Nutrition is another interesting sector to use open data, which I discovered lately. A last example for food is the whole potential behind barcode scanning - you take your mobile phone to the supermarket and scan products to get the information behind the fair trade certificate or behind the company. In the recent dioxin scandal in Germany, the company Barcoo took information from the ministry of agriculture in Germany, of which farms have intoxicated eggs and offer the info in their app. So, you can check in the supermarket the eggs that are fine and not with your mobile phone.

Food in supermarkets is only one of thousands of cases of “Public Data” from a strategic sector of the economy that is huge, essential for the creation of local jobs and in deep crisis in many countries in this period: traditional, brick-and-mortar retail and service businesses.

Consider this explanation by venture capital firm Greylock of why they invested in Groupon (“The Power of Data”):

Groupon is targeting a market that is huge and broken. Local advertising is a $100 billion annual business in the U.S. and consumers spend something like 80% of their disposable income within a couple miles of their homes. Many local businesses still try to attract new customers through that heavy yellow book that gets dropped on your front doorstep until it rots or gets tossed in the recycling bin.
We think the technologies visible to consumers will be increasingly commoditized, while the data used to understand consumers better will become increasingly proprietary and valuable.
Offers to consumers can be intelligently served up based on a person's demographics, buying history and location. The merchant side of the equation is just as interesting. Local businesses need to be able to do more than just run a sale once or twice a year. The theater on Main Street or the children's museum across town should have the ability to revenue optimize, like United Airlines or Hilton, by appropriately pricing and marketing unsold capacity. We started really leaning forward in our chairs when the discussion turned to strategy, including the ways to use data to power Groupon's future consumer- and merchant-facing products.
We believe Groupon is the break-out leader in the massive local commerce space and its investment in data will be a critical ingredient in its long term march to build a meaningful and foundational company.

Groupon is the clear market leader in the local deals market in 2011. However, merchants are already complaining about the money they can lose by offering deals via Groupon. Now, couldn’t all the raw “local deals” information be considered Public Data that merchants could (be trained to) publish online directly themselves, in ways that would allow everybody, not just Groupon, to present the deals to customers in ways more profitable for merchants? The point is, how many merchants, merchant associations and mayors (whose budgets always and immediately benefit when local businesses make more money) are aware of this opportunity?