The nature of data

(This page is part of my Open Data, Open Society report. Please follow that link to reach the introduction and Table of Contents, but don’t forget to check the notes to readers!)

“Data is the new oil? No. Data is the new soil” David McCandless @ #TED

Understanding why it is crucial that PSI be published in raw, Open and Linked formats, as defined in the previous paragraph, is much easier if we look at the true nature of what we call data, that is, at all the consequences of the definition attempted in Chapter 2.

In any domain, not just in the Public Sector, raw data are only a starting point. Just as with soil, the intrinsic value of raw data, in and by themselves, is quite low, possibly lower than their cost. This is because what really has value is what grows on top of those data, and only thanks to their availability: the decisions taken by looking at data and, maybe even more, at the connections found among apparently unrelated data from totally independent sources. Data have value if and when they affect decisions and change consequences. A crucial corollary of this nature of data, which will be discussed later in this report, is that, in politics, (open) data make a difference only if enough citizens use them as a basis to vote and participate in other public activities. The more data are used, the more valuable they become, because the number of valuable decisions, goods, products and services based on them increases. The value of the data is embedded in the value of all those “products”, and it is proportional to the improvement in that value compared with the situation in which the data were not available. For all this to happen, however, data must be both reliable and really open, that is, freely accessible to everybody. Graves explicitly notes that: “When public sector bodies charge for PSI, those costs can actually inhibit others from adding value. The same is true with licensing restrictions”.

At this particular moment, when many governments have already generated huge amounts of digital data but have barely started to ask themselves what openness means and whether they should bother about it, the increase in future value is much bigger for all the data that have already been created, maybe many years ago, because in such cases all that remains to be done to create value (even if it is not a trivial task, of course) is to open those data, that is, to (re)publish them in the right way. Many essential data already exist in digital format, even if not all their potential users know it yet: re-generating them from scratch would be a huge waste of resources, but in some cases this is just what is happening. Maybe the best possible example of this problem is the OpenStreetMap project: its volunteers must not only add to existing maps data that were not available anywhere else before. At least in some countries, they also have to spend huge amounts of time re-creating maps that already exist, that is, redoing a job already done, probably with greater precision and reliability, by their own governments with their own tax money. Another example, from a very different sector, of how much valuable PSI may just be lying in some closet is in “The Socioeconomic Effects of Public Sector Information on Digital Networks”, 2009:

_The (EU) Commission put our language resources online - gigabytes of pairs of languages from machine translations that allow translations into 23 languages. These resources, which are unique, are works of a team of, I would say, thousands of translators during many, many years. This is something for which it is **very difficult to substitute the work of private companies**... We put it on the Web, issued a press release, and had between 1,000 and 1,500 downloads of the whole data set in the first week._

The other parts of this chapter explain these concepts in more detail by looking at several different spheres of activity, while the next chapter provides some concrete examples from several countries. Before that, however, it is important to answer a basic question: why couldn’t the public sector offer all these products and services based on PSI data by itself, regardless of the status of those data? In other words, is opening PSI data the only way to accomplish what is described in the next paragraphs, or are there less radical alternatives?

As will be evident from the next chapter, opening PSI data is indeed, if not the only solution, by far the best one. There are two main reasons for this statement, both related to the fact that raw data are like soil: what is really valuable is not the raw data themselves but what is done thanks to their availability, that is, their legal and technical openness and accessibility, no matter who does it.

First of all, no single Government apparatus (or any other single, more or less monolithic, organization) can know or figure out everything that is needed in societies as complex and interlinked as ours. Data that may seem insignificant to the Public Administration (PA) that generated them can be valuable because somebody else can connect them to something else, unknown to that PA. Besides, even if one single organization knew every way in which all PSI data can be used, it could never implement all of them by itself (even if it had the money, quite a rare condition these days). The Open Declaration on European Public Services says it clearly: “The needs of today’s society are too complex to be met by government alone”. This is why data have to be published with open formats and licenses, leaving it to all possible end users (from public bodies to individuals and businesses) to decide what to do with them.