Why Open Data?

(This page is part of my Open Data, Open Society report. Please follow that link to reach the introduction and Table of Contents, but don’t forget to check the notes to readers!)

In this section we’ll see in more detail the main reasons for making as much Public Sector Information (PSI) as possible open and linked, in the sense described in the next paragraph. They are transparency, economic stimulus, savings in Public Administrations and effectiveness of non-profit organizations. The value of Open Data in education will be briefly explained later in the report.

Definition of Open/Linked raw data and their impact on government

Public data are really useful only when they are raw, really open and linked. We will now define, without going into technical details, what each of these three terms means. Only the simultaneous presence of all three characteristics makes it possible to get the maximum benefit from PSI. The reason is that only when data are published online in that way will every citizen or organization be able to automatically analyze and present them in easy-to-understand forms, as Google started doing in 2009 with its public data search feature.

Data are raw when each individual item is clearly labeled and can be immediately isolated from the others in order to be validated or reused, like the content of a single cell of a spreadsheet. Having the initial, raw data that are at the origin of some decision or action, instead of some aggregation of them, is extremely important when dealing with digital PSI. For example, publishing online in PDF format the spreadsheet containing the official budget of some city or ministry is certainly better than nothing, but it is still almost useless because those are not raw data.

A PDF file is, in fact, little more than a digital photograph of the printed version of some document, that is, of that part of its content, structure and meaning that is immediately visible on screen or paper. Therefore, in the PDF version of a spreadsheet you can no longer see the formulas, the raw numbers, or any macro or other hidden parameter that generates the final figures in the summary sheet, so you can’t judge whether those data, and the relations established among them inside the spreadsheet, are correct. In addition, in a PDF file you can’t modify the content of a cell to verify if and how charts or totals change as a consequence of changes in the starting numbers. The consequence is that, when the native digital form of the PSI under consideration is a spreadsheet, only the spreadsheet itself, or some equivalent version recorded in a database, can be considered “raw”.
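As a minimal sketch of what “raw” buys you in practice, the following Python fragment treats a budget as individually labeled items rather than as a printed page. The budget figures, the column names and the published total are all invented for illustration; they do not come from any real city or ministry.

```python
import csv
import io

# Hypothetical city budget, invented for illustration only.
RAW_BUDGET_CSV = """department,amount_eur
schools,1200000
roads,800000
parks,150000
"""


def load_budget(text):
    """Parse the CSV text into a dict mapping each department to its amount."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["department"]: int(row["amount_eur"]) for row in reader}


def total(budget):
    """Recompute the total from the individual items."""
    return sum(budget.values())


budget = load_budget(RAW_BUDGET_CSV)

# The figure a PDF summary sheet would show (assumed for this example).
published_total = 2150000

# With raw data, the published total can be checked item by item...
total_matches = total(budget) == published_total

# ...and a "what if" change to one cell propagates to the total.
what_if = dict(budget, roads=900000)
what_if_total = total(what_if)
```

Neither the check nor the “what if” recalculation is possible when the only thing published is a picture of the final sheet, because the individual cells no longer exist as separate values that a program can read.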

Similar considerations apply to any other form of PSI. Digital maps, for example, are made of many numbers, text strings, images and more or less regular shapes (coastlines or road paths) displayed together in one coherent view. Taking a snapshot of an interactive digital map in JPEG format will yield a static picture of just one of the countless messages that the map could carry, and actually does carry in its original form as a dynamic aggregate of raw data. In the JPEG snapshot, instead, river paths, coastlines, roads, addresses, points of interest and elevations no longer exist as single elements that a computer can recognize and use independently.

That’s why PSI must be made available in raw format. In all other cases the individual data cannot be reused anymore, not automatically at least!

Data are “open” when they are published and updated online as soon and as often as possible, in a way that makes it possible, at the lowest possible cost, to legally reuse them for free, for any purpose (including for-profit activities!) and to process them quickly, easily and automatically with any software. In practice, raw data are open when they have an open access license that allows what is described in the previous sentence, and when they are published in an open file format, or are directly accessible through the Internet with open protocols that are not hindered by patents or similar restrictions.

Once we have open raw data, in order to make the most of them we still need the possibility to quickly compare them with other information from different sources, ideally in an automatic way, that is by delegating all or part of the discovery and analysis work to some software program of our choice. This need, and the related need to quickly find which other data may be relevant for comparison, is what leads to the concept of linked data and to their importance for Open Government. It is both impossible and undesirable, for economic, technical and political reasons, to have one single, huge database for all kinds of PSI. Consequently, it is necessary to facilitate as much as possible the automatic linking, mixing and comparison of the contents of different, independently owned and maintained online public databases. With linked data, says WWW inventor Sir Tim Berners-Lee, “when you have some of it, you can find other, related, data”. This concept is also explained in the “5 stars of open linked data” paper. The practical definitions that follow are, in a sense, a technical version of the “follow the money” mantra used in investigative journalism. Here is a synthesis of those definitions:

  • each digital data object or resource should have a unique name and be accessible from the Internet with the same protocols used for normal web pages and services.

  • data must be available in non-proprietary, structured formats that make it easy to discover them and to attach to them links to other related objects or resources.
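The two points above can be illustrated with a small Python sketch. It assumes two independently maintained datasets, a city’s contracts and a company registry, in which every resource is named by a unique web address. All addresses (under the reserved example.org domain), company names and amounts are hypothetical, and plain dictionaries stand in for what would really be separate online databases reached over the web.

```python
# Two independently maintained datasets, invented for illustration.
# Every resource is named by a unique URI (hypothetical example.org addresses).
CONTRACTS = {
    "http://city.example.org/contract/17": {
        "amount_eur": 50000,
        "supplier": "http://registry.example.org/company/42",
    },
    "http://city.example.org/contract/18": {
        "amount_eur": 20000,
        "supplier": "http://registry.example.org/company/7",
    },
    "http://city.example.org/contract/19": {
        "amount_eur": 30000,
        "supplier": "http://registry.example.org/company/42",
    },
}

COMPANIES = {
    "http://registry.example.org/company/42": {"name": "Acme Paving"},
    "http://registry.example.org/company/7": {"name": "Green Parks Ltd"},
}


def spending_by_supplier(contracts, companies):
    """Follow each contract's supplier link and add up the money per company."""
    totals = {}
    for contract in contracts.values():
        # The supplier field is a link into the other dataset: follow it.
        company = companies[contract["supplier"]]
        name = company["name"]
        totals[name] = totals.get(name, 0) + contract["amount_eur"]
    return totals


totals = spending_by_supplier(CONTRACTS, COMPANIES)
```

Because the contracts refer to suppliers by a shared, unique name instead of a free-text label, a program can “follow the money” from one database into the other without any manual matching.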

Please note that Open is not the same as Linked: all PSI that is “public” can and should be both Linked and Open. In practice, though, it is equally possible to find Linked Data that aren’t Open for licensing reasons and Open Data in formats that don’t make automatic linking possible for purely technical reasons.