Something everybody should demand: Datasheets for datasets

(Paywall-free popularization like this is what I do for a living. To support me, see the end of this post)

Why should data be less transparent than food?

Less than a month ago, I wrote about how “Big Tech” resists real oversight of its Artificial Intelligence activities, a resistance made obvious by how Google fired one of its top researchers, Dr. Timnit Gebru. In that post, I also summarized Dr. Gebru’s impressive work. Today, I discovered another part of her research that really deserves popularization, so let me try to do it.

You have the right to know how EVERY dataset that influences YOUR life was made

In 2018, Gebru and others wrote a paper titled “Datasheets for Datasets” that presents a simple, but really important and overdue idea. Algorithms now make obscure, but increasingly important (and often discriminatory) decisions, and all the sets of raw data used to train those algorithms should be treated just like the components that anybody can buy in any hardware or hobby store. Quoting from this much longer article about Dr. Gebru’s work:

“Electrical components are always accompanied by a datasheet specifying the details about how and where they were manufactured, and under what conditions it is safe to use them.”

“Datasets should come in the same way: with a datasheet detailing how a dataset was created, in which contexts it would be appropriate to use, potential biases or ethical issues, what work is needed to maintain it, and more.”

The paper also contains concrete examples, and proposals for standardization and regulation around machine learning.
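
For readers who program, here is a minimal sketch of what a machine-readable datasheet might look like. It is only an illustration: the class name, the field names (taken loosely from the quote above) and the sample dataset are all my own assumptions, not the format proposed in the paper.

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json


@dataclass
class Datasheet:
    """Hypothetical, minimal datasheet. Field names are illustrative only."""
    name: str
    how_created: str                      # how the dataset was put together
    appropriate_uses: List[str]           # contexts where using it makes sense
    known_biases: List[str] = field(default_factory=list)
    ethical_issues: List[str] = field(default_factory=list)
    maintenance: str = ""                 # what work is needed to keep it current

    def missing_fields(self) -> List[str]:
        """Return the names of all fields that were left empty."""
        return [key for key, value in asdict(self).items() if not value]


# An invented example dataset, used only to show the idea.
sheet = Datasheet(
    name="example-dataset",
    how_created="Collected by hand for demonstration purposes",
    appropriate_uses=["teaching"],
    known_biases=["very small sample"],
)

print(json.dumps(asdict(sheet), indent=2))
print("Still undocumented:", sheet.missing_fields())
```

The useful part is the last line: a datasheet makes it obvious which questions about a dataset nobody has answered yet.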

“Datasheets for datasets” seems an extremely obvious, long overdue idea, does it not? But there is more:

[Image: /img/model_cards_example_roads_food.jpg]

Building on that paper, Gebru and others created so-called “Model Cards for Model Reporting”, that is, ways to “organize the essential facts of machine learning models in a structured way, similar to nutrition labels for food.” In other words, a Model Card for an Artificial Intelligence/Machine Learning model would “clarify intended use of [that] model, limitations, details of performance evaluation (including checking for bias), and more”.
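
Again for readers who program, the sketch below shows one possible way to write such a card down and to run a very crude bias check on it. Everything in it is an assumption of mine made for illustration: the class, the field names, the group labels and the accuracy numbers are invented, and they are not the format used by the authors or by Google.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ModelCard:
    """Hypothetical, minimal model card. Field names are illustrative only."""
    model_name: str
    intended_use: str
    limitations: List[str]
    accuracy_by_group: Dict[str, float]   # evaluation results, split by group

    def largest_gap(self) -> float:
        """Difference between the best and worst accuracy across groups."""
        scores = self.accuracy_by_group.values()
        return max(scores) - min(scores)


# Invented numbers, used only to show the idea.
card = ModelCard(
    model_name="example-classifier",
    intended_use="Demonstration only",
    limitations=["Not evaluated outside the demo data"],
    accuracy_by_group={"group_a": 0.95, "group_b": 0.80},
)

gap = card.largest_gap()
print(f"Largest accuracy gap between groups: {gap:.2f}")
```

A large gap between groups does not prove discrimination by itself, but it is exactly the kind of fact that a card like this would put in plain sight.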

To know more about why you want both datasheets for datasets and those “model cards”, check out the online prototype, with explanations and concrete examples, set up by Google. Above all, tell everybody to demand “Datasheets for datasets”!

Who writes this, why, and how to help

I am Marco Fioretti, tech writer and aspiring polymath doing human-digital research and popularization.
I do it because YOUR civil rights and the quality of YOUR life depend more every year on how software is used AROUND you.

To this end, I have already shared more than a million words on this blog, without any paywall or user tracking, and am sharing the next million through a newsletter, also without any paywall.

The more direct support I get, the more I can continue to inform, for free, parents, teachers, decision makers, and everybody else who should know more stuff like this. You can support me with paid subscriptions to my newsletter, donations via PayPal (mfioretti@nexaima.net) or LiberaPay, or in any of the other ways listed here. THANKS for your support!