Something everybody should demand: Datasheets for datasets

Why should data be less transparent than food?

Less than one month ago, I wrote about the resistance of “Big Tech” to real control on their Artificial Intelligence activities, a resistance made obvious by how Google fired one of its top researchers, Dr. Timnit Gebru. In that post, I also summarized the impressive work of Dr. Gebru. Today, I have discovered another part of her research that really needs popularization, so let me try to do it.

You have the right to know how EVERY dataset that influences YOUR life was made

In 2018, Gebru and others wrote a paper titled “Datasheets for Datasets” that presents a simple, but really important and overdue idea: all the sets of raw data used to train the algorithms that make obscure, but increasingly important (and often discriminating) decisions should be treated just like the components that everybody can buy in any hardware or hobby store. Quoting from this much longer article about Dr. Gebru’s work:

“Electrical components are always accompanied by a datasheet specifying the details about how and where they were manufactured, and under what conditions it is safe to use them."

“Datasets should come in the same way: with a datasheets detailing how a dataset was created, in which contexts it would be appropriate to use, potential biases or ethical issues, what work is needed to maintain it, and more."

The paper also contains concrete examples, and proposals for standardization and regulation around machine learning.

“Datasheets for datasets” really seems an extremely obvious, now long overdue idea, does it not? But there is more:

Something everybody should demand: Datasheets for datasets /img/model_cards_example_roads_food.jpg

Building on that paper, Gebru and others created so-called “Model Cards for Model Reporting”, that is ways to “organize the essential facts of machine learning models in a structured way, similar to nutrition labels for food." A Model Card for an Artificial Intelligence/ Machine Learning model, that is, would “clarify intended use of [that] model, limitations, details of performance evaluation (including checking for bias), and more”.

To know more about why you want both datasets for datasheets and those “model cards”, check out the online prototype, with explanations and concrete examples, set up by Google. Above all, tell everybody to demand “Datasets for datasheets”!