A simple, great summary of the BIG issues with Machine Learning

The six answers you want to have about Machine Learning, all in one place.


The founder of Pinboard, Maciej Cegłowski, has just published his statement about “Privacy Rights and Data Collection in a Digital Economy”. The addendum on Machine Learning in that document is a great, simple explanation of its intrinsic limits and of the risks that come from them. To make them even simpler to understand, I took the liberty of condensing it into an easier Q&A format. Questions and parts in italic are my own additions, and any error is mine only. But Mr. Cegłowski did a great job, and you really should read his FULL statement, or at least this summary of mine.

What on Earth is Machine Learning?

Machine learning is a mathematical technique for training computer systems to make accurate predictions from a large corpus of training data. Applications include face recognition in digital photographs, automatic text translation and prediction of individual consumer preferences.
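To make "training on data to make predictions" concrete, here is a minimal sketch in Python. It is not taken from Mr. Cegłowski's statement: the technique (a nearest-centroid classifier), the function names and the tiny viewing-habits dataset are all assumptions of mine, chosen only because they fit in a few lines. Real systems use far larger corpora and far more complex models.

```python
# Toy illustration only: a nearest-centroid classifier in plain Python.
# The data, labels, and function names are invented for this sketch.

def train(examples):
    """Learn one average point (centroid) per label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical training corpus:
# (weekly hours of sport videos, weekly hours of cooking videos) -> preference.
training_data = [
    ((9.0, 1.0), "sports fan"),
    ((7.5, 0.5), "sports fan"),
    ((1.0, 8.0), "home cook"),
    ((0.5, 6.5), "home cook"),
]

model = train(training_data)

# Predictions for two people the model has never seen.
print(predict(model, (6.0, 2.0)))
print(predict(model, (2.0, 5.0)))
```

Note that the "model" is nothing more than a handful of averaged numbers. Everything it knows comes from the examples it was fed.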

Why is Machine Learning so important now?

The theory of machine learning is not really new. But only in the last decade have computing power and data, collected primarily through social networks, been concentrated enough by investors for machine learning to finally demonstrate its full potential. For better or for worse.

What are the macro effects of Machine Learning on the economy and society?

Since machine learning has a voracious appetite for data and computing power, it contributes both to the centralization of the tech industry and to the pressure to maximize the collection of user data.

Are Machine Learning algorithms easy to analyze?

Not at all. They are really opaque, and this creates unique privacy problems and regulation challenges.

A key feature of machine learning is that its training phase occurs in places and times very different from those of its actual use. This can obscure the links between the data used to train a model and its ultimate behavior, making the consequences even less predictable, both over time and across different spheres of one’s life.

Trained machine learning models reveal nothing about the data that went into them. One cannot examine an image recognition model, for example, and point to the numbers that encode ‘apple’. Since it is not possible to reconstruct the input data, it is also impossible to make models forget some data.

Do current laws and regulations handle Machine Learning properly?

No, partly because of inference: the capability of Machine Learning algorithms to autonomously GUESS some critical data, like your ability to pay a mortgage, from other, more or less UNRELATED data.
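A minimal Python sketch of how such a guess can happen. This is my own toy example, not something from the statement: the records, the field names and the "most common outcome" rule are all invented, and real systems use much subtler statistics. The point is only that a field nobody disclosed can be guessed from fields that look harmless.

```python
# Toy illustration only: every record below is invented.
from collections import Counter, defaultdict

# Hypothetical records: (postal zone, favourite shop, defaulted on a loan?)
# The last field is the "critical" one; the first two look harmless.
records = [
    ("north", "discount", True),
    ("north", "discount", True),
    ("north", "boutique", False),
    ("north", "discount", True),
    ("south", "boutique", False),
    ("south", "boutique", False),
    ("south", "discount", False),
    ("south", "boutique", False),
]


def learn_guesses(rows):
    """For each combination of harmless fields, remember the most common outcome."""
    outcomes = defaultdict(Counter)
    for zone, shop, defaulted in rows:
        outcomes[(zone, shop)][defaulted] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in outcomes.items()}


guesses = learn_guesses(records)

# A new person who never disclosed any financial data at all.
newcomer = ("north", "discount")
print("Guessed to be a likely defaulter:", guesses[newcomer])
```

The newcomer handed over only a postal zone and a shopping habit, yet the system attaches a financial label to them anyway.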

It is not clear what legal status Machine Learning models trained on personal data have under privacy laws like the GDPR, or whether data transfer laws apply to moving a trained model across jurisdictions.

The opacity mentioned above, combined with the capacity for inference, also makes machine learning an ideal technology for circumventing legal protections on data use.


Are Machine Learning algorithms always accurate?

No, because they suffer from biases and are vulnerable to “adversarial inputs”:

  • Any explicit or unintentional biases in the training data are reflected in the behavior of the model
  • Minimal changes in input data, not noticeable by humans, are enough to make the model reach wrong conclusions. An image classifier that correctly identifies a picture of a horse might reclassify the same image as an apple if even one pixel of that image changes. Recent research suggests that this is an inherent, ineradicable feature of any machine learning system that uses current approaches
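
The second point can be sketched in a few lines of Python. Be aware of what I swapped in: instead of a real image classifier and a single changed pixel, this uses a hand-written linear model over four "pixels" and nudges each of them very slightly. The weights, the image and the labels are all hypothetical. The mechanism is the same in spirit: whoever knows the weights knows exactly which way to push.

```python
# Toy illustration only: a hand-written linear classifier, not a real image model.
# The weights, the "image", and the labels are invented for this sketch.

weights = [0.9, -0.6, 0.4, -0.8]   # hypothetical learned weights, one per pixel
bias = 0.0


def score(pixels):
    """Weighted sum of the pixels; positive means 'horse', otherwise 'apple'."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias


def classify(pixels):
    return "horse" if score(pixels) > 0 else "apple"


def nudge_against_model(pixels, step):
    """Shift every pixel by exactly `step`, in the direction that lowers the score."""
    return [p - step if w > 0 else p + step for w, p in zip(weights, pixels)]


image = [0.50, 0.40, 0.30, 0.25]            # pixel intensities between 0 and 1
tweaked = nudge_against_model(image, 0.05)  # each pixel moves by only 0.05

print(classify(image), round(score(image), 3))
print(classify(tweaked), round(score(tweaked), 3))
```

No pixel changed by more than 0.05 on a 0 to 1 scale, yet the label flips, because every small nudge was aimed in the direction the model is most sensitive to.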

Executive summary: does Machine Learning work?

Yes… to the extent that it excels at finding latent structure in data. But never forget that Machine Learning:

  • requires centralization of huge volumes of data, and both things (centralization and hugeness) create a bunch of other huge problems, by definition
  • is completely unaccountable, because it:
    • defies many human intuitions
    • makes decisions that resist analysis
    • obscures the link between source data and outcomes
  • is, finally, readily fooled by a knowledgeable adversary

As a final, personal observation… Have you noticed how much that last bullet embodies the universal and timeless “Knowledge is power” mantra that has ruled every human community for the last few thousand years?

IMAGE SOURCES: online comics collected here.