Facebook Scraped 1 Billion Pictures From Instagram to Train Its A.I.

by Dave Gershgorn at OneZero

Facebook researchers announced a breakthrough yesterday: They have trained a “self-supervised” algorithm using 1 billion Instagram images, proving that the algorithm doesn’t need human-labeled images to learn to accurately recognize objects.

Typically, the most accurate image recognition algorithms require humans to label images as containing dogs, horses, people, or any other subject; the algorithm then finds similarities between images that humans have indicated contain the same objects. Facebook’s chief A.I. scientist, Yann LeCun, has spent decades on a mission to end A.I.’s reliance on labels, calling that goal the “holy grail” of A.I.

But Facebook didn’t just select any billion Instagram images to train the algorithm. The team purposely excluded Instagram images from the European Union, describing its training data in the paper as “random, public, and non-EU images.” While the rest of the world’s Instagram images are fair game, EU residents don’t have to worry about their images being used to generate Facebook’s next big algorithm.

OneZero asked Facebook whether the exclusion was motivated by the EU’s General Data Protection Regulation (GDPR), which gives users greater insight into how companies use their data and protects against data use without consent. A Facebook spokesperson acknowledged the question but did not immediately reply to the request for comment.

Whether it was because the use of data would be a GDPR violation, or just that Facebook didn’t want to give the impression of impropriety, it’s likely that the law had a chilling effect on the use of private data.

Jules Polonetsky, CEO of Future of Privacy Forum, told OneZero in a message that it’s not unusual for companies to err on the side of caution when collecting data in Europe.

“[It’s] quite common for global companies to be more limited in how they use data subject to GDPR,” he wrote, noting that explicit informed consent is often required for use of sensitive data.

Instagram’s terms of use give Facebook enormous freedom to do whatever it wants with your data, by giving the company a license to use, replicate, and modify any information you upload to the platform. But EU courts and regulators have decided that large-scale scraping of personal data, especially images, violates GDPR. For instance, a German court decided Clearview AI’s data scraping practices violated the European privacy law. In another decision against web scraping, Polish regulators found that a digital marketing company had not adequately obtained users’ consent when processing their data.

Facebook’s data practices have been widely criticized around the world, whether under GDPR, under newer privacy laws, or, more recently, in the United States. A recent settlement in Illinois left Facebook with a $650 million bill for violating the state’s Biometric Information Privacy Act by processing images with facial recognition.

In early May 2018, just weeks before the European data guidelines went into effect,…