by Thorin Klosowski at Electronic Frontier Foundation
Both OpenAI and Google have released guidance for website owners who do not want the two companies using the content of their sites to train the company’s large language models (LLMs). We’ve long been supporters of the right to scrape websites—the process of using a computer to load and read pages of a website for later analysis—as a tool for research, journalism, and archivers. We believe this practice is still lawful when collecting training data for generative AI, but the question of whether something should be illegal is different from whether it may be considered rude, gauche, or unpleasant. As norms continue to develop around what kinds of scraping and what uses of scraped data are considered acceptable, it is useful to have a tool for website operators to automatically signal their preference to crawlers. Asking OpenAI and Google (and anyone else who chooses to honor the preference) to not include scrapes of your site in its models is an easy process as long as you can access your site’s file structure.
We’ve talked before about how these models use art for training, and the general idea and process is the same for text. Researchers have long used collections of data scraped from the internet for studies of censorship, malware, sociology, language, and other applications, including generative AI. Today, both academic and for-profit researchers collect training data for AI using bots that go out searching all over the web and “scrape up” or store the content of each site they come across. This might be used to create purely text-based tools, or a system might collect images that may be associated with certain text and try to glean connections between the words and the images during training. The end result, at least currently, is the chatbots we’ve seen in the form of Google Bard and ChatGPT.
If you do not want your website’s content used for this training,…
Continue Reading