Google Declares Public Data Is Fair Game For Training Its AI

Republished By Plato

Google has updated its privacy policy to confirm it scrapes public data from the internet to train its AI models and services – including its chatbot Bard and its search engine, which now offers to generate answers to queries on the fly.

The fine print under research and development now reads: "Google uses information to improve our services and to develop new products, features and technologies that benefit our users and the public. For example, we use publicly available information to help train Google's AI models and build products and features like Google Translate, Bard and Cloud AI capabilities."

Interestingly, Reg staff outside the USA could not see the text quoted at the above link. However, this PDF version of Google's policy states: "We may collect information that's publicly available online or from other public sources to help train Google's AI models and build products and features, like Google Translate, Bard and Cloud AI capabilities."

The changes define Google's scope for AI training. Previously, the policy only mentioned "language models" and referred to Google Translate. But the wording has been altered to cover "AI models" and includes Bard and other systems built as applications on its cloud platform.

A Google spokesperson told The Register that the update hasn't fundamentally changed the way it trains its AI models.

"Our privacy policy has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate. This latest update simply clarifies that newer services like Bard are also included. We incorporate privacy principles and safeguards into the development of our AI technologies, in line with our AI Principles," the spokesperson said in a statement.

For years, developers have scraped the internet, photo albums, books, social networks, source code, music, articles, and more to collect training data for AI systems. The practice is controversial, however, because the material is typically protected by copyright, terms of use, and licenses, and it has led to lawsuits.

Some folks are unhappy not only that their own content is being used to build machine learning systems that replicate their work, and thus potentially endanger their livelihoods, but also that the output of the models flies too close to copyright or license infringement by regurgitating this training data unaltered.

AI developers may argue that their efforts fall under fair use, and that what the models output is a new form of work and not actually a copy of the original training data. It's a hotly debated problem.

Stability AI, for example, has been sued by Getty Images for harvesting and misusing millions of images from its stock image website to train its text-to-image tools. Meanwhile, OpenAI and its backer Microsoft have also been hit with multiple lawsuits, accusing them of inappropriately scraping "300 billion words from the internet, 'books, articles, websites and posts – including personal information obtained without consent'," and slurping source code from public repositories to create the AI pair-programming tool GitHub Copilot.

Google's rep declined to clarify whether or not the ad and search giant would scrape public copyrighted or licensed data or social media posts to train its systems.

Now that people are better informed about how AI models are trained, some internet businesses have started charging developers for access to their data. Stack Overflow, Reddit, and Twitter, for example, this year introduced charges or new rules for accessing their content through APIs. Other sites like Shutterstock and Getty have chosen to license their images to AI model builders, and have partnered up with the likes of Meta and Nvidia. ®