Text-to-image Models Learn More Efficiently With Fake Data

Republished By Plato

Synthetic images can help AI models learn visual representations more accurately than real snaps can, according to computer scientists at MIT and Google. The result is neural networks that are better at making pictures from your written descriptions.

At the heart of all text-to-image models is their ability to map objects to words. Given an input text prompt – such as "a child holding a red balloon on a sunny day," for example – they should return an image approximating the description. In order to do this, they need to learn the visual representations of what a child, red balloon, and sunny day might look like.

The MIT-Google team believes neural networks can generate more accurate images from prompts after being trained on AI-made pictures rather than real snaps. To demonstrate this, the group developed StableRep, which learns from pictures generated by the popular open source text-to-image model Stable Diffusion how to turn descriptive written captions into correct corresponding images.

In other words: using an established, trained AI model to teach other models.

As the scientists' pre-print paper, released via arXiv at the end of last month, puts it: "With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large scale datasets." SimCLR and CLIP are machine-learning methods for learning visual representations: SimCLR learns from images alone, while CLIP learns from images paired with their captions.

"When we further add language supervision, StableRep trained with 20 million synthetic images achieves better accuracy than CLIP trained with 50 million real images," the paper continues.

Machine-learning algorithms capture the relationships between the features of objects and the meanings of words as arrays of numbers. With StableRep, the researchers can control this process more carefully by training a model on multiple images that Stable Diffusion generates from the same prompt. This means the model can learn more diverse visual representations, and can see which images match the prompts more closely than others.
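To make the idea concrete, here is a minimal sketch of a multi-positive contrastive loss, in which every image generated from the same caption counts as a match for every other. It is an illustration under stated assumptions, not the researchers' code: the function names, the toy two-dimensional embeddings, and the temperature of 0.1 are all hypothetical, and a real system would get its embeddings from a trained image encoder.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Average cross-entropy between a uniform target over same-caption
    images and the softmax of each anchor's similarities to all other images.

    `temperature` is an illustrative assumption, not a value from the paper.
    """
    unit = [normalize(e) for e in embeddings]
    count = len(unit)
    losses = []
    for i in range(count):
        others = [j for j in range(count) if j != i]
        positives = [j for j in others if caption_ids[j] == caption_ids[i]]
        if not positives:
            continue  # an image with no same-caption partner gives no signal
        sims = [
            sum(a * b for a, b in zip(unit[i], unit[j])) / temperature
            for j in others
        ]
        probs = softmax(sims)
        target = 1.0 / len(positives)  # uniform weight on each positive
        losses.append(
            -sum(
                target * math.log(p)
                for j, p in zip(others, probs)
                if caption_ids[j] == caption_ids[i]
            )
        )
    if not losses:
        raise ValueError("need at least two images sharing a caption")
    return sum(losses) / len(losses)


# Toy data: four images, two per caption (caption 0 and caption 1).
caption_ids = [0, 0, 1, 1]
grouped = [[1.0, 0.1], [0.9, 0.2], [-0.1, 1.0], [0.2, 0.9]]  # same caption, nearby
mixed = [[1.0, 0.1], [-0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]    # same caption, far apart

print("grouped:", multi_positive_contrastive_loss(grouped, caption_ids))
print("mixed:", multi_positive_contrastive_loss(mixed, caption_ids))
```

The loss is lower when images sharing a caption sit close together in the embedding space, which is the behavior such training rewards.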

"We're teaching the model to learn more about high-level concepts through context and variance, not just feeding it data," Lijie Fan, lead researcher of the study and a PhD student in electrical engineering at MIT, explained this week. "When using multiple images, all generated from the same text, all treated as depictions of the same underlying thing, the model dives deeper into the concepts behind the images – say the object – not just their pixels."

As noted above, this approach also means you can train your neural network on fewer synthetic images than real ones and still get better results – which is a win-win for AI developers.

Methods like StableRep mean that text-to-image models may one day be trained on synthetic data. It would allow developers to rely less on real images, and may be necessary if AI engines exhaust available online resources.

"I think [training AI models on synthetic images] will be increasingly common," Phillip Isola, co-author of the paper and an associate professor of computer vision at MIT, told The Register. "I think we will have an ecosystem of some models trained on real data, some on synthetic, and maybe most models will be trained on both."

It's difficult to rely solely on AI-generated images because their quality and resolution are often worse than those of real photographs. The text-to-image models that generate them are limited in other ways too: Stable Diffusion, for one, doesn't always produce images that are faithful to text prompts.

Isola warned that using synthetic images doesn't skirt the potential issue of copyright infringement either, since the models generating them were likely trained on protected materials.

"The synthetic data could include exact copies of copyright data. However, synthetic data also provides new opportunities for getting around issues of IP and privacy, because we can potentially intervene on it, by editing the generative model to remove sensitive attributes," he explained.

The team also warned that training systems on AI-generated images could potentially exacerbate biases learnt by their underlying text-to-image model. ®