Thanks To Generative AI, Catching Fraud Science Is Going To Be This Much Harder

Feature Generative AI poses interesting challenges for academic publishers tackling fraud in science papers, as the technology shows the potential to fool human peer review.

Describe an image for DALL-E, Stable Diffusion, and Midjourney, and they'll generate one in seconds. These text-to-image systems have rapidly improved over the past few years. What began as a research prototype, producing benign and wonderfully bizarre illustrations of baby daikon radishes walking dogs in 2021, has since morphed into commercial software, built by billion-dollar companies, capable of generating increasingly realistic images.

These AI models can produce lifelike pictures of human faces, objects, and scenes, and it's looking like a matter of time before they get good at creating convincing scientific images and data too. Text-to-image models are now widely accessible, pretty cheap to use, and they could help dodgy scientists forge results and publish sham research more easily.

Image manipulation is already a top concern for academic publishers as it's the most common form of scientific misconduct. Authors can use all sorts of tricks to fake data, such as flipping, rotating, or cropping parts of the same image. Editors are fooled into believing that all the results presented are real, and publish the work.

Many publishers are now turning to AI software in an attempt to detect signs of image duplication during the review process. In most cases, images have been mistakenly duplicated by scientists who have muddled up their data, but sometimes duplication is used for blatant fraud.

But just as publishers begin to get a grip on image duplication, another threat is emerging. Some researchers may be tempted to use generative AI models to create fake data. In fact, there is evidence to suggest that sham scientists are doing this already.

AI-made images spotted in papers?

In 2019, DARPA launched its Semantic Forensics (SemaFor) program, which funds researchers developing forensic tools capable of detecting AI-made media, in order to combat disinformation.

A spokesperson for Uncle Sam's defense research agency confirmed it has spotted fake medical images published in real science papers that appear to be generated using AI. Before text-to-image models, generative adversarial networks were popular. DARPA realized these models, best known for their ability to create deepfakes, could also forge images of medical scans, cells, or other types of imagery often found in biomedical studies.

"The threat landscape is moving quite rapidly," William Corvey, SemaFor's program manager, told The Register. "The technology is becoming ubiquitous for benign purposes." Corvey said the agency has had some success developing software capable of detecting GAN-made images, and the tools are still under development.

"We have results that suggest you can detect 'siblings or distant cousins' of the generative mechanism you've learned to detect previously, irrespective of the content of the generated images. SemaFor analytics look at a variety of attributions and details associated with manipulated media, everything from metadata, statistical anomalies, to more visual representations," he said.

Some image analysts scrutinizing data in science papers have also come across what appear to be GAN-generated images. A GAN is a generative adversarial network, a type of machine-learning system that can generate writing, music, pictures, and more.

For instance, Jennifer Byrne, a professor of molecular oncology at the University of Sydney, and Jana Christopher, an image integrity analyst for journal publisher EMBO Press, came across an odd set of images that appeared in 17 biochemistry-related studies.

The pictures depicted a series of bands, commonly known as western blots, which indicate the presence of specific proteins in a sample. Curiously, they all seemed to have the same background. That's not supposed to happen.

Figure A from the Byrne and Christopher paper on suspicious papers

Examples of repeating backgrounds in western blot images, highlighted by the red and green outlines ... Source: Byrne, Christopher 2020

In 2020, Byrne and Christopher concluded that the suspicious-looking images were probably produced as part of a paper mill operation: an effort to mass-produce papers on biochemical studies using faked data, and to get them peer reviewed and published. Such a caper may be pulled off, for instance, to benefit academics who are compensated based on their accepted paper output, or to help a department hit a quota of published reports.

"The blots in the example shown in our paper are most likely computer-generated," Christopher told The Register.

"Screening papers both pre- and post-publication, I often come across fake-looking images, predominantly western blots, but increasingly also microscopy images. I am very aware that many of these are most likely generated using GANs."

Elisabeth Bik, a freelance image sleuth, can often tell when images have been manipulated, too. She pores over scientific paper manuscripts, hunting for duplicated images, and flags these issues for journal editors to examine further. But it's harder to combat fake images when they have been comprehensively generated by an algorithm.

She pointed out that although the repeated background in the images highlighted in Byrne and Christopher's study is a telltale sign of forgery, the actual western blots themselves are unique. The computer vision software Bik uses to scan papers and spot image fraud would find it hard to flag these bands because there are no duplications of the actual blots.

"We'll never find an overlap. They're all, I believe, artificially made. How exactly, I'm not sure," she told The Register.

It's easier to generate fake images with the latest generative AI models

GANs have largely been displaced by diffusion models. These systems generate unique pictures and power today's text-to-image software including DALL-E, Stable Diffusion, and Midjourney. They learn to map the visual representation of objects and concepts to natural language, and could significantly lower the barrier for academic cheating.

Scientists can just describe what type of false data they want generated, and these tools will do it for them. At the moment, however, they can't quite create realistic-looking scientific images. Sometimes the tools produce clusters of cells that look convincing at first glance, but they fail miserably when it comes to western blots.

This is the sort of thing these AI programs can generate:

Here's some @OpenAI DALL-E takes on biological cells
Specifically: "cells under a microscope" and "T-cells under a scanning electron microscope" pic.twitter.com/BgcZr3k5Q5
— Tara Basu Trivedi (@tbt94) August 23, 2022

William Gibson, a physician-scientist and medical oncology fellow (not the famous author), has shared more examples, including how today's models struggle with the concept of a western blot.

The technology is only going to get better, though, as developers train larger models on more data.

David Bimler, another expert at recognizing image manipulation in science papers, better known as Smut Clyde, told us: "Papermillers will illustrate their products using whatever method is cheapest and fastest, relying on weaknesses in the peer-review process."

"They could simply copy [western blots] from older papers but even that involves work to search through old papers. At the moment, I suspect, using a GAN is still some effort. Though that will change," he added.

DARPA is now looking to expand its SemaFor program to study text-to-image systems. "These kinds of models are fairly new and while in scope, are not part of our current work on SemaFor," Corvey said.

"However, SemaFor evaluators are likely to look at these models during the next evaluation phase of the program beginning Fall 2023."

Meanwhile, the quality of scientific research will erode if academic publishers can't find ways to detect fake AI-generated images in papers. In the best-case scenario, this form of academic fraud will be limited to paper mill schemes that don't receive much attention anyway. In the worst-case scenario, it will affect even the most reputable journals, and scientists with good intentions will waste time and money chasing false ideas they believe to be true. ®