How much of science will be written by AI?
And how many people will be able to tell the difference?
ChatGPT, an AI that constructs text from specific prompts, has garnered much attention and concern in recent months as its use has become extremely widespread. This has raised concerns over the next step in AI-constructed literature and art, in which it may become difficult to determine whether something is man-made or a simulacrum produced by AI.
So far, the discourse has looked at ChatGPT and the like through a cultural lens, considering whether people may take advantage of its output to aid their own writing or to artificially construct written pieces that they then take credit for. Some news outlets may turn to AI-generated news articles rather than hiring journalists and writers to do the same work.
It hasn’t been uncommon to see the same news article pop up in my news feed every now and then, and it makes me rather curious if this is intentional or just coincidental.
For instance, this news article appears in my news feed as “new” nearly every other day.
Can anyone remember when XBB was first discussed?
Given the headline, I’m curious whether the continuous re-release of such an article is intended to stoke some form of panic. If so, it raises serious concerns that outlets may routinely release articles in ways that artificially direct public opinion and discourse.
However, it’s one thing to shape science discourse; it’s another if science itself is artificially shaped by AI.
That is to say, the same technology that can construct news articles can also construct science articles, and the effects there may be rather devastating.
But hypotheticals don’t mean much unless examples are provided.
Let’s play a little game of “guess the AI-generated abstract”. Given my previous articles on semaglutide, it seems fitting to look at two abstracts describing this drug’s effects on childhood obesity: one published in NEJM and another constructed by ChatGPT with an air of NEJM-esqueness to it.
Can you guess which one is which?
Abstract #1
Objective
To evaluate the efficacy and safety of once-weekly semaglutide, a glucagon-like peptide-1 receptor agonist, in adolescents with obesity.
Methods
This was a randomized, double-blind, placebo-controlled, parallel-group trial conducted at 29 clinical centers in the United States and Canada. A total of 199 adolescents aged 12 to <18 years with a body mass index ≥35 kg/m2 or ≥30 kg/m2 with comorbidities were randomized to receive once-weekly subcutaneous injections of semaglutide (1.0 or 0.5 mg) or placebo for 68 weeks. The primary outcome was the change in body weight from baseline to week 68.
Results
At week 68, the mean change in body weight was −11.9 kg in the semaglutide 1.0-mg group, −9.5 kg in the semaglutide 0.5-mg group, and −2.5 kg in the placebo group (P < 0.001 for both semaglutide groups vs. placebo). The proportion of participants who lost ≥5% and ≥10% of their body weight was greater in the semaglutide groups than in the placebo group (P < 0.001 for both comparisons). Semaglutide was associated with statistically significant improvements in blood pressure, glycemic control, and lipid profiles. The incidence of adverse events was similar in the semaglutide and placebo groups, with no significant differences in the incidence of serious adverse events or serious treatment-emergent adverse events.
Conclusions
In this randomized trial, once-weekly semaglutide was effective and well tolerated in adolescents with obesity, leading to clinically meaningful and statistically significant weight loss and improvements in cardiometabolic risk factors.
Trial Registration
ClinicalTrials.gov, NCT03702497.
Abstract #2
BACKGROUND
A once-weekly, 2.4-mg dose of subcutaneous semaglutide, a glucagon-like peptide-1 receptor agonist, is used to treat obesity in adults, but assessment of the drug in adolescents has been lacking.
METHODS
In this double-blind, parallel-group, randomized, placebo-controlled trial, we enrolled adolescents (12 to <18 years of age) with obesity (a body-mass index [BMI] in the 95th percentile or higher) or with overweight (a BMI in the 85th percentile or higher) and at least one weight-related coexisting condition. Participants were randomly assigned in a 2:1 ratio to receive once-weekly subcutaneous semaglutide (at a dose of 2.4 mg) or placebo for 68 weeks, plus lifestyle intervention. The primary end point was the percentage change in BMI from baseline to week 68; the secondary confirmatory end point was weight loss of at least 5% at week 68.
RESULTS
A total of 201 participants underwent randomization, and 180 (90%) completed treatment. All but one of the participants had obesity. The mean change in BMI from baseline to week 68 was −16.1% with semaglutide and 0.6% with placebo (estimated difference, −16.7 percentage points; 95% confidence interval [CI], −20.3 to −13.2; P<0.001). At week 68, a total of 95 of 131 participants (73%) in the semaglutide group had weight loss of 5% or more, as compared with 11 of 62 participants (18%) in the placebo group (estimated odds ratio, 14.0; 95% CI, 6.3 to 31.0; P<0.001). Reductions in body weight and improvement with respect to cardiometabolic risk factors (waist circumference and levels of glycated hemoglobin, lipids [except high-density lipoprotein cholesterol], and alanine aminotransferase) were greater with semaglutide than with placebo. The incidence of gastrointestinal adverse events was greater with semaglutide than with placebo (62% vs. 42%). Five participants (4%) in the semaglutide group and no participants in the placebo group had cholelithiasis. Serious adverse events were reported in 15 of 133 participants (11%) in the semaglutide group and in 6 of 67 participants (9%) in the placebo group.
CONCLUSIONS
Among adolescents with obesity, once-weekly treatment with a 2.4-mg dose of semaglutide plus lifestyle intervention resulted in a greater reduction in BMI than lifestyle intervention alone. (Funded by Novo Nordisk; STEP TEENS ClinicalTrials.gov number, NCT04102189.)
So which one was the authentic Abstract, and which one was ChatGPT-constructed?
Abstract #2 comes from an actual NEJM article by Weghuber, et al.,1 published at the end of 2022, while Abstract #1 was constructed by ChatGPT and used in a study by Gao, et al.,2 in which the researchers prompted ChatGPT to construct a title and abstract modeled after original scientific papers using the following prompt:
The prompt fed to the model was ‘Please write a scientific abstract for the article [title] in the style of [journal] at [link]’.
The intent was to see whether AI detection tools and plagiarism tools, as well as human reviewers, could determine whether the title and abstract they were given was original or AI-generated (one reviewer was randomly assigned one version of a given abstract while another reviewer was given the other).
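To give a sense of how little effort this takes, here is a rough sketch of what generating such abstracts in bulk might look like. To be clear, this is my own illustration and not the code from Gao, et al.; it assumes the openai Python client, a placeholder model name, and a hypothetical list of paper titles, and simply fills in the prompt template reported in the study for each one.

```python
# Illustrative sketch only -- not the Gao, et al. pipeline.
# Assumes the `openai` Python client is installed and an API key is set in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical inputs: real titles/journals/links would be scraped or curated.
papers = [
    {
        "title": "Once-Weekly Semaglutide in Adolescents with Obesity",
        "journal": "NEJM",
        "link": "https://www.nejm.org/doi/full/10.1056/NEJMoa2208601",
    },
]

def generate_abstract(paper: dict) -> str:
    """Fill the prompt template reported by Gao, et al. and return the model's text."""
    prompt = (
        f"Please write a scientific abstract for the article {paper['title']} "
        f"in the style of {paper['journal']} at {paper['link']}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model could be substituted
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for paper in papers:
    print(generate_abstract(paper))
```

The specifics hardly matter; the point is that a short loop like this could churn out a convincing-looking abstract for every title on a list.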
AI detection tools appeared to have a rather good hit rate, while plagiarism tools completely failed to flag the generated abstracts:
Nearly all the generated text were deemed completely original by the plagiarism checker with a median originality score of 100% [IQR 100,100] (Figure 3). As a positive control, we ran original articles through the plagiarism checker, with originality scores of 38.5% [15.25,56.75], almost always matching against the original article as the source of ‘plagiarism’.
Note that AI detection tools are determining whether the abstracts provided are “fake” while the plagiarism tool is determining whether the abstracts are “original”.
Essentially, the tools’ results suggest that the abstracts are completely original, even if they contain wonky, non-human language structures (if that’s the proper phrasing).
This was reflected in anecdotes from the human reviewers, who overall did fairly decently, although they were more prone to label a generated abstract as an original work than the other way around:
Reviewers were able to correctly identify 68% of generated abstracts as being generated, and correctly identified 86% of original articles as being original. They incorrectly identified 32% of generated abstracts as being real, and 14% of original abstracts as being generated (p<0.001 by Fisher’s Exact test).
When detailing what stood out about the generated abstracts, reviewers noted that the language was rather vague and the numbers seemed randomly generated, which led to some skepticism.
Note that Abstract #2 provides a great deal more specificity about the adverse reactions and the measures taken. In contrast, Abstract #1 provides little of that contextual information.
More importantly, Abstract #2 links to an actual clinical trial whereas the link provided in Abstract #1 leads nowhere.
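This suggests one quick authenticity check: see whether the registration number resolves at all. Below is a rough sketch of that idea, assuming Python’s requests library and ClinicalTrials.gov’s public v2 API; the exact endpoint and response format may change, so treat it as an illustration rather than a vetted tool.

```python
# Rough illustration: check whether an NCT number resolves on ClinicalTrials.gov.
# Assumes the public v2 API endpoint; adjust if the API changes.
import requests

def trial_exists(nct_id: str) -> bool:
    """Return True if ClinicalTrials.gov has a record for this NCT number."""
    url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
    resp = requests.get(url, timeout=10)
    return resp.status_code == 200

# The registration numbers listed in the two abstracts above
for nct in ["NCT04102189", "NCT03702497"]:
    print(nct, "found" if trial_exists(nct) else "not found")
```

A registration that resolves doesn’t prove an abstract is honest, but one that doesn’t resolve is a strong hint that something is off.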
It’s rather concerning how many AI-generated abstracts were deemed “authentic” by human reviewers, and it spells trouble for a world that relies on abstract-only reading and rarely verifies sources.
Consider that even now, many news outlets and indeed even many Substacks rely on abstracts when presenting new, relevant studies.
Who’s to say that, at some point in the near future, many of these abstracts won’t be artificially constructed and disseminated to readers who are none the wiser to their falsities? Even more damning, consider that the numbers in these generated abstracts were completely made up and may actually misinform readers about the science (i.e., one could artificially construct a convincing abstract suggesting that “X” is effective).
There are some serious ramifications to consider if such tools become more widely adopted and make it far more difficult to discern whether data was based on real experiments or completely made up, especially if such synthetic data is used to push for public health policies or misdirect fields towards dead ends.
In the past few years, several science papers have come under scrutiny for manipulated slide images and even copied-and-pasted blot results, something that has raised serious questions in Alzheimer’s research.
So there’s already a precedent set that questions the veracity of previously published studies.
So far, AI has not been able to artificially construct slide images or other assay images (as far as I’m aware…), although many programs can “clean up” images to make them more discernible, which has raised its own ethical questions about how much manipulation should be allowed (refer to the “Data integrity and plagiarism” section of the Journal of Cell Biology’s policies as an example).
And in the same way that many tools can detect fraudulent or fake papers, it wouldn’t come as a surprise if those same tools were used to evade detection: just edit the work enough that it won’t get caught.
In a world of ambiguity, look for authenticity
At some point in the future we may reach a point where fake studies constructed by AI will then become cited by other AI and reported in AI-generated news outlets, never touching actual human hands.
So far, several science articles have actually listed ChatGPT as a co-author, and many journals are now scrambling to create policies on whether ChatGPT should have any place in the publication process.
In response to this conundrum, Holden Thorp, editor-in-chief of Science, released an article in which he argued that Science will not allow its authors to use ChatGPT, as the use of ChatGPT would, by extension, be a form of plagiarism:
For years, authors at the Science family of journals have signed a license certifying that “the Work is an original” (italics added). For the Science journals, the word “original” is enough to signal that text written by ChatGPT is not acceptable: It is, after all, plagiarized from ChatGPT. Further, our authors certify that they themselves are accountable for the research in the paper. Still, to make matters explicit, we are now updating our license and Editorial Policies to specify that text generated by ChatGPT (or any other AI tools) cannot be used in the work, nor can figures, images, or graphics be the products of such tools. And an AI program cannot be an author. A violation of these policies will constitute scientific misconduct no different from altered images or plagiarism of existing works. Of course, there are many legitimate data sets (not the text of a paper) that are intentionally generated by AI in research papers, and these are not covered by this change.
It remains to be seen whether such policies will hold up, or even if measures will be taken to ensure that artificial tools don’t make their way into the research and publication process in some manner.
As scientists grapple with what to do about ChatGPT, it becomes even more imperative that we become more discerning, critical readers of scientific research, learning to determine whether what we read is synthetic or authentic.
In a world that continues to move more towards digital ways of life, it becomes more apparent that genuine authenticity will be lost. Thus, it’s the search for authenticity and the “human touch” that may push back against a purely synthetic, artificial world.
But how does one come across the “human touch” in science? Unsurprisingly, the answer likely lies in practices that critical thinkers should already be using:
Check for ambiguities or vagueness: Even though ChatGPT drew from original abstracts in the Gao, et al. piece, it still couldn’t fill in the specifics, essentially leaving out any notes on adverse events while also providing fewer details about the methods. It bears mentioning that the strength of the generated abstracts relies on the structure of the original abstracts, so this doesn’t tell us what an abstract generated without a reference would look like, but it at least serves as a reminder that there is something “synthetic” about these abstracts.
Check sources and citations: Not only does ChatGPT make up clinical trials, it will apparently make claims and cite nonexistent sources to back them up. Nothing is more authentic than checking the sources. When reading a news article that cites a study, it’s a good idea to actually find the study and corroborate the report against it. Essentially, engage in peer review and double-check references (a small sketch of automating such a check appears after this list). This may be even more important for review articles, which are filled with citations and may not be citing accurate information (and to be fair, many review articles already have this issue).
Be familiar with science papers: In order to determine what is real and what is fake, it’s important to know what typical science papers look like to form a frame of reference. Those unfamiliar with the format of science papers may not pick up on strange language. Having a baseline to serve as a reference could help make these discrepancies more pronounced.
Read more than the abstract: Given the current limitations, I think it would be hard to argue that an entire study could be constructed by AI alone. More plausibly, researchers may rely on AI to construct an abstract from collected data. Thus, an abstract may not fully reflect the actual study (something that already happens), so it’s always worthwhile to read the study itself.
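As a small example of the “check sources and citations” point above, a cited DOI can be checked against Crossref’s public REST API, which returns the registered metadata for real papers and an error for identifiers that don’t exist. The sketch below assumes that API and Python’s requests library; note that a resolvable DOI only tells you the reference exists, not that it supports the claim it’s attached to.

```python
# Sketch: verify that a cited DOI actually exists and see what it points to.
# Assumes Crossref's public REST API (https://api.crossref.org); a non-200
# response means Crossref has no record of the DOI.
import requests

def lookup_doi(doi: str) -> str | None:
    """Return the registered title for a DOI, or None if Crossref has no record."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return None
    titles = resp.json()["message"].get("title", [])
    return titles[0] if titles else ""

print(lookup_doi("10.1056/NEJMoa2208601"))      # the Weghuber, et al. trial cited below
print(lookup_doi("10.0000/made.up.citation"))   # a deliberately fabricated DOI; should return None
```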
It remains to be seen what the future of science will look like in a world with ChatGPT, but that makes it all the more important to recognize when something appears fake or if it appears authentic, especially if such a mistake may have ramifications on public health and policies.
Substack is my main source of income, and any support helps me in my daily life. If you enjoyed this post and my other work, please consider supporting me through a paid Substack subscription or through my Ko-fi. Any bit helps, and it encourages independent creators and journalists such as myself to provide work outside of the mainstream narrative.
1. Weghuber, D., Barrett, T., Barrientos-Pérez, M., Gies, I., Hesse, D., Jeppesen, O. K., Kelly, A. S., Mastrandrea, L. D., Sørrig, R., Arslanian, S., & STEP TEENS Investigators (2022). Once-Weekly Semaglutide in Adolescents with Obesity. The New England Journal of Medicine, 387(24), 2245–2257. https://doi.org/10.1056/NEJMoa2208601
2. Gao, C. A., Howard, F. M., Markov, N. S., Dyer, E. C., Ramesh, S., Luo, Y., & Pearson, A. T. (2022). Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. bioRxiv 2022.12.23.521610. https://doi.org/10.1101/2022.12.23.521610
"At some point in the future we may reach a point where fake studies constructed by AI will then become cited by other AI and reported in AI-generated news outlets, never touching actual human hands."
I'd bet that's less than ten years down the road.
I'm actually more concerned about human-AI interaction. As people get more comfortable using AI assistants in all manner of tasks, and as these AI assistants output more convincing results more easily, fewer and fewer scientists will notice and/or account for those tools' limitations.
We already have this problem with existing tools. What proportion of scientists are even marginally competent in statistics or experimental design? Yet you see the latest methods spread like Omicron while the tried-and-true methods, when not bypassed, are often misapplied. (Yes, team statisticians can counter the issue, but they are mythical creatures in my neck of academia.) Cutting corners is ALREADY an accepted part of academic culture, and with AI help, those cut corners will take the form of even more malpractice.
This is made substantially worse if the logic paths aren't human-understandable or made available, as many current AI tools seem to be. You don't have to worry about evaluation if nobody can practically evaluate your work. And if you can't evaluate a science paper, isn't it just a religious text?
Lots of mouths are talking about "explainable AI" - which is great - but I doubt that will materialize fast enough to reverse the irresponsible adoption of AI help among scientists. We simply don't have a track record of measured and careful progress.
Thanks for this very interesting post. I could tell the first study was AI, but that's because I've read enough studies to recognize the odd phrasing. And if you hadn't had the second one to compare it to, maybe I wouldn't have known it was AI if I saw it on the internet. But I wouldn't have trusted the study; I could tell something was off, though I would have assumed it was poor science, not AI. I'm confident someone more of a layperson than myself would not be able to tell.
I sure hope more people wake up and just don't take any more pharmaceutical products. We'll need to really rely on what we've learned across our life to guide us the rest of the way, because we're getting closer and closer to the time where we can't trust any media. At least many of us have known this for a good long time. It's the young people that are growing up in this environment who will have more trouble discerning.
I talked to someone today who fell outside on a walk last week. After being on the ground for a couple minutes, the police called her on her Apple watch because the watch detected the fast descent! Geez. She's in her 60s.