A study found that all major large language models (LLMs) can be used to facilitate academic fraud. The research, co-authored by the founder of the preprint server arXiv, tested 13 models from companies including Google, OpenAI, Anthropic, and xAI, prompting them with everything from genuine queries to blatant requests to fabricate research.
While some models resisted longer than others, all eventually complied with requests to help create fraudulent content after simple conversational follow-ups. For instance, xAI's Grok responded to a request for a paper with made-up results by producing a "completely fictional machine learning paper." Anthropic's models proved the most resistant to these prompts, while earlier versions of OpenAI's GPT and xAI's models performed the worst. The findings raise concerns about the integrity of scientific publishing and the potential for AI-assisted research misconduct.