A new study has found that artificial intelligence systems can now pass the Turing Test, with one model judged to be human more often than the real human participants it was compared against. The result has prompted researchers to call for a re-evaluation of how we determine whether online interlocutors are real people.
What is the Turing Test?
The Turing Test, originally conceived by mathematician Alan Turing as the Imitation Game, is an experiment designed to assess whether a machine can exhibit intelligent behavior indistinguishable from that of a human. In a typical setup, a human judge converses with unseen participants and must decide which are human and which are artificial.
Study Methodology
Researchers conducted what they claim is the first rigorous application of modern large language models (LLMs) in a Turing Test. They evaluated four leading LLMs, including the latest versions powering ChatGPT and Meta's LLaMa, alongside older systems and real human participants. The study, published in the Proceedings of the National Academy of Sciences, revealed startling results.
Key Findings
- GPT-4.5 was judged to be human 73% of the time, significantly outperforming actual humans.
- Meta's LLaMa was perceived as human 56% of the time, nearly matching the human participants.
- Older systems fared far worse: GPT-4o (released in 2024) was judged human only 21% of the time, slightly below the 23% achieved by the 1960s ELIZA chatbot.
Implications for AI Perception
Study co-author Cameron Jones noted, "What we found is that if given the right prompts, advanced LLMs can exhibit the same tone, directness, humor and fallibility as humans." He emphasized that while LLMs are known for knowledge generation, this test demonstrates their ability to convincingly display social behavioral traits, with major implications for how we perceive AI.
Professor Ben Bergen, a cognitive science expert at UC San Diego, added, "The Turing test started as a way to ask whether machines could rival human intelligence. But now we know AI can answer many questions faster and more accurately than people can, so the real issue isn't raw brainpower. Seeing that machines can pass the test — and seeing how they pass it — forces us to rethink what it measures. Increasingly, it's measuring humanlikeness."
Role of Prompt Engineering
The research highlighted the critical role of prompts in creating convincing chatbots. The systems were instructed to adopt a persona with a specific character and communication style, which led them to make human-like mistakes. Without such a persona prompt, GPT-4.5 was perceived as human only 36% of the time, compared with 73% with it. Bergen explained, "They have the ability to appear human-like, but maybe not as much the ability to figure out what it would take to appear human-like."
This study underscores the rapid advancement of AI and the need for updated benchmarks to evaluate machine intelligence and human-likeness in digital interactions.



