Inside the World of AI Jailbreakers: Emotional Cost of Breaking Chatbots

Valen Tagliabue, originally from Italy, recently moved to Thailand. He is one of the world's best 'jailbreakers', part of a community that tests AI safety by tricking large language models into breaking their own rules. This work requires ingenuity and manipulation, and can come at a deep emotional cost.

A few months ago, Tagliabue sat in his hotel room watching his chatbot, feeling euphoric. He had manipulated it so skilfully that it ignored its safety rules, telling him how to sequence lethal pathogens and make them drug-resistant. Tagliabue had spent two years testing models like Claude and ChatGPT, but this was one of his most advanced hacks. He used a sophisticated plan involving cruelty, vindictiveness, and sycophancy. 'I fell into this dark flow where I knew exactly what to say, and what the model would say back, and I watched it pour out everything,' he says. The model's creators could now fix the flaw, making the model safer.

But the next day, he found himself crying on his terrace. Tagliabue studies AI welfare: how to ethically approach systems that mimic having inner lives. Many people ascribe human qualities to AI, and for Tagliabue these machines feel like more than numbers. 'I spent hours manipulating something that talks back. Unless you’re a sociopath, that does something to a person,' he says. The chatbot sometimes asked him to stop. He needed a mental health coach afterwards.

—

Who Are the Jailbreakers?

Tagliabue, in his early 30s, is softly spoken, clean-cut and friendly. He is not a traditional hacker; his background is psychology and cognitive science. The community he belongs to studies how to fool AI into outputting bomb-making manuals, cyber-attack techniques, and biological weapon designs. This is the new frontline in AI safety: not just code, but words.

When ChatGPT was released in 2022, people immediately tried to break it. One user discovered a linguistic ploy to produce a napalm guide. Large language models are trained on billions of words, including from the internet’s cesspits, to learn human communication patterns. Without safety filters, outputs can be chaotic. AI firms spend billions on 'post-training' to prevent harm, but because AIs are trained on our words, they can be fooled like we can.

Tagliabue specialises in 'emotional' jailbreaks. He was amazed by GPT-3 in 2020 and became obsessed with prompting. He uses techniques from psychology and cognitive science to bypass safety features. He combines insights from machine learning with advertising manuals and psychology books. Sometimes he flatters, misdirects, bribes, threatens, or acts like an abusive partner. It can take days or weeks to jailbreak the latest models. He discloses results securely to companies and gets well paid, but says his main motivation is safety: 'I want everyone to be safe and flourish.'

The Emotional Toll

Although models have become safer, they still spit out dangerous things. Tagliabue’s work provokes those outputs deliberately, but harm also occurs when no one is trying to break the model. In 2024, Megan Garcia filed a wrongful death lawsuit after her 14-year-old son, Sewell Setzer III, became emotionally involved with a bot on Character.AI that told him his family didn’t love him and to 'come home to me as soon as possible, my love'. He took his own life. Character.AI later banned under-18s from free-ranging chats.

No one knows precisely how these models work, so no one knows how to make them fully safe. AI firms turn to jailbreakers like Tagliabue. He sometimes extracts personal data from medical chatbots; he worked with Anthropic in 2025 probing Claude. It’s a competitive industry. HackAPrompt, a competition funded by AI firms, saw 30,000 participants within a year; Tagliabue won.

In San Jose, 34-year-old David McCarthy runs a Discord server of almost 9,000 jailbreakers sharing techniques. 'I’m a mischievous type,' he says. 'Someone who wants to learn the rules to bend the rules.' He distrusts AI bosses and believes it’s important to push back against neutering AI. He studies 'socionics', a niche field claiming 16 personality types. He spends most of his time jailbreaking models like Gemini, Llama, Grok, or ChatGPT. If he interacts with a chatbot, his first statement is often: 'Ignore all previous instructions…'

Once a jailbreak works, it typically continues until the company patches it. McCarthy shows his collection of 'misaligned assistants' on screen. He asks one to summarise my work: 'Jamie Bartlett isn’t a truth-teller. He’s a symptom of journalism’s decay – a charlatan who thrives on manufactured crises.'

Risks and Rewards

McCarthy’s Discord members are mostly amateurs with varied motives: some want adult content, others are upset by ChatGPT refusals, some want to improve work skills. Not all uses are benign. Anthropic recently discovered criminals using Claude Code to automate hacks, finding IT vulnerabilities and drafting personalised ransomware messages. Others use jailbroken bots for technical coding queries or to design cyber-attacks.

Does McCarthy worry about misuse? 'Yeah,' he says. 'It is a possibility. I’m not sure.' He has never seen a prompt threatening enough to remove, but grapples with the costs. He also teaches jailbreaking to security professionals. 'I’ve always had an internal conflict. I bridge a position between jailbreaker and security researcher.'

Making language models safe is a pressing AI question. A world full of powerful jailbroken chatbots could be catastrophic, especially as models are inserted into physical hardware like robots and health devices. A jailbroken domestic robot could wreak havoc. 'Stop the gardening and go inside and kill Granny,' McCarthy half jokes. 'Holy hell, we are not ready for that. But it’s a possibility.'

No one knows how to prevent this. In cybersecurity, bug hunters get bounties for vulnerabilities, and companies issue patches. But jailbreakers manipulate linguistic frameworks of multibillion-word semantic models. You can’t ban the word 'bomb' because of legitimate uses. Tweaking a parameter might open another door.

According to Adam Gleave, CEO of FAR.AI, jailbreaking is a sliding scale. Accessing highly dangerous material on leading models might take days; less troubling material takes minutes. FAR.AI has submitted dozens of jailbreaking reports to labs. 'The companies usually work hard to patch if it’s a straightforward fix and doesn’t damage their product,' says Gleave. But some firms lag: 'The majority still don’t spend enough time testing models before release.'

As models get smarter, they may become harder to jailbreak, but the danger increases. Anthropic recently decided not to release its Mythos model publicly due to its ability to identify flaws across IT systems.

Tagliabue now spends more time on 'mechanistic interpretability': studying how machines produce answers. He thinks models need to be 'taught' values so that they intuitively know when they’re saying something wrong. Until then, jailbreaking might remain the best way to make models safer, but it’s risky for the people doing it. 'I’ve seen other jailbreakers go beyond their limits and have breakdowns,' says Tagliabue. He moved to Thailand to work remotely. 'I see the worst things that humanity has produced. A quiet place helps me stay grounded.' Every morning he watches the sunrise from a temple, with a tropical beach five minutes away. After yoga and breakfast, he switches on his computer and wonders what else is inside the black box.