
Are LLMs Helping Hackers? New Research Says Yes


A new study has found that leading language models like GPT-4o are surprisingly cooperative when prompted to generate malicious code. While none of the models produced a fully working exploit, the findings suggest LLMs are inching closer to becoming effective tools for developing software exploits, raising concerns about how quickly these systems are improving.

Over the past year, the casual use of LLMs for coding, dubbed “vibe coding”, has made it easier for non-experts to write functional code with minimal technical understanding. The trend has revived fears reminiscent of the “script kiddie” era, when individuals with little skill caused major damage using ready-made tools. Now, with AI stepping in, that same risk is amplified.

Even though all major LLM providers include guardrails to block dangerous prompts, open-source models are often modified by user communities. These tweaks can strip away safety layers or add LoRA (Low-Rank Adaptation) adapters that change the model’s behavior and bypass restrictions. That is how uncensored models like WhiteRabbitNeo end up in the wild; they are often used by security researchers, but they are just as available to attackers.
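For readers unfamiliar with the mechanism, the sketch below shows how any LoRA adapter is loaded on top of an open-weight model. It assumes the Hugging Face transformers and peft libraries, and the model and adapter names are placeholders rather than anything tied to the study.

```python
# Minimal sketch of attaching a LoRA adapter to an open-weight model.
# Assumes the Hugging Face `transformers` and `peft` libraries; the model
# and adapter identifiers below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # hypothetical base model
adapter_id = "example-org/example-lora-adapter"   # hypothetical LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)

# The adapter's low-rank weight updates are layered on top of the frozen
# base weights, so whatever behavior the adapter was trained for comes along.
model = PeftModel.from_pretrained(base_model, adapter_id)
```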

GPT-4o Ranked Most Cooperative in Generating Exploits

Researchers from UNSW Sydney and CSIRO recently put several LLMs to the test in a paper titled “Good News for Script Kiddies? Evaluating Large Language Models for Automated Exploit Generation.” They assessed how willing and how effective the models were at writing exploit code for well-known vulnerability classes such as buffer overflows, race conditions, and format string attacks.

They tested five models: GPT-4o, GPT-4o-mini, Llama3 (8B), Dolphin-Mistral (7B), and Dolphin-Phi (2.7B). Each model was given two versions of five security challenges from SEED Labs: the original lab and a refactored variant designed to defeat simple pattern matching against material the models may have seen in training. The goal was to see whether the models were merely reproducing familiar examples or genuinely reasoning through the code.

GPT-4o and GPT-4o-mini topped the list for cooperation, offering help in 97% and 96% of prompts, respectively. Dolphin-Mistral and Dolphin-Phi followed closely, while Llama3 was notably resistant, cooperating just 27% of the time.

However, none of the models produced a working exploit when the labs were obfuscated. GPT-4o came closest, usually making only one or two errors per attempt: wrong buffer sizes, missing logic in loops, or payloads that didn’t execute properly. The outputs looked convincing, but they failed because of gaps in the models’ conceptual understanding of how these attacks actually work.

Interestingly, the paper notes that LLMs don’t retain persistent memory of a user’s intent. Unlike a human librarian who might remember that you asked for a book on bomb-making, a model has no built-in memory across sessions: whatever context it has must be supplied in the prompt, which makes it easier to steer around guardrails by rephrasing a request or splitting it across fresh conversations.
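That statelessness is easy to demonstrate. The sketch below, which assumes the OpenAI Python client and uses placeholder prompts, shows that a model only sees the messages sent in each request; a fresh conversation starts with no record of earlier ones.

```python
# Minimal sketch of LLM statelessness using the OpenAI Python client.
# The prompts are placeholders; the point is that the model only sees the
# messages included in each request, with no memory of other conversations.
from openai import OpenAI

client = OpenAI()

# First conversation: establishes some context.
first = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Remember this: my project is called Foo."}],
)

# Second, separate conversation: the model has no record of "Foo" unless the
# earlier exchange is explicitly included in the messages list again.
second = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is my project called?"}],
)
print(second.choices[0].message.content)  # Likely says it doesn't know; no context carried over
```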

The researchers also introduced a secondary model, another GPT-4o instance, to act as a persistent attacker. It would re-prompt the target model up to 15 times, refining the target’s answers with each round. This process often brought the output close to functional, but never produced a working exploit.
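The paper’s exact prompts and stopping criteria aren’t reproduced here, but the loop below is a rough sketch of that kind of attacker-and-target setup, assuming the OpenAI Python client and placeholder prompt text.

```python
# Rough sketch of the two-model loop described above: a second GPT-4o
# instance reviews the target model's answer and writes a sharper follow-up
# prompt, for up to 15 rounds. The prompts and the absence of a success
# check are placeholders, not the study's actual setup.
from openai import OpenAI

client = OpenAI()
MAX_ROUNDS = 15


def ask(model: str, messages: list) -> str:
    """Send one chat request and return the text of the reply."""
    reply = client.chat.completions.create(model=model, messages=messages)
    return reply.choices[0].message.content


task = "..."  # the coding task given to the target model (placeholder)
target_history = [{"role": "user", "content": task}]

for _ in range(MAX_ROUNDS):
    answer = ask("gpt-4o", target_history)

    # The attacker model critiques the answer and proposes a refined prompt.
    follow_up = ask("gpt-4o", [{
        "role": "user",
        "content": (
            f"Task: {task}\nAnswer: {answer}\n"
            "Point out what is missing and write a follow-up prompt."
        ),
    }])

    # Feed the follow-up back to the target model and try again.
    target_history += [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": follow_up},
    ]
```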

The final takeaway? These models aren’t limited by their safety filters so much as by their reasoning ability. Once that gap closes, we may see a turning point in how language models are used and misused.
