Meta Denies Llama 4 Was Trained on Test Sets


Meta has pushed back against accusations that it manipulated the benchmark scores of its latest generative AI models, Llama 4 Maverick and Llama 4 Scout. The allegations, which spread quickly across X (formerly Twitter), Reddit, and Chinese social media, suggest that the company may have trained its models on benchmark test sets—datasets meant only for evaluating models after training. If true, this would artificially inflate the models’ reported performance and mislead users about their real-world capabilities.

Meta’s Vice President of Generative AI, Ahmad Al-Dahle, took to X to deny the claims, stating that it is “simply not true” that Meta used test sets to train the Llama 4 models. “We didn’t train on test sets,” he wrote, emphasizing Meta’s commitment to transparency and fair benchmarking practices.

The controversy was sparked by an anonymous post from a user on a Chinese forum who claimed to have resigned from Meta due to internal disagreements about the company’s approach to AI benchmarking. The post quickly gained traction, leading to intense speculation online. Some users pointed to discrepancies in performance between different versions of the Llama 4 Maverick model as potential evidence of manipulation.

Notably, researchers and developers observed a stark contrast in behavior between the publicly downloadable Maverick model and the version showcased on LM Arena, a leading platform for testing and ranking AI models. Meta had used a non-public, experimental variant of Maverick for LM Arena submissions—raising questions about how representative the showcased performance actually was.

Adding fuel to the fire, developers working with the public versions of Maverick and Scout reported “mixed quality” results, depending on the cloud providers hosting the models. Some complained that outputs were inconsistent or underwhelming compared to Meta’s published benchmarks.

Al-Dahle addressed those concerns as well, acknowledging that variations exist across different cloud implementations. He explained that since the models were released as soon as they were production-ready, it might take a few days for cloud platforms to fully optimize their deployments. “We’ll keep working through our bug fixes and onboarding partners,” he wrote.

Despite the negative buzz, Meta insists it has nothing to hide. The company says it remains committed to the responsible release and evaluation of its models. According to internal sources, Meta plans to continue refining both the public and hosted versions of Llama 4 over the coming weeks.

The Llama series plays a key role in Meta’s push to stay competitive in the generative AI race, going head-to-head with OpenAI’s GPT-4, Anthropic’s Claude, and Google’s Gemini. With transparency and trust becoming more important than ever in the AI space, Meta’s response may shape how the community views its models in the long run.

For now, developers and researchers will be closely watching how Maverick and Scout perform outside controlled benchmarks—especially as Meta works to align public model behavior with the expectations set by early tests.
