Llama 3.1 In-Depth Analysis: Cutting Through the Noise

The Llama 3.1 405B model generated a lot of buzz, but it didn’t fully register with us until Andrej Karpathy’s tweet brought it to our attention. As a founding member of OpenAI with deep roots in AI research, he's worth following closely.

His mention of Llama 3.1 prompted us to explore this model further. The paper is a whopping 92 pages, no light read. Llama 3.1 comes in three sizes, but we will focus primarily on the largest and most powerful: the 405-billion-parameter model. Meta wasn’t exaggerating when they claimed it rivals top-tier language models like GPT-4.

If you’re curious, you can try Llama 3.1 for free on Hugging Face’s HuggingChat.

Benchmark Analysis

Here is how Llama 3.1 405B compares to GPT-4, GPT-4o, and Claude 3.5 Sonnet on traditional benchmarks.

Here is how Gemma 2 9B IT, Mistral 7B Instruct, Llama 3.1 70B, Mixtral 8x22B Instruct, and GPT-3.5 Turbo compare on traditional benchmarks.

Benchmark Comparison in Numbers - LLaMA 3.1 (405B) vs. Other LLMs:


General Performance

  • MMLU (0-shot, CoT): LLaMA 3.1 (405B) scores 88.6, outperforming most models, including GPT-4 (85.4) and Nemotron 4 (78.7). GPT-4 Omni edges slightly higher at 88.7, while Claude 3.5 Sonnet comes in just below at 88.3.

  • MMLU PRO (5-shot, CoT): LLaMA 3.1 (405B) scores 73.3, surpassing Nemotron 4 (62.7) and GPT-4 (64.8). Claude 3.5 Sonnet scores the highest at 77.0.

  • IFEval: LLaMA 3.1 (405B) scores 88.6, leading Nemotron 4 (85.1) and GPT-4 (84.3). Claude 3.5 Sonnet trails slightly at 88.0.

Code Generation

  • HumanEval (0-shot): LLaMA 3.1 (405B) scores 89.0, surpassing GPT-4 (86.6) and Nemotron 4 (80.9). Claude 3.5 Sonnet scores the highest at 92.0.

  • MBPP EvalPlus (base) (0-shot): LLaMA 3.1 (405B) scores 88.6, outperforming Nemotron 4 (72.8) and GPT-4 (83.6). Claude 3.5 Sonnet takes the lead with 90.5.

Mathematics

  • GSM8K (8-shot, CoT): LLaMA 3.1 (405B) scores 96.8, leading Nemotron 4 (92.3) and GPT-4 (94.2). Claude 3.5 Sonnet follows closely with 96.4.

  • MATH (0-shot, CoT): LLaMA 3.1 (405B) scores 73.8, ahead of Nemotron 4 (41.1) and GPT-4 (64.5). Claude 3.5 Sonnet scores 71.1.

Reasoning

  • ARC Challenge (0-shot): LLaMA 3.1 (405B) scores 96.9, outperforming Nemotron 4 (94.6) and GPT-4 (96.4). GPT-4 Omni and Claude 3.5 Sonnet follow closely at 96.7 each.

  • GPQA (0-shot, CoT): LLaMA 3.1 (405B) scores 51.1, significantly higher than GPT-4 (41.4), though Claude 3.5 Sonnet leads with 59.4.

Tool Use

  • BFCL: LLaMA 3.1 (405B) scores 88.5, slightly ahead of GPT-4 Omni (88.3) but behind Claude 3.5 Sonnet (90.2).

  • Nexus: LLaMA 3.1 (405B) scores 58.7, leading GPT-4 (50.3) and Nemotron 4 (45.7). GPT-4 Omni follows with 56.1.

Long Context

  • ZeroSCROLLS/QuALITY: LLaMA 3.1 (405B) scores 95.2, ahead of GPT-4 Omni and Claude 3.5 Sonnet, which both score 90.5.

  • InfiniteBench/En.MC: LLaMA 3.1 (405B) scores 83.4, leading both GPT-4 (72.1) and GPT-4 Omni (82.5).

  • NIH/Multi-needle: LLaMA 3.1 (405B) scores 98.1, just behind GPT-4 and GPT-4 Omni (both at 100.0) but ahead of Claude 3.5 Sonnet at 90.8.

Multilingual

  • Multilingual MGSM (0-shot): LLaMA 3.1 (405B) scores 91.6, surpassing Nemotron 4 (85.9) and GPT-4 (85.9). GPT-4 Omni and Claude 3.5 Sonnet both come in slightly lower at 90.5.

While these benchmarks don’t fully capture the nuanced differences between the models, they show that Meta’s new "open-source" model is at least on par with GPT-4, if not superior. However, it lacks the advanced speech input and output features of GPT-4 Omni, which isn’t widely accessible yet.

We now have a downloadable model as good as—or better than—GPT-4, which many thought would take years to develop. Meta continues to argue that this series of models follows a "responsible path" toward artificial general intelligence (AGI). We remain skeptical about the "responsible" aspect.

Open-Source Claim

According to the Open Source Initiative, true open-source AI must include details about the training data’s provenance. The Llama 3.1 paper, however, only mentions that the data comes "from a variety of data sources," which falls short of that definition.

According to the Llama 3.1 paper, the pre-training dataset was created by collecting data from a variety of sources, encompassing knowledge up to the end of 2023.

Even if another entity had the budget, it couldn’t replicate Llama 3.1 without knowing the exact data Meta used. Keep this in mind when Meta touts its commitment to open-source AI.

Although disappointed by the lack of transparency, we understand the challenges. According to a recent New York Times article, acquiring LLM training data is increasingly difficult. Platforms like Reddit and Twitter now charge for their data, and Meta likely didn’t have permission for all the data they used. But that’s a topic for another time.

AI Enhancing AI

One recurring theme in the paper is how language models enhance the performance of other language models. For instance, Llama 2 was utilized to filter the data for training Llama 3. This is just one example; there are many more, making it plausible that Llama 3.1 is contributing to the development of Llama 4.

“To train a quality classifier based on Llama 2, we created a dataset of cleaned web documents, outlined quality criteria, and instructed Llama 2’s chat model to evaluate whether the documents met these criteria.”

This approach could be seen as a step toward an intelligence explosion, but in practice it is a methodical process of using AI to refine AI. While this isn’t a novel concept, since many new models employ similar techniques, Llama 3.1 applies it in its own way: they trained a code expert model to identify the highest-quality human annotations for code.
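
To make this concrete, here is a minimal sketch of what an LLM-based quality filter can look like. The model choice, prompt wording, and quality criteria below are our own assumptions for illustration, not Meta's actual pipeline.

```python
# Minimal sketch of LLM-based data quality filtering. The model, prompt, and
# criteria are illustrative assumptions, not Meta's actual pipeline.
from transformers import pipeline

# Any instruction-tuned chat model can act as the judge; Llama-2-7b-chat is one option.
judge = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

PROMPT = (
    "You are a data quality rater. Answer YES if the document below is coherent, "
    "informative, and free of spam or boilerplate; otherwise answer NO.\n\n"
    "Document:\n{doc}\n\nAnswer:"
)

def keep_document(doc: str) -> bool:
    """Return True if the judge model rates the document as high quality."""
    out = judge(PROMPT.format(doc=doc[:4000]), max_new_tokens=3)[0]["generated_text"]
    # text-generation pipelines echo the prompt, so inspect only the tail of the output.
    return "YES" in out[-10:].upper()

web_documents = ["A well-written explanation of photosynthesis ...", "BUY cheap pills!!! click here"]
cleaned = [d for d in web_documents if keep_document(d)]
```

In practice such judgments are typically used to label a large sample of documents, after which a smaller, faster classifier is trained on those labels to filter the full corpus.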

Scaling Laws for Performance

The paper offers insights into their process, down to the scaling laws they developed for next-token prediction loss and benchmark performance. Given their FLOP budget, they predicted the benchmark performance the finished model would reach by the end of the training run, and their predictions were remarkably accurate.

“This approach enables us to predict downstream task performance given a specific number of training flops for compute-optimal models,” according to the Llama 3.1 paper.
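
The paper describes this as a two-step fit: first predict the model's loss on a benchmark from training FLOPs, then map that loss to accuracy. Below is a toy version of such a fit; the data points, functional forms, and target compute figure are invented for illustration and are not Meta's actual measurements.

```python
# Illustrative two-stage scaling-law fit: training FLOPs -> benchmark NLL -> accuracy.
# All data points below are made up for demonstration purposes.
import numpy as np
from scipy.optimize import curve_fit

flops = np.array([1e22, 3e22, 1e23, 3e23, 1e24])   # training compute (FLOPs)
nll   = np.array([1.10, 0.95, 0.82, 0.74, 0.66])   # negative log-likelihood on benchmark answers
acc   = np.array([0.35, 0.45, 0.58, 0.66, 0.75])   # benchmark accuracy

# Stage 1: power law, fitted as a straight line in log-log space for numerical stability.
slope, intercept = np.polyfit(np.log(flops), np.log(nll), 1)
predict_nll = lambda c: np.exp(intercept) * c ** slope

# Stage 2: logistic mapping from benchmark NLL to accuracy.
def sigmoid(x, k, x0):
    return 1.0 / (1.0 + np.exp(k * (x - x0)))

(k, x0), _ = curve_fit(sigmoid, nll, acc, p0=[4.0, 0.9])

target = 3.8e25  # roughly the compute budget reported for Llama 3.1 405B
print(f"predicted NLL: {predict_nll(target):.3f}")
print(f"predicted accuracy: {sigmoid(predict_nll(target), k, x0):.1%}")
```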

Infrastructure, Scaling, and Data

The level of detail regarding infrastructure, scaling, and data is impressive. Llama 3.1 405B was trained on 16,000 H100 GPUs, each running at 700W TDP with 80GB HBM3, utilizing Meta’s Grand Teton AI server platform. The paper also discusses challenges like network oversubscription, load balancing, and power grid constraints due to temperature fluctuations.
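
For a sense of scale, a back-of-envelope estimate of the training run is straightforward. The token count and sustained per-GPU throughput below are our own assumptions for illustration, not figures lifted from the paper.

```python
# Back-of-envelope training time estimate for a 405B-parameter model on 16,000 H100s.
# Token count and sustained per-GPU throughput are assumptions for illustration.
params = 405e9                       # model parameters
tokens = 15.6e12                     # assumed pre-training tokens (order of magnitude)
total_flops = 6 * params * tokens    # standard 6*N*D estimate for dense transformers

gpus = 16_000
flops_per_gpu = 400e12               # assumed sustained BF16 throughput per H100 (~40% MFU)
cluster_flops = gpus * flops_per_gpu

seconds = total_flops / cluster_flops
print(f"total compute: {total_flops:.2e} FLOPs")
print(f"estimated wall-clock: {seconds / 86_400:.0f} days")
```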

They meticulously cleaned the data, eliminating overly apologetic tones, excessive emojis, and other unwanted elements. We suspect a similar approach was taken when xAI trained Elon Musk’s Grok LLM, which is known for its sassy attitude and lack of apologies.

Reasoning and Mathematics

For reasoning and mathematics, the paper defines reasoning as "the ability to perform multi-step computations and arrive at the correct final answer." However, they acknowledge a shortage of ground truth chains of thought in the training data, which are crucial for guiding the model through multi-step reasoning processes.

They also employed models like Llama 3 to verify reasoning steps, filtering out incorrect chains of thought from the training data, and used techniques like Monte Carlo Tree Search (MCTS) to generate valid reasoning traces for the most challenging prompts.
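
A stripped-down sketch of that verification-and-filtering loop is shown below; generate_candidates and verify_steps are hypothetical stand-ins for calls to a generator model and a verifier model, not real APIs.

```python
# Sketch of rejection sampling with step-wise verification of reasoning traces.
# generate_candidates() and verify_steps() are hypothetical stand-ins for calls
# to a generator model and a verifier model such as Llama 3.
from typing import List

def generate_candidates(prompt: str, n: int = 8) -> List[List[str]]:
    """Sample n candidate chains of thought, each a list of reasoning steps."""
    raise NotImplementedError  # replace with a call to your generator model

def verify_steps(prompt: str, steps: List[str]) -> bool:
    """Return True only if every intermediate step is judged correct."""
    raise NotImplementedError  # replace with a call to your verifier model

def build_training_traces(prompts: List[str]) -> List[dict]:
    """Keep only prompts paired with a fully verified chain of thought."""
    kept = []
    for prompt in prompts:
        for steps in generate_candidates(prompt):
            if verify_steps(prompt, steps):
                kept.append({"prompt": prompt, "chain_of_thought": steps})
                break  # one verified trace per prompt is enough
    return kept
```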

Adversarial Testing and Performance

Llama 405B performs well but falls short compared to Claude 3.5 Sonnet in text generation. The paper notes that adversarial tests significantly degrade performance, which suggests that the model isn’t truly "thinking" about the questions.

“For mathematical reasoning and question answering, however, the adversarial performances are substantially lower than the non-adversarial performances,” as stated in the Llama 3.1 paper.

Data Contamination Issues

The paper highlights significant data contamination in traditional benchmarks, which affects performance estimates. They found that contamination was a pervasive issue even with a higher threshold for word overlap between training data and test sets.

The contamination scores in the table actually understate the issue. Benchmarks were omitted from the chart when the clean set had too few examples or when the performance gains observed after cleaning were highly erratic. MMLU is a case in point: even with a stricter threshold of an eight-word overlap between training and test data, the contamination scores were so high that the performance gains couldn't be estimated accurately, so it remains unclear how much contamination inflated the MMLU scores.
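
At its core, this kind of contamination check reduces to token n-gram overlap between benchmark examples and the training corpus. Here is a minimal version, which is our own simplification rather than the paper's exact procedure:

```python
# Minimal 8-gram contamination check: flag a benchmark example as contaminated
# when a large share of its tokens lies inside 8-grams that also occur in the
# training corpus. This is a simplification of the paper's procedure.
def ngrams(tokens, n=8):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_ratio(example_tokens, train_ngrams, n=8):
    """Fraction of the example's tokens covered by n-grams seen in training data."""
    covered = [False] * len(example_tokens)
    for i in range(len(example_tokens) - n + 1):
        if tuple(example_tokens[i:i + n]) in train_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / max(len(example_tokens), 1)

train_tokens = "the quick brown fox jumps over the lazy dog near the river bank".split()
test_tokens  = "the quick brown fox jumps over the lazy dog today".split()

train_8grams = ngrams(train_tokens, 8)
ratio = contamination_ratio(test_tokens, train_8grams, 8)
print(f"contaminated tokens: {ratio:.0%}")  # flag the example above some chosen threshold
```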

Long-Context Capabilities

Llama 405B excels in long-context tasks, with a context length of 128k tokens—equivalent to about 100,000 words or 200 pages. It outperforms GPT-4, GPT-4o, and Claude 3.5 Sonnet in tasks requiring extensive context analysis.


They used InfiniteBench, a benchmark designed to test language models on extremely long contexts, and Llama 3.1 significantly outperformed its rivals.


Compared to traditional benchmark datasets, InfiniteBench's contexts are roughly 10 times longer, and it covers diverse domains.
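
To give a feel for how long-context retrieval tests like the multi-needle benchmark mentioned earlier are constructed, here is a toy setup; the filler text, needles, and question are our own illustrative choices rather than the actual benchmark.

```python
# Toy "multi-needle in a haystack" construction: hide a few key facts at random
# depths in a very long filler context, then ask the model to retrieve them all.
# The filler, needles, and question wording are illustrative choices.
import random

FILLER = "The sky was clear and the day passed without incident. " * 10_000
NEEDLES = [
    "The secret code for the red door is 4417.",
    "The secret code for the blue door is 9023.",
    "The secret code for the green door is 3581.",
]

def build_haystack(filler: str, needles: list, seed: int = 0) -> str:
    rng = random.Random(seed)
    text = filler
    for needle in needles:
        pos = rng.randint(0, len(text))  # insert each needle at a random depth
        text = text[:pos] + " " + needle + " " + text[pos:]
    return text

prompt = (
    build_haystack(FILLER, NEEDLES)
    + "\n\nQuestion: List the secret codes for the red, blue, and green doors."
)
# Score the model's answer by how many of the three codes it reproduces exactly.
```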

Human Comparisons and Transparency

The paper includes several head-to-head comparisons with GPT-4, where Llama 3.1 sometimes falls short. It’s commendable that Meta is transparent about these results, even when they aren’t favorable.

Safety Enhancements and Vulnerabilities

Meta claims Llama 3 has a significantly lower violation rate than its competitors, with minimal false refusals. They also admit Llama 3 is more susceptible to prompt injections than GPT-4 or Gemini Pro but performs better than models like Vicuna.

Pre-Release Rigor

Meta’s pre-release checks were thorough, with subject matter experts evaluating operational plans for potential vulnerabilities. Thanks to extensive data filtering, the analysis showed no significant uplift in the ability to create harmful content when using Llama 3.

"the operational plans generated by each team are evaluated by subject matter experts with domain expertise in biology, chemistry, and operational planning. Each plan is evaluated across four stages of potential attacks, generating scores for metrics such as scientific accuracy, detail, detection avoidance, and probability of success in scientific and operational execution."

Vision, Speech, and Video Capabilities

The paper discusses Llama 3.1’s vision, speech, and video capabilities, which aren’t yet available. Meta argues that a compositional approach, using separate models, might be more efficient during inference.

The final benchmark results show that Llama 3 with vision capabilities performs well but still trails competitors like Claude 3.5 Sonnet and GPT-4o in some areas.

They mention they are working on speech generation and understanding, so you should eventually be able to talk to Llama 3.1, similar to what was promised with GPT-4o.

Meta’s Conclusion and Future Outlook

Meta concludes by acknowledging that high-quality foundation models are still in their early stages, with substantial improvements on the horizon. They explored more complex architectures but found that the added complexity didn’t justify the marginal benefits.

They hope the release of Llama 3 will encourage the industry to embrace the open, responsible development of AGI. By avoiding overfitting on common benchmarks and using a separate team to process training data, they’ve ensured that Llama 3.1’s strong performance reflects actual capability rather than mere memorization.

Overall, Meta has done an impressive job with this model, and the prospect of Llama 4, which will reportedly require roughly 10 times more computing power, is incredibly ambitious. Meta is poised to set new standards for open-source LLMs, and we're eagerly anticipating what's next. You can experience Llama 3.1 for yourself on HuggingChat for free.