LLM Benchmarks Explained: From Needle in a Haystack to SWE-bench

A complete guide to understanding how LLMs are tested. Learn what Needle in a Haystack, BrowseComp, Oolong, LongBench v2, MMLU, SWE-bench, and other benchmarks actually measure—and why it matters.
When a company claims its LLM has a 1 million token context window, how do you know whether that window actually works? When GPT-5 is reported to score 88% on MMLU, what does that number mean? Benchmarks are the answer: standardized tests that measure what LLMs can actually do. But understanding what they test and why it matters isn’t straightforward.

This guide breaks down the major LLM benchmarks, from long-context evaluations to coding tests, so you can interpret model comparisons with clarity.

Why Benchmarks Matter

Benchmarks serve four critical purposes:

  1. Comparison — Objective measurement between different models
  2. Progress tracking — How the field advances over time
  3. Capability discovery — What LLMs can and cannot do well
  4. Development guidance — Where researchers should focus improvements

Without benchmarks, claims about model performance would be subjective and unverifiable. With them, we can say definitively that Gemini 1.5 Pro retrieves information at 99.7% accuracy up to 1M tokens, or that even frontier models score below 50% on Oolong at 128K context.

Long-Context Benchmarks

These test whether models can actually use their advertised context windows.

Needle in a Haystack

What it tests: Finding specific information buried in large amounts of text.

How it works:

  • A specific piece of information (the “needle”) is embedded within irrelevant text (the “haystack”)
  • The original test hid the statement “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day” within Paul Graham essays
  • The model must retrieve this information when asked

Key variables:

  • Context length — From 1K tokens up to 1M+ tokens
  • Needle depth — Beginning, middle, or end of document
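
To make this concrete, here is a minimal sketch of how a needle-in-a-haystack test case can be built. The filler text, question, and model call are illustrative placeholders, not the original evaluation harness.

```python
# Minimal sketch of a needle-in-a-haystack test case (illustrative, not the
# original harness). A known fact is planted at a chosen depth in filler text,
# then the model is asked to retrieve it.

NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day. ")
FILLER = "The rain kept falling on the quiet town. " * 5000  # stand-in haystack

def build_prompt(filler: str, needle: str, depth: float) -> str:
    """Insert the needle at `depth` (0.0 = start, 1.0 = end) of the filler."""
    cut = int(len(filler) * depth)
    context = filler[:cut] + needle + filler[cut:]
    question = "What is the best thing to do in San Francisco?"
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# Sweep needle depths to look for "lost in the middle" behaviour.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_prompt(FILLER, NEEDLE, depth)
    # answer = llm.generate(prompt)            # hypothetical model call
    # correct = "Dolores Park" in answer
```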

Why it matters: A model might advertise 128K context, but if it can’t retrieve information from the middle, that context window is marketing, not capability. Many models suffer from the “lost in the middle” problem—degraded recall for information placed in the center of long documents.

Performance: Gemini 1.5 Pro achieves >99.7% recall up to 1M tokens across text, video, and audio modalities.

Oolong

What it tests: Long-context reasoning requiring both individual analysis AND aggregation across the entire context.

How it works:

  • OOLONG-synth — Synthetic tasks where reasoning components can be isolated
  • OOLONG-real — Real Dungeons & Dragons episode transcripts (50K-175K tokens each, up to 1.3M for multi-episode)

Tasks require:

  • Classifying individual items
  • Counting occurrences
  • Temporal reasoning
  • Aggregating results across the full context

Key insight: Each individual subtask is easy. The challenge is breaking down the problem and correctly aggregating all results.
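
As an illustration of why aggregation is the hard part, here is a toy sketch of an OOLONG-style query over a synthetic transcript. The speakers, labels, and classifier are made up; the point is that each per-line judgment is trivial, while the benchmark question is about the total across the whole context.

```python
# Toy OOLONG-style aggregation task (synthetic data, not the real benchmark).
# Classifying any single line is easy; the benchmark asks for an aggregate
# over the entire 100K+ token transcript, which the model must produce from
# raw text alone.
from collections import Counter

transcript = [
    ("Alice", "I attack the goblin with my sword."),
    ("Bob",   "Let's ask the innkeeper about the old map."),
    ("Alice", "I roll to persuade the guard to let us pass."),
    # ... tens of thousands more lines in a real episode
]

def classify(line: str) -> str:
    """Toy per-line classifier standing in for an easy LLM judgment."""
    if "attack" in line or "sword" in line:
        return "combat"
    if "roll" in line:
        return "skill_check"
    return "dialogue"

# The benchmark question targets the aggregate, e.g.
# "How many combat actions does Alice take across the campaign?"
gold = Counter(classify(text) for speaker, text in transcript if speaker == "Alice")
print(gold["combat"])  # programmatic ground truth the model must reproduce
```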

Why it matters: Many real-world tasks require summarizing patterns across large amounts of data—not just finding a single fact. Oolong tests this aggregation capability.

Performance: Even GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro achieve less than 50% accuracy at 128K context. The bottleneck is aggregation, not individual labeling.

LongBench v2

What it tests: Deep understanding and reasoning across realistic long-context tasks.

How it works:

  • 503 challenging multiple-choice questions
  • Context lengths from 8K to 2M words
  • Six task categories:
    1. Single-document QA
    2. Multi-document QA
    3. Long in-context learning
    4. Long-dialogue history understanding
    5. Code repository understanding
    6. Long structured data understanding
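
A LongBench v2 item can be pictured as a simple record: a long context, one question, four options, and a gold label. The field names below are illustrative rather than the official schema; scoring is plain multiple-choice accuracy.

```python
# Illustrative shape of a LongBench v2 style item and its scoring
# (field names are placeholders, not the official schema).
from dataclasses import dataclass

@dataclass
class LongContextItem:
    category: str        # e.g. "multi_document_qa" or "code_repository"
    context: str         # 8K to 2M words of source material
    question: str
    choices: list[str]   # four options, labelled A-D
    answer: str          # gold label, e.g. "C"

def accuracy(predictions: list[str], items: list[LongContextItem]) -> float:
    """Fraction of items where the predicted letter matches the gold label."""
    correct = sum(pred == item.answer for pred, item in zip(predictions, items))
    return correct / len(items)
```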

Quality bar: Data collected from ~100 highly educated individuals. Questions validated so that human experts with search tools achieve only 53.7% accuracy under 15-minute constraints.

Why it matters: Unlike simple retrieval, LongBench v2 requires genuine understanding and reasoning.

Performance: Best models achieve 50.1% with direct answering. o1-preview with extended reasoning hits 57.7%, barely surpassing the human baseline.

Agentic Benchmarks

These test models operating as autonomous agents.

BrowseComp (Browsing Competition)

What it tests: Finding hard-to-find, entangled information through web browsing.

How it works:

  • 1,266 questions requiring persistent web navigation
  • Questions designed so humans can’t solve them within 10 minutes
  • Existing models (GPT-4o, ChatGPT with browsing) fail on them
  • Answers are short and easily verifiable

Design principle: “asymmetry of verification.” Questions that are hard to solve but easy to verify make for a reliable benchmark.
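
That asymmetry is easy to see in code: an agent may need dozens of browsing steps to find the answer, but grading reduces to comparing a short string against the reference. A rough sketch (not OpenAI's actual grader):

```python
# Sketch of why short, well-specified answers make grading reliable
# (illustrative normalization, not the official BrowseComp grader).

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial variations match."""
    return " ".join(text.lower().strip().rstrip(".").split())

def is_correct(model_answer: str, reference: str) -> bool:
    return normalize(model_answer) == normalize(reference)

# Finding the answer might take an agent dozens of searches and page visits;
# verifying it is a single comparison.
print(is_correct("  The Eiffel Tower. ", "the eiffel tower"))  # True
```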

Why it matters: Real web research involves backtracking, reformulating queries, and synthesizing information from multiple sources. Simple QA benchmarks are already saturated.

Performance:

  • GPT-4o: ~0.6% accuracy
  • GPT-4o with browsing: ~1.9% accuracy
  • OpenAI Deep Research: ~50% accuracy

Browsing alone isn’t enough—models need strategic reasoning about how to search.

General Knowledge & Reasoning Benchmarks

MMLU (Massive Multitask Language Understanding)

What it tests: Knowledge and problem-solving across 57 academic subjects.

How it works:

  • 15,908 multiple-choice questions (4 choices each)
  • Subjects from STEM to humanities to professional fields
  • Zero-shot or few-shot settings—model must answer without task-specific fine-tuning
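
In practice, evaluation harnesses wrap each question in a simple prompt like the sketch below and compare the model's predicted letter to the gold label. The subject, shots, and question here are made-up examples, not items from the actual test set.

```python
# Sketch of a few-shot MMLU-style prompt (made-up questions, not real test
# items). The model sees solved examples from the subject, then must answer
# a new question with a single letter A-D.

FEW_SHOT = [
    ("Which planet is known as the Red Planet?\n"
     "A. Venus\nB. Mars\nC. Jupiter\nD. Mercury", "B"),
]

def build_prompt(subject: str, question: str, choices: list[str]) -> str:
    header = f"The following are multiple choice questions about {subject}.\n\n"
    shots = "".join(f"{q}\nAnswer: {a}\n\n" for q, a in FEW_SHOT)
    options = "\n".join(f"{label}. {c}" for label, c in zip("ABCD", choices))
    return f"{header}{shots}{question}\n{options}\nAnswer:"

prompt = build_prompt(
    "astronomy",
    "Roughly how long does sunlight take to reach Earth?",
    ["8 minutes", "8 seconds", "8 hours", "8 days"],
)
# The model's next token ("A"-"D") is scored against the gold answer ("A").
```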

Historical context:

  • When released: Most models scored ~25% (random chance)
  • GPT-3: 43.9%
  • 2024 frontier models: ~88% (approaching human expert level of 89.8%)

Current status: A 2024 analysis found significant errors—57% of virology questions had ground-truth mistakes. MMLU is now considered partially saturated, being phased out for MMLU-Pro.

HellaSwag

What it tests: Commonsense reasoning about physical situations.

How it works:

  • 10,000 sentences requiring completion
  • Model selects appropriate ending from 4 choices
  • Uses “Adversarial Filtering”: wrong endings are machine-generated and selected to fool models, so they read fluently but violate common sense

Example: A video shows someone starting to make a sandwich. The model must predict the physically plausible next action, distinguishing it from grammatically correct but nonsensical alternatives.
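
Scoring is typically done by rating each candidate ending under the model and picking the most likely one. Below is a rough sketch of that recipe; `log_likelihood` is a placeholder for whatever scoring API your model exposes, and the context and endings are invented.

```python
# Rough sketch of the usual HellaSwag scoring recipe: score each candidate
# ending by the model's length-normalized log-likelihood and pick the highest.

context = "A person spreads mustard on two slices of bread. They"
endings = [
    " place cheese and ham between the slices and press them together.",
    " throw the bread into the sky and start ice skating on it.",
    " paint the kitchen walls with the second slice.",
    " fold the mustard in half and mail it to a friend.",
]

def log_likelihood(context: str, ending: str) -> float:
    """Placeholder for log P(ending | context) under your model's API."""
    raise NotImplementedError

def pick_ending(context: str, endings: list[str]) -> int:
    scores = [
        log_likelihood(context, e) / len(e.split())  # normalize by length
        for e in endings
    ]
    return max(range(len(endings)), key=scores.__getitem__)
```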

Performance:

  • Humans: 95.6% accuracy
  • State-of-the-art in 2019: <48%
  • GPT-4 (2023): 95.3%
  • Most open models today: ~80%

Why it matters: Tests whether models understand physical causality and real-world plausibility—not just language patterns.

Coding Benchmarks

HumanEval

What it tests: Code generation through functional correctness.

How it works:

  • 164 hand-crafted programming problems
  • Each provides a function signature and docstring to complete, plus an average of ~7.7 unit tests
  • Generated code must pass the tests, not just look correct

The pass@k metric:

  • pass@1: First generated solution must pass
  • pass@10: Any of first 10 solutions can pass

This reflects real developer workflow—generate, test, iterate.
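
Because a single sample is noisy, the HumanEval paper estimates pass@k by drawing n ≥ k samples per problem, counting how many pass, and computing an unbiased estimate of the chance that at least one of k samples would pass:

```python
# Unbiased pass@k estimator from the HumanEval paper: generate n samples per
# problem, count the c that pass the unit tests, and estimate the probability
# that at least one of k randomly drawn samples would pass.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Computes 1 - C(n-c, k) / C(n, k) in a numerically stable way."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for one problem, 30 of which pass the tests.
print(pass_at_k(200, 30, 1))    # ~0.15  (same as the raw pass rate)
print(pass_at_k(200, 30, 10))   # ~0.81  (at least one of 10 samples passes)
```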

Why it matters: Unlike text-similarity metrics, functional correctness ensures code actually works.

Limitations: Problems are relatively simple algorithmic puzzles. Doesn’t capture real-world complexity like navigating large codebases.

SWE-bench

What it tests: Real-world software engineering on actual GitHub issues.

How it works:

  • Model receives: a code repository + an issue description
  • Model must: generate a patch resolving the issue
  • Evaluation: Apply patch, run repository’s test suite
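
Conceptually, the evaluation loop looks like the sketch below: check out the repository at the issue's base commit, apply the model's patch, and re-run the tests that the reference fix turns from failing to passing. This is a simplification (the official harness runs each repository in its own containerized environment), and the paths and test IDs are placeholders.

```python
# Simplified sketch of the SWE-bench evaluation flow; the official harness
# runs each repository inside a dedicated container with pinned dependencies.
import subprocess

def evaluate_patch(repo_dir: str, base_commit: str, patch_file: str,
                   fail_to_pass_tests: list[str]) -> bool:
    """Apply the model's patch at the issue's base commit, then re-run the
    tests that the reference fix is known to turn from failing to passing."""
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    result = subprocess.run(
        ["python", "-m", "pytest", *fail_to_pass_tests],
        cwd=repo_dir,
    )
    return result.returncode == 0  # issue counts as resolved iff tests pass
```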

Dataset:

  • 2,294 unique problems from 12 popular Python repositories
  • SWE-bench Verified: 500 human-validated, high-quality tasks
  • SWE-bench Multilingual: 300 tasks across 9 programming languages

Why it matters: Much closer to real software engineering than HumanEval. Requires:

  • Navigating large codebases
  • Understanding file interactions
  • Identifying subtle bugs
  • Writing patches that integrate with existing code

Current concern: Possible data contamination—models may have memorized patterns from training on the same GitHub repos. Models achieve 76% accuracy identifying buggy files on SWE-bench repos but only 53% on repositories not in the benchmark.

Benchmark Categories at a Glance

| Category | Benchmarks | What They Test |
| --- | --- | --- |
| Long Context | Needle in a Haystack, Oolong, LongBench v2 | Information retrieval and reasoning over large documents |
| Web/Agentic | BrowseComp | Multi-step web navigation and research |
| General Knowledge | MMLU, MMLU-Pro | Academic knowledge across subjects |
| Reasoning | HellaSwag, ARC, DROP | Commonsense and logical reasoning |
| Coding | HumanEval, SWE-bench | Code generation and software engineering |

Key Takeaways

  • Benchmarks become saturated — MMLU went from 25% to 88% as models improved. Once saturated, benchmarks no longer differentiate capabilities.

  • No single benchmark tells the whole story — Each tests specific capabilities. Real-world performance requires combining many skills.

  • “Benchmaxxing” is a risk — Some models may be optimized specifically for benchmarks rather than general capability. Watch for suspiciously high scores on specific tests.

  • Functional correctness > surface similarity — For code, passing tests matters more than looking right. HumanEval’s pass@k metric reflects this.

  • Long context ≠ useful context — Having a 1M token context means nothing if the model can’t retrieve or reason over it. Needle in a Haystack and Oolong expose these gaps.

  • Aggregation is harder than retrieval — Oolong shows that even frontier models struggle when they need to count, summarize, or reason across many pieces of information—not just find one.


Last updated: January 2026

This post is licensed under CC BY 4.0 by the author.