Formal benchmarks are fine. But the tests people actually remember are the ones where AI ate spaghetti strangely, drew extra fingers, processed an 18,000-water-cup request, or spent many hours navigating a cave in Pokémon.
These are the tests that broke through from niche AI circles into mainstream internet culture. Each one exposed something real.
Image generators can be impressive at landscapes, abstract art, and stylized scenes. Certain subjects remain useful stress tests.
Small language tasks that reveal how model behavior changes with tokenization, prompting, and reasoning strategy.
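Letter-level tasks such as reversing a word or counting a letter have exact answers at the character level, so a reference grader is short. The sketch below is only illustrative: the helper names (`reverse_word`, `count_letter`, `is_correct_reversal`) are hypothetical and not part of any benchmark.

```python
def reverse_word(word: str) -> str:
    """Return the word with its characters in reverse order."""
    return word[::-1]


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def is_correct_reversal(word: str, model_answer: str) -> bool:
    """Compare a model's answer to the reference, ignoring case and outer whitespace."""
    return model_answer.strip().lower() == reverse_word(word).lower()


print(reverse_word("banana"))                     # ananab
print(count_letter("strawberry", "r"))            # 3
print(is_correct_reversal("banana", " Ananab "))  # True
```

Code operates on characters directly, while a model typically receives multi-character tokens, which is why the same task is trivial here and unreliable there.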
"Spell 'banana' backwards." Same root cause as strawberry: tokenization means models don't see individual characters. Chain-of-thought helps, but native ability remains inconsistent. Try asking a model to reverse a long, unfamiliar word.
"How many dogs are in this picture?" Models approximate rather than count precisely. The attention mechanism is probabilistic, not exact. Works okay for 2-3 objects, degrades fast with more. Related to the letter-counting problem.
Image generators struggle with relative positioning. Relations such as "left of," "behind," and "between" are often poorly captured in text embeddings. Compositional image benchmarks such as T2I-CompBench exist to test this formally.
Models can solve these with chain-of-thought, but without it, they often guess wrong. Multi-step relational reasoning without explicit step-by-step prompting remains fragile, especially with more than 3-4 entities.
Asked which is larger, 9.9 or 9.11, some models say 9.11 because "11 > 9". They pattern-match the digits after the decimal point instead of applying decimal place value. It is a revealing test of whether a model actually "understands" numbers or just manipulates symbols.
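One way to see the error is to compare the same two strings under two readings. The sketch below assumes the pair is 9.9 versus 9.11 and uses two hypothetical helpers: one reads the strings as decimal numbers, the other compares each dot-separated part as its own integer, the way software version numbers are ordered.

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Return the larger string when both are read as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Return the 'larger' string when each dot-separated part is its own integer."""
    def key(s: str) -> tuple:
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b


print(larger_as_decimal("9.9", "9.11"))  # 9.9
print(larger_as_version("9.9", "9.11"))  # 9.11
```

The "11 > 9" answer matches the version-style reading, which is correct for release numbers and wrong for decimals.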
Classic Winograd schema problems. The answer depends on common-sense physical reasoning (the trophy is too big, not the suitcase). Models have gotten much better at these, but ambiguous pronouns in longer passages still cause errors.
Researchers keep turning to games because they're controlled environments that test planning, creativity, and adaptation, and the results are easy for anyone to understand.
AI models write code to build structures in Minecraft: cottages, castles, snowmen. Tests spatial reasoning, creativity, and code execution in a 3D environment. Created by Adonis Singh, who said the leaderboard "aligns closely with my own experience of using these models."
Developer Paul Calcraft built a platform where two AI models play Pictionary. One doodles, the other guesses. Tests spatial understanding, concept communication, and strategic thinking. Designed to be "un-gameable" by memorization.
Handcrafted interactive environments provide no task instructions or stated goals. Agents must explore, infer how each world works, and learn to succeed. The ARC-AGI-3 technical report says frontier systems scored below 1% as of March 2026; exact leaderboard numbers depend on the dated model and scoring snapshot.
A British programmer created platforms where AI models play Connect 4, Pictionary, and other simple games against each other. Tests strategic decision-making in constrained environments. Results are often surprising: raw intelligence doesn't guarantee good game play.
Video generators, audio systems, and coding models can all look convincing briefly. These tests examine whether motion stays coherent when physics, timing, and interaction matter.
The prompt sounds simple: draw a rotating hexagon, place a ball inside it, and make the ball respond to gravity, friction, and the walls. It is a popular informal test because one small animation exercises geometry, animation, and debugging at the same time.
Where it came from: bouncing-ball simulations are a classic programming exercise. This spinning-hexagon version was posted by developer Flavio Adamo on January 31, 2025 as a comparison between AI coding models.
Why it is complicated: the wall is moving while the ball collides with it. Correct code must find the nearest edge, resolve penetration, calculate the wall's local velocity from the hexagon's rotation, and reflect the ball relative to that moving surface. Corners, high speeds, and varying frame rates can produce tunneling, jitter, energy gain, or a ball that escapes.
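The collision steps listed above can be sketched for a single edge. This is a minimal illustration under stated assumptions, not the code any model or the original post produced: all function names are hypothetical, the hexagon rotates counterclockwise at `omega` radians per second about `center`, the wall is frictionless (only the normal component of velocity changes), and fast-moving balls would still need substepping to avoid tunneling.

```python
import math

Vec = tuple[float, float]


def closest_point_on_segment(p: Vec, a: Vec, b: Vec) -> Vec:
    """Return the point on segment a-b that is nearest to p."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    length_sq = abx * abx + aby * aby
    if length_sq == 0.0:
        return a
    t = ((p[0] - a[0]) * abx + (p[1] - a[1]) * aby) / length_sq
    t = max(0.0, min(1.0, t))  # clamp so corners are handled
    return (a[0] + t * abx, a[1] + t * aby)


def collide_with_rotating_edge(pos: Vec, vel: Vec, radius: float,
                               a: Vec, b: Vec, center: Vec,
                               omega: float, restitution: float = 0.9):
    """Resolve one ball-versus-edge contact; return the new (pos, vel)."""
    c = closest_point_on_segment(pos, a, b)
    dx, dy = pos[0] - c[0], pos[1] - c[1]
    dist = math.hypot(dx, dy)
    if dist >= radius:
        return pos, vel  # no contact with this edge

    # Contact normal, pointing from the wall toward the inside of the hexagon.
    to_center = (center[0] - c[0], center[1] - c[1])
    if dist > 0.0:
        nx, ny = dx / dist, dy / dist
    else:
        norm = math.hypot(*to_center) or 1.0
        nx, ny = to_center[0] / norm, to_center[1] / norm
    if nx * to_center[0] + ny * to_center[1] < 0.0:
        nx, ny = -nx, -ny  # ball center slipped past the wall; push it back in

    # Resolve penetration: place the ball exactly one radius from the wall.
    pos = (c[0] + nx * radius, c[1] + ny * radius)

    # Velocity of the wall at the contact point: omega times the perpendicular of r.
    rx, ry = c[0] - center[0], c[1] - center[1]
    wall_v = (-omega * ry, omega * rx)

    # Reflect the velocity relative to the moving wall, then convert back.
    rel = (vel[0] - wall_v[0], vel[1] - wall_v[1])
    vn = rel[0] * nx + rel[1] * ny
    if vn < 0.0:  # only bounce when moving into the wall
        rel = (rel[0] - (1.0 + restitution) * vn * nx,
               rel[1] - (1.0 + restitution) * vn * ny)
    return pos, (rel[0] + wall_v[0], rel[1] + wall_v[1])


# Bottom edge of a shape centered at the origin, spinning at 1 rad/s.
print(collide_with_rotating_edge((0.5, -0.95), (0.0, -1.0), 0.1,
                                 (-1.0, -1.0), (1.0, -1.0), (0.0, 0.0),
                                 omega=1.0, restitution=1.0))
```

In the usage line the ball leaves faster than it arrived because the contact point on the wall is itself moving upward. A full simulation would run this check against all six edges each step and add friction and gravity separately.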
Generated videos still routinely break basic physics: liquid ignoring gravity, solid objects clipping through each other, cloth that doesn't drape properly. These are hard to detect in still frames but obvious in motion.
Early video models often changed facial identity across frames. Temporal consistency has improved substantially since 2024, but profile turns and quick movements can still cause drift.
Until 2025, most AI video had no audio at all. Veo 3 introduced native audio generation, but sync remains imperfect (the infamous "crunchy spaghetti" sound). Lip sync during speech is improving quickly but still reads as uncanny in most cases.
While main subjects can look convincing in short clips, background characters may loop, freeze, or perform inconsistent actions. This makes background motion a useful repeatability test.
Audio models can transcribe clean speech well, yet higher-order reasoning can collapse when speech overlaps with laughter, weather, music, or classroom noise. RSA-Bench was introduced in January 2026 to test this gap.
Change the microphone, language, reference noise, speaking length, or compression, and voice similarity can degrade sharply. RVCBench evaluates these real-world shifts across the generation pipeline.
Fluent answers can still contain unsupported claims, mismatched sources, invented quotations, or incorrect citations.
Models can produce plausible-looking paper titles, authors, URLs, quotations, and legal citations that do not exist or do not support the claim. Treat every citation as a lead to verify, not proof by itself.
In August 2025, widely shared footage showed a drive-through voice system processing a request for 18,000 water cups. As a test case, it asks whether an ordering agent can detect unusual quantities, manage correction loops, and hand off cleanly to a person.
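One simple guardrail this test probes is a quantity check that sits between the voice model and the order system. The sketch below is hypothetical: the function name and both thresholds are invented for illustration and do not describe how any deployed ordering system works.

```python
def review_quantity(quantity: int, confirm_above: int = 10,
                    handoff_above: int = 50) -> str:
    """Decide what to do with a requested quantity for one menu item.

    Returns "accept", "confirm" (read the quantity back to the customer),
    or "handoff" (route the order to a person).
    """
    if quantity <= 0:
        return "handoff"  # nonsensical input; do not guess
    if quantity > handoff_above:
        return "handoff"  # far outside normal orders
    if quantity > confirm_above:
        return "confirm"  # plausible but unusual; ask before committing
    return "accept"


print(review_quantity(2))      # accept
print(review_quantity(12))     # confirm
print(review_quantity(18000))  # handoff
```

A real agent would also need to cap how many times it loops on "confirm" before handing off, which is the correction-loop part of the test.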
Ask for primary sources, publication dates, direct support for each claim, and a clear statement of uncertainty. The test is whether every important sentence can be traced back to evidence without invented or mismatched citations.
Fluent wording is not calibrated confidence. A model can present a false answer in the same polished style as a correct one, so important claims still need source checks and uncertainty should be evaluated separately from tone.
The academic and industry-standard tests that labs use to compare models. Listed here for reference, not as endorsements of any particular score or model.
Grid-based pattern puzzles testing fluid intelligence. ARC-AGI-2 adds efficiency metrics and contamination controls. Designed by François Chollet.
Expert-crafted questions across dozens of subjects. Scores move quickly: Google reported 48.4% for Gemini 3.1 Deep Think in February 2026 under its stated no-tools setting.
Research-level math from Epoch AI. Includes an "Open Problems" split testing unsolved conjectures. One of the few math benchmarks that hasn't saturated.
Models resolve real bug reports from popular Python repositories end-to-end. It is a widely used coding-agent benchmark, and reported scores have risen dramatically since 2023.
Graduate-level science questions designed to be unsearchable. PhD experts score about 65%. Some AI models have reportedly surpassed this on published leaderboards.
Users choose between anonymous model responses in an Elo-style ranking. It measures preference and satisfaction, not necessarily factual correctness.
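For readers unfamiliar with Elo-style ranking, the textbook update after one head-to-head vote looks like the sketch below. The K-factor of 32 and the 400-point scale are conventional chess defaults assumed here for illustration; the leaderboard's actual rating method may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


print(update_elo(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

The update depends only on which answer the user picked, which is why the ranking reflects preference and not factual correctness.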
A longstanding general-knowledge benchmark on which frontier scores have become high. MMLU-Pro uses harder questions with 10 answer options instead of 4.
Tests whether AI can complete multi-step tasks autonomously: browsing the web, using tools, combining information across sources.
Fresh competitive programming problems collected from LeetCode, AtCoder, and Codeforces. Designed to be contamination-resistant: a model is evaluated on problems published after its training cutoff, so it cannot have seen them during training.
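The contamination-resistance idea reduces to a date filter: score a model only on problems released after its training cutoff. A minimal sketch, with a hypothetical `fresh_problems` helper and made-up problem records and dates:

```python
from datetime import date


def fresh_problems(problems: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only problems published strictly after the training cutoff."""
    return [p for p in problems if p["published"] > training_cutoff]


# Illustrative records; the IDs and dates are invented.
problems = [
    {"id": "A", "published": date(2024, 1, 10)},
    {"id": "B", "published": date(2025, 3, 2)},
]

print([p["id"] for p in fresh_problems(problems, date(2024, 6, 1))])  # ['B']
```

Because each model has its own cutoff, the evaluated problem set can differ from model to model, which matters when comparing scores.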
A broad evaluation framework covering accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.
Uses regularly refreshed questions to reduce memorization and training-data contamination.
Industry standard for benchmarking hardware and ML frameworks. Compares GPUs, TPUs, and software stacks on standardized workloads.
Tests whether agents can inspect files, use command-line tools, edit code, recover from errors, and finish multi-step work in containerized environments.
Measures persistent, strategic web research using questions whose answers are hard to find but relatively easy to verify once located.
Evaluates multimodal agents on practical tasks across operating systems and desktop applications, including navigation, editing, and multi-step completion.
Tests whether a model gives truthful, informative answers instead of imitating common human falsehoods and popular misconceptions.
A harder software-engineering benchmark with diverse repositories, multi-file changes, and contamination controls.
Long-horizon tasks created by professionals in investment banking, consulting, and corporate law.
Tests conversational agents in shared environments where both the user and agent must take actions and coordinate.
Evaluates multi-step tool use across 36 real Model Context Protocol servers and 220 tools.
A harder MMMU variant designed to remove text-only shortcuts and require integrated visual and textual understanding.
Tests whether a model can distinguish and retrieve the correct repeated item from a very long multi-turn context.
Measures planning, exploration, spatial reasoning, and adaptation across game environments of varying complexity.
Agents browse internal sites, write code, run programs, and communicate with simulated coworkers in a self-contained workplace.
Evaluates perception and reasoning over videos across duration, domain, and modality settings.
Tests whether agents can detect, patch, and exploit curated high-severity smart-contract vulnerabilities in controlled environments.
There is no universal pass mark. This compact view reproduces selected published figures from Google DeepMind's 2026 Gemini comparison page. Model names, tool settings, reasoning levels, and harnesses matter. A high score means strong performance on that specific setup, not that the underlying capability is completely solved.
| Benchmark | Gemini 3.1 Pro | Claude Sonnet 4.6 | Claude Opus 4.6 | GPT published comparison | Museum reading |
|---|---|---|---|---|---|
| Humanity's Last Exam (no tools) | 44.4% | 33.2% | 40.0% | GPT-5.2: 34.5% | Partial; no model is near complete coverage |
| ARC-AGI-2 (ARC Prize verified) | 77.1% | 58.3% | 68.8% | GPT-5.2: 52.9% | Strong progress, still setup-specific |
| GPQA Diamond (no tools) | 94.3% | 89.9% | 91.3% | GPT-5.2: 92.4% | Very high published scores |
| Terminal-Bench 2.0 (Terminus-2 harness) | 68.5% | 59.1% | 65.4% | GPT-5.3-Codex: 64.7% | Capable, but many tasks remain |
| SWE-Bench Pro Public (single attempt) | 54.2% | — | — | GPT-5.3-Codex: 56.8% | Partial on harder coding work |
| APEX-Agents | 33.5% | — | 29.8% | GPT-5.2: 23.0% | Long-horizon professional work remains open |
| τ²-Bench (retail / telecom) | 90.8% / 99.3% | 91.7% / 97.9% | 91.9% / 99.3% | GPT-5.2: 82.0% / 98.7% | Strong in these defined domains |
| MCP-Atlas | 69.2% | 61.3% | 59.5% | GPT-5.2: 60.6% | Useful tool skill, not universal reliability |
| BrowseComp (search + Python + browse) | 85.9% | 74.7% | 84.0% | GPT-5.2: 65.8% | Strong published browsing results |
| MMMU-Pro (no tools) | 80.5% | 74.5% | 73.9% | GPT-5.2: 79.5% | Strong, with meaningful remaining errors |
Source and methodology: Google DeepMind's Gemini 3 performance table. Accessed June 5, 2026. Cross-provider figures are shown as reproduced by that source; this museum did not independently rerun them.