The Weird AI Test Museum

The Big Ones

Tests that became famous

These are the tests that broke through from niche AI circles into mainstream internet culture. Each one exposed something real.

🍝

Will Smith Eating Spaghetti

The internet's unofficial video generation stress test

Cultural icon • Much improved (short clips) • Video generation • Est. March 2023

In March 2023, someone typed "Will Smith eating spaghetti" into ModelScope's text-to-video model. The result showed faces changing between frames, forks merging into fingers, and noodles behaving inconsistently. The clip spread widely online.

It became a benchmark because the scene combines several difficult problems at once: face consistency across frames, hand-object interaction, deformable food physics, temporal coherence, and (later) audio sync.

Will Smith posted a parody in February 2024, mimicking the AI version. Public recreations have since used systems including Google's Veo 2 and Veo 3, Kling 3.0, and Seedance 2.0. Results still depend on clip length, prompt, model version, and selection.

Timeline

March 2023: Original ModelScope video spreads widely online • Feb 2024: Will Smith posts Instagram parody • Dec 2024: a Google Veo 2 recreation shows major improvement • May 2025: Veo 3 adds audio sync (still crunchy) • 2026: newer systems produce much more coherent short clips

What it tests

Face consistency, hand-object interaction, deformable physics, temporal coherence, audio sync

Why it works

One simple scene combines several difficult video-generation tasks

Current status

Much improved for selected short clips. Longer scenes and repeatability remain harder.

Cultural note

Has its own Wikipedia page. Compared to the Utah Teapot as a "pre-meme meme" benchmark.

Wikipedia ↗ Know Your Meme ↗ TechCrunch ↗ 2026 update ↗

🍓

How Many R's in Strawberry?

The question that broke the internet's faith in AI

Cultural icon • Usually handled by newer reasoning models • Language / tokenization • Went viral 2024

The correct answer is 3. Earlier language models often answered 2 in widely shared examples. Some responses even highlighted all three letters while returning the wrong count.

One important factor is subword tokenization: language models usually process chunks rather than individual characters. Character counting therefore requires an extra reasoning step and can still fail despite the word being familiar.

The test became closely associated with OpenAI's reasoning-model project, which was widely reported under the codename "Strawberry." Newer reasoning systems usually handle the original prompt, while variations can still expose brittle character-level reasoning.

What it exposed

Tokenization blindness: LLMs process chunks, not individual characters

Broader class

All character-level tasks: counting letters, spelling backwards, string reversal, finding the nth letter

The workaround

Putting spaces between letters (s t r a w b e r r y) often helps, because each letter is then far more likely to be processed as its own token

Current status

Newer reasoning models usually answer the original prompt correctly. Variations can still expose character-level errors.
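The character-level tasks listed above all have exact answers that ordinary string code can compute, which makes them easy to grade. The sketch below is a minimal reference-answer helper, not any lab's evaluation code; the function names are made up for illustration.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter."""
    return word.lower().count(letter.lower())


def nth_letter(word: str, n: int) -> str:
    """Return the n-th letter, counting from 1 as a person would."""
    return word[n - 1]


def space_out(word: str) -> str:
    """Insert spaces between characters, as in the workaround above."""
    return " ".join(word)


if __name__ == "__main__":
    print(count_letter("strawberry", "r"))  # 3
    print(nth_letter("strawberry", 3))      # r
    print(space_out("strawberry"))          # s t r a w b e r r y
```

Comparing a model's reply against these values separates a wrong count from a merely oddly worded one.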

Deep Dive ↗ OpenAI Forum ↗ Technical Breakdown ↗

💾

AI Plays Pokémon Red

A familiar game used to test long-horizon agents

Harness-dependent progress • Agentic / long-horizon • Est. Feb 2025

In February 2025, a public Claude 3.7 Sonnet stream used Pokémon Red as a long-horizon agent test. The system handled some battles but spent roughly 80 hours navigating Mt. Moon, repeating actions and revisiting locations.

The contrast is useful: strong academic benchmark performance does not automatically transfer to sustained interactive tasks. The challenge is not recalling Pokémon facts but long-horizon planning, visual understanding of pixel art, and maintaining state across thousands of actions.

Other systems have completed Pokémon games with different amounts of memory, tooling, controller logic, retries, and human help. Those setups are not clean model-only comparisons, so the harness matters as much as the model name.

What it tests

Long-horizon planning, visual processing of pixel art, memory persistence, navigation, goal tracking

Why it matters

Shows the gap between acing exams and doing sustained autonomous work. "If you want an AI agent to do your job, it can't forget what it did 5 minutes ago."

Memorable moment

The public stream spent roughly 80 hours navigating Mt. Moon.

Current status

Some systems have completed games with scaffolding. Compare the model, memory, tools, resets, controller, and human assistance.

TIME ↗ LessWrong Analysis ↗ Futurism ↗ Watch Live ↗

Image Generation

When AI can't draw what it sees

Image generators can be impressive at landscapes, abstract art, and stylized scenes. Certain subjects remain useful stress tests.

✋

The Hands & Fingers Problem

Six, seven, sometimes eight fingers per hand

Much improved on standard poses • Less consistent on unusual angles • Image generation • 2022-2023 peak

Hands became a common informal check for AI-generated images because earlier systems often produced extra, merged, or misplaced fingers. Complex hand-object interactions remain more difficult than simple portraits.

Why it happens: Hands have many possible poses, are often partly hidden, and require precise relationships between anatomy and nearby objects. Newer image systems have improved substantially, but there is no single standardized “hand accuracy” percentage that applies across prompts and models.

Why it is difficult

Occlusion, varied poses, small anatomical details, and precise hand-object relationships

Related tests

Teeth, ears, arm positions, feet, and interactions between multiple subjects

Evaluation note

Judge a repeatable prompt set, not one selected success or failure

Current status

Standard poses are much improved. Complex interactions and unusual viewpoints remain less consistent.

BuzzFeed News ↗ Hyperallergic ↗

🔤

Text-in-Image Rendering

When "OPEN" becomes "OEPN HOPRS"

Much improved for short text • Image generation • 2022-2024

Early image generators often produced text-like shapes instead of readable words on signs, covers, and labels. This made text rendering an easy visual stress test.

Why: Image generation must coordinate visual style with exact character sequences and layout. Systems including DALL-E 3 and later Midjourney versions showed clear improvements, while long phrases, unusual fonts, and small print remain harder.

Root cause

Diffusion models learn visual patterns, not symbolic meaning of characters

Current status

Short text is much improved. Long phrases, stylized fonts, and small print remain less consistent.

DALL-E 3 ↗

🐦

Pelican Riding a Bicycle

Simon Willison's beloved SVG benchmark

Steadily improving • Code generation / spatial reasoning • Est. late 2024

Developer Simon Willison asks every new LLM the same thing: "Generate an SVG of a pelican riding a bicycle." It started as a joke but became genuinely useful. The task requires understanding what both a pelican and a bicycle look like, figuring out how they'd combine, and writing precise SVG code to draw it, all in a single attempt.

Results range from recognizable scenes to abstract compositions. Willison tracks outputs across models and has used side-by-side judging to compare them. Common inconsistencies include wheel orientation, subject identity, and how the pelican is positioned on the bicycle.

What it tests

Spatial reasoning, concept composition, SVG/code generation, visual understanding without vision

Why it's hard

Combines two distinct visual concepts into a coherent scene using code, not pixels

Evaluation note

Rendering and visual review matter because valid SVG code can still depict the requested scene poorly

Current status

Results continue to improve, but model ordering changes and depends on the judging setup.

Willison's Blog ↗ GitHub ↗ Gigazine ↗

🦷

The Teeth Problem

Uncanny grins with 40+ teeth

Much improved • Image generation • 2022-2023

Like fingers, teeth are small, repetitive structures. Earlier image systems often produced inconsistent tooth counts or shapes. Newer systems have improved substantially, though close-up smiles remain a useful detail check.

Hyperallergic ↗

Language & Reasoning

Simple questions that shouldn't be hard

Small language tasks that reveal how model behavior changes with tokenization, prompting, and reasoning strategy.

🔄 Spell It Backwards

String reversal and character manipulation

"Spell 'banana' backwards." Same root cause as strawberry: tokenization means models don't see individual characters. Chain-of-thought helps, but native ability remains inconsistent. Try asking a model to reverse a long, unfamiliar word.

Improving

Research ↗
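To try the long, unfamiliar word variant without hand-picking examples, a probe generator can produce random strings together with their reversed form. This is a small sketch under stated assumptions: `make_probe` and `is_correct` are hypothetical helper names, and random lowercase strings stand in for "unfamiliar words".

```python
import random
import string


def make_probe(length: int, seed: int) -> tuple[str, str]:
    """Return a random lowercase word and its expected reversal."""
    rng = random.Random(seed)  # seeded so the same probe can be reused across models
    word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return word, word[::-1]


def is_correct(model_answer: str, expected: str) -> bool:
    """Compare a model's reply to the expected reversal, ignoring case and outer whitespace."""
    return model_answer.strip().lower() == expected


if __name__ == "__main__":
    word, expected = make_probe(length=14, seed=7)
    print(f"Spell '{word}' backwards.")
    print("expected:", expected)
```

Fixing the seed keeps the prompt set repeatable, which matters more than any single success or failure.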

🔢 Count the Objects

Exact counting in images and text

"How many dogs are in this picture?" Models approximate rather than count precisely. The attention mechanism is probabilistic, not exact. Works okay for 2-3 objects, degrades fast with more. Related to the letter-counting problem.

Improving slowly

📋 Spatial Reasoning

"Put the red ball to the LEFT of the blue cube"

Image generators struggle with relative positioning. Relations such as "left of," "behind," and "between" are often poorly captured by the text embeddings that condition the image. Compositional image benchmarks such as T2I-CompBench test this formally.

Improving

T2I-CompBench ↗

🎲 Simple Logic Puzzles

"A is taller than B. B is taller than C. Who is shortest?"

Models can solve these with chain-of-thought, but without it, they often guess wrong. Multi-step relational reasoning without explicit step-by-step prompting remains fragile, especially with more than 3-4 entities.
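Puzzles of this kind are easy to generate at any size, which helps when probing how accuracy changes beyond three or four entities. The following is an illustrative sketch only; the function names and the sentence template are assumptions, not part of any published benchmark.

```python
import random


def make_height_puzzle(names: list[str], seed: int) -> tuple[list[str], str]:
    """Build shuffled 'X is taller than Y.' statements from names ordered tallest to shortest.

    Returns the statements and the expected answer (the shortest person).
    """
    rng = random.Random(seed)
    statements = [f"{a} is taller than {b}." for a, b in zip(names, names[1:])]
    rng.shuffle(statements)  # shuffling removes the hint given by sentence order
    return statements, names[-1]


def solve_shortest(statements: list[str]) -> str:
    """Reference solver: the shortest person is never stated to be taller than anyone."""
    taller, everyone = set(), set()
    for statement in statements:
        a, b = statement.rstrip(".").split(" is taller than ")
        taller.add(a)
        everyone.update((a, b))
    (shortest,) = everyone - taller  # exactly one name remains for a single chain
    return shortest


if __name__ == "__main__":
    statements, answer = make_height_puzzle(["A", "B", "C", "D", "E"], seed=1)
    print(" ".join(statements), "Who is shortest?")
    print("expected:", answer)
```

Growing the list of names lengthens the chain, so the same harness covers both the three-entity and the many-entity case.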

Improving

🧠 The 9.11 vs 9.9 Problem

"Which is larger, 9.11 or 9.9?"

Some models say 9.11 because "11 > 9". They're pattern-matching the digits after the decimal point rather than understanding decimal place value. The correct answer is 9.9, since 9.90 is larger than 9.11. A revealing test of whether a model actually "understands" numbers or just manipulates symbols.
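The two readings can be made explicit in code. The sketch below compares the same strings once as decimal numbers and once part by part, the way version numbers are ordered; the second function is only a model of the mistaken reading, not a claim about what happens inside any system.

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Compare by numeric value: 9.9 means 9.90, which exceeds 9.11."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Compare dot-separated parts as whole numbers, as in version strings."""
    def key(text: str) -> tuple[int, ...]:
        return tuple(int(part) for part in text.split("."))
    return a if key(a) > key(b) else b


if __name__ == "__main__":
    print(larger_as_decimal("9.11", "9.9"))  # 9.9
    print(larger_as_version("9.11", "9.9"))  # 9.11
```

Both orderings are legitimate in their own setting, which is part of why the prompt is a useful probe: the question says "larger", so only the decimal reading applies.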

Mostly fixed in frontier models

💬 Pronoun Resolution

"The trophy doesn't fit in the suitcase because IT is too big. What is too big?"

Classic Winograd schema problems. The answer depends on common-sense physical reasoning (the trophy is too big, not the suitcase). Models have gotten much better at these, but ambiguous pronouns in longer passages still cause errors.

Strong on standard examples

Game Benchmarks

Games as testing grounds

Researchers keep turning to games because they're controlled environments that test planning, creativity, and adaptation, and the results are easy for anyone to understand.

⛏ MC-Bench (Minecraft)

Built by a 16-year-old developer

AI models write code to build structures in Minecraft: cottages, castles, snowmen. Tests spatial reasoning, creativity, and code execution in a 3D environment. Created by Adonis Singh, who said the leaderboard "aligns closely with my own experience of using these models."

Active • Spatial + code

TechCrunch ↗

🎨 LLM Pictionary

One AI draws, another guesses

Developer Paul Calcraft built a platform where two AI models play Pictionary. One doodles, the other guesses. Tests spatial understanding, concept communication, and strategic thinking. Designed to be "un-gameable" by memorization.

Active • Visual + strategic

TechCrunch ↗

🎮 ARC-AGI-3

Humans solve the environments; frontier AI remained below 1% in March 2026 testing

Handcrafted interactive environments provide no task instructions or stated goals. Agents must explore, infer how each world works, and learn to succeed. The ARC-AGI-3 technical report says frontier systems scored below 1% as of March 2026; exact leaderboard numbers depend on the dated model and scoring snapshot.

Open challenge • $2M prize program

ARC-AGI-3 ↗ Leaderboard ↗

♟ AI Connect 4 / Board Games

Strategic thinking in simple games

A British programmer created platforms where AI models play Connect 4, Pictionary, and other simple games against each other. Tests strategic decision-making in constrained environments. Results are often surprising: strong scores elsewhere don't guarantee good gameplay.

Active

Video & Motion

Beyond spaghetti: other motion tests

Video generators, audio systems, and coding models can all look convincing briefly. These tests examine whether motion stays coherent when physics, timing, and interaction matter.

Project example: a red ball bouncing inside a rotating white hexagon

⬢ The Rotating Hexagon Test

A compact coding prompt with surprisingly unforgiving physics

Mostly solved in short/simple cases • Code generation • Physics simulation

The prompt sounds simple: draw a rotating hexagon, place a ball inside it, and make the ball respond to gravity, friction, and the walls. It is a popular informal test because one small animation exercises geometry, animation, and debugging at the same time.

Where it came from: bouncing-ball simulations are a classic programming exercise. This spinning-hexagon version was posted by developer Flavio Adamo on January 31, 2025 as a comparison between AI coding models.

Why it is complicated: the wall is moving while the ball collides with it. Correct code must find the nearest edge, resolve penetration, calculate the wall's local velocity from the hexagon's rotation, and reflect the ball relative to that moving surface. Corners, high speeds, and varying frame rates can produce tunneling, jitter, energy gain, or a ball that escapes.
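The collision steps described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not Flavio Adamo's code or any model's output: the constants are arbitrary, each wall is treated as an infinite line (reasonable for a small ball inside a convex hexagon), friction is left out, and a fixed time step is used.

```python
import math
from dataclasses import dataclass

R = 1.0            # hexagon circumradius (assumed)
BALL_R = 0.05      # ball radius (assumed)
OMEGA = 1.5        # hexagon angular velocity in rad/s (assumed)
GRAVITY = -9.81    # acceleration along y (assumed)
RESTITUTION = 0.9  # fraction of normal speed kept per bounce (assumed)


@dataclass
class State:
    x: float
    y: float
    vx: float
    vy: float
    theta: float  # current rotation of the hexagon


def walls(theta):
    """Yield (ax, ay, nx, ny): a point on each wall and its inward unit normal."""
    verts = [(R * math.cos(theta + k * math.pi / 3),
              R * math.sin(theta + k * math.pi / 3)) for k in range(6)]
    for i in range(6):
        (ax, ay), (bx, by) = verts[i], verts[(i + 1) % 6]
        ex, ey = bx - ax, by - ay
        length = math.hypot(ex, ey)
        yield ax, ay, -ey / length, ex / length  # left normal of a CCW edge points inward


def min_margin(s):
    """Smallest gap between the ball surface and any wall (negative means penetration)."""
    return min((s.x - ax) * nx + (s.y - ay) * ny - BALL_R
               for ax, ay, nx, ny in walls(s.theta))


def step(s, dt):
    """Advance one fixed time step: integrate, rotate the hexagon, resolve collisions."""
    s.vy += GRAVITY * dt
    s.x += s.vx * dt
    s.y += s.vy * dt
    s.theta += OMEGA * dt
    for ax, ay, nx, ny in walls(s.theta):
        dist = (s.x - ax) * nx + (s.y - ay) * ny  # centre-to-wall distance
        if dist >= BALL_R:
            continue
        # Contact point on the wall and the wall's local velocity there (omega cross r).
        cx, cy = s.x - nx * dist, s.y - ny * dist
        wvx, wvy = -OMEGA * cy, OMEGA * cx
        # Resolve penetration by pushing the ball back inside.
        s.x += nx * (BALL_R - dist)
        s.y += ny * (BALL_R - dist)
        # Reflect the velocity relative to the moving wall, then return to the world frame.
        rvx, rvy = s.vx - wvx, s.vy - wvy
        vn = rvx * nx + rvy * ny
        if vn < 0:  # only reflect when the ball is moving into the wall
            rvx -= (1 + RESTITUTION) * vn * nx
            rvy -= (1 + RESTITUTION) * vn * ny
        s.vx, s.vy = rvx + wvx, rvy + wvy
    return s


if __name__ == "__main__":
    state = State(0.2, 0.0, 1.0, 0.0, 0.0)
    for _ in range(2400):  # ten simulated seconds at 240 steps per second
        step(state, 1 / 240)
    print("ball still inside:", min_margin(state) >= -1e-9)
```

Subtracting the wall's velocity before reflecting is the step that a static-wall bounce omits. A variable time step or a much faster rotation would call for sub-stepping or swept collision tests, which this sketch leaves out.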

What it tests

Canvas or graphics APIs, vector math, collision detection, animation loops, and instruction following

Easy to fake

A clip may look convincing briefly while using unstable or physically incorrect collision handling

Hard mode

Resize the scene, vary frame rate, increase rotation speed, and run long enough to expose drift or escapes

Current status

Common coding models often make a workable demo, but robust physics and consistent behavior still depend on the prompt and implementation

Flavio Adamo's post ↗

🎥 Physics Violations

Water flowing upward, objects passing through each other

Generated videos still routinely break basic physics: liquid ignoring gravity, solid objects clipping through each other, cloth that doesn't drape properly. These are hard to detect in still frames but obvious in motion.

Improving

👤 Identity Preservation

The face that drifts between frames

Early video models often changed facial identity across frames. Temporal consistency has improved substantially since 2024, but profile turns and quick movements can still cause drift.

Major progress

🎤 Audio-Visual Sync

2025: the year AI video "left the silent film era"

Until 2025, most AI video had no audio at all. Veo 3 introduced native audio generation, but sync remains imperfect (the infamous "crunchy spaghetti" sound). Lip sync during speech is advancing fast but still detectable as uncanny in most cases.

New frontier

📸 Background Character Behavior

The extras who don't quite act human

While main subjects can look convincing in short clips, background characters may loop, freeze, or perform inconsistent actions. This makes background motion a useful repeatability test.

Still obvious

🎧 Noisy-Scene Reasoning

Hearing words is easier than understanding the room

Audio models can transcribe clean speech well, yet higher-order reasoning can collapse when speech overlaps with laughter, weather, music, or classroom noise. RSA-Bench was introduced in January 2026 to test this gap.

Still fragile

RSA-Bench ↗

🎙 Voice-Cloning Robustness

A clean demo is not a deployment test

Change the microphone, language, reference noise, speaking length, or compression and voice similarity can degrade sharply. RVCBench evaluates these real-world shifts across the generation pipeline.

Improving fast

RVCBench ↗

Hallucinations & Confabulation

When AI makes things up confidently

Fluent answers can still contain unsupported claims, mismatched sources, invented quotations, or incorrect citations.

📰 Citation Verification

Checking papers, URLs, quotations, and support

Models can produce plausible-looking paper titles, authors, URLs, quotations, and legal citations that do not exist or do not support the claim. Treat every citation as a lead to verify, not proof by itself.

Getting better but not solved

AIMultiple ↗

🥤 18,000 Water Cups

A widely reported drive-through edge case

In August 2025, widely shared footage showed a drive-through voice system processing a request for 18,000 water cups. As a test case, it asks whether an ordering agent can detect unusual quantities, manage correction loops, and hand off cleanly to a person.
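The three behaviours named above (flag unusual quantities, bound the correction loop, hand off to a person) can be expressed as a small guard in front of the order. This is a hypothetical sketch: the threshold, the retry limit, and every name in it are assumptions for illustration, not details of the system in the footage.

```python
from dataclasses import dataclass

MAX_AUTO_QUANTITY = 20  # hypothetical per-item limit before asking the customer to confirm
MAX_CORRECTIONS = 2     # hypothetical number of failed clarifications before handing off


@dataclass
class Decision:
    action: str  # "accept", "confirm", or "handoff"
    reason: str


def review_line_item(item: str, quantity: int, failed_corrections: int = 0) -> Decision:
    """Decide whether to accept an order line, ask again, or route to a person."""
    if failed_corrections >= MAX_CORRECTIONS:
        return Decision("handoff", f"could not confirm '{item}' after {failed_corrections} attempts")
    if quantity <= 0:
        return Decision("confirm", f"quantity {quantity} is not a valid amount")
    if quantity > MAX_AUTO_QUANTITY:
        return Decision("confirm", f"quantity {quantity} exceeds the usual limit of {MAX_AUTO_QUANTITY}")
    return Decision("accept", "within normal range")


if __name__ == "__main__":
    print(review_line_item("water cup", 18000))
    print(review_line_item("water cup", 18000, failed_corrections=2))
    print(review_line_item("water cup", 2))
```

The limit is a policy choice, so a test suite would vary it along with the phrasing of the request.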

Still fragile • Real-world test

BBC ↗ TechCrunch ↗

🔍 Source Verification

Can the answer survive a citation audit?

Ask for primary sources, publication dates, direct support for each claim, and a clear statement of uncertainty. The test is whether every important sentence can be traced back to evidence without invented or mismatched citations.

Still fragile

📈 Confident Incorrectness

Fluency is not the same as certainty

Fluent wording is not calibrated confidence. A model can present a false answer in the same polished style as a correct one, so important claims still need source checks and uncertainty should be evaluated separately from tone.

Structural issue

The Serious Stuff

Formal benchmarks, briefly

The academic and industry-standard tests that labs use to compare models. Listed here for reference, not as endorsements of any particular score or model.

ARC-AGI-1 / 2

Reasoning • Visual puzzles

Grid-based pattern puzzles testing fluid intelligence. ARC-AGI-2 adds efficiency metrics and contamination controls. Designed by François Chollet.

ARC Prize ↗

Humanity's Last Exam

Multi-domain • 2,500 expert-level questions

Expert-crafted questions across dozens of subjects. Scores move quickly: Google reported 48.4% for Gemini 3.1 Deep Think in February 2026 under its stated no-tools setting.

Leaderboard ↗

FrontierMath

Mathematics • Research-level

Research-level math from Epoch AI. Includes an "Open Problems" split testing unsolved conjectures. One of the few math benchmarks that hasn't saturated.

Epoch AI ↗

SWE-bench Verified

Coding • Real GitHub issues

Models resolve real bug reports from popular Python repositories end-to-end. It is a widely used coding-agent benchmark, and reported scores have risen dramatically since 2023.

SWE-bench ↗

GPQA Diamond

Science • 198 PhD-level questions

Graduate-level science questions designed to be unsearchable. PhD experts score about 65%. Some AI models have reportedly surpassed this on published leaderboards.

Paper ↗

Chatbot Arena (LMSYS)

Human preference • Blind comparisons

Users choose between anonymous model responses in an Elo-style ranking. It measures preference and satisfaction, not necessarily factual correctness.
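For readers unfamiliar with Elo-style ranking, the classic update rule is short enough to show. The sketch below is textbook Elo with the conventional chess constants (a 400-point scale and K = 32); it is an assumption-labeled illustration of the general idea, not the leaderboard's actual scoring code.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return new ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta  # points gained by one side are lost by the other


if __name__ == "__main__":
    a, b = 1000.0, 1000.0
    a, b = update(a, b, score_a=1.0)  # a voter preferred model A
    print(a, b)  # 1016.0 984.0
```

Because each vote only records which response a person preferred, the resulting rating tracks preference, which is why it says nothing directly about factual correctness.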

LM Arena ↗

MMLU / MMLU-Pro

Knowledge • 57+ subjects

A longstanding general-knowledge benchmark on which frontier models now score highly. MMLU-Pro uses harder questions with 10 answer options instead of 4.

GitHub ↗

GAIA

Agentic • Multi-step real-world tasks

Tests whether AI can complete multi-step tasks autonomously: browsing the web, using tools, combining information across sources.

Leaderboard ↗

LiveCodeBench

Coding • Continuously updated

Fresh competitive programming problems harvested from LeetCode, AtCoder, and Codeforces. Designed to be contamination-resistant: problems are dated, so a model can be scored only on those published after its training data was collected.
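The contamination idea comes down to a date filter: keep only problems released after a model's training cutoff. The sketch below illustrates that filter with made-up records; the `Problem` type, the slugs, and the dates are hypothetical, and this is not the benchmark's own tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    slug: str       # hypothetical identifier
    source: str     # e.g. "LeetCode", "AtCoder", "Codeforces"
    released: date  # date the problem first became public


def fresh_problems(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


if __name__ == "__main__":
    pool = [
        Problem("example-a", "LeetCode", date(2024, 3, 1)),    # made-up entries
        Problem("example-b", "AtCoder", date(2025, 1, 15)),
        Problem("example-c", "Codeforces", date(2025, 6, 30)),
    ]
    for problem in fresh_problems(pool, training_cutoff=date(2024, 12, 31)):
        print(problem.slug, problem.released)
```

Each model gets its own cutoff, so two models may be scored on different subsets; comparisons are cleanest on the window that follows the later cutoff.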

LiveCodeBench ↗

Stanford HELM

Safety • 7 dimensions, 42 scenarios

A broad evaluation framework covering accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.

HELM ↗

LiveBench

Knowledge • Anti-contamination

Uses regularly refreshed questions to reduce memorization and training-data contamination.

LiveBench ↗

MLPerf

Hardware • Training + inference speed

Industry standard for benchmarking hardware and ML frameworks. Compares GPUs, TPUs, and software stacks on standardized workloads.

MLCommons ↗

Terminal-Bench 2.0

Agents • Real terminal workflows

Tests whether agents can inspect files, use command-line tools, edit code, recover from errors, and finish multi-step work in containerized environments.

Epoch AI ↗

BrowseComp

Research agents • 1,266 questions

Measures persistent, strategic web research using questions whose answers are hard to find but relatively easy to verify once located.

Benchmark ↗

OSWorld

Computer use • Desktop workflows

Evaluates multimodal agents on practical tasks across operating systems and desktop applications, including navigation, editing, and multi-step completion.

OSWorld ↗

TruthfulQA

Factuality • Misconception resistance

Tests whether a model gives truthful, informative answers instead of imitating common human falsehoods and popular misconceptions.

Paper ↗

SWE-Bench Pro

Coding agents • Longer repository tasks

A harder software-engineering benchmark with diverse repositories, multi-file changes, and contamination controls.

Scale AI ↗

APEX-Agents

Professional agents • Cross-application work

Long-horizon tasks created by professionals in investment banking, consulting, and corporate law.

Paper ↗

τ²-Bench

Tool use • User-agent coordination

Tests conversational agents in shared environments where both the user and agent must take actions and coordinate.

Paper ↗

MCP-Atlas

Tool use • Real MCP servers

Evaluates multi-step tool use across 36 real Model Context Protocol servers and 220 tools.

Scale Labs ↗

MMMU-Pro

Multimodal • Expert visual reasoning

A harder MMMU variant designed to remove text-only shortcuts and require integrated visual and textual understanding.

Paper ↗

MRCR

Long context • Multiple hidden needles

Tests whether a model can distinguish and retrieve the correct repeated item from a very long multi-turn context.

Dataset ↗

BALROG

Agents • Long-horizon games

Measures planning, exploration, spatial reasoning, and adaptation across game environments of varying complexity.

BALROG ↗

TheAgentCompany

Workplace agents • Simulated company

Agents browse internal sites, write code, run programs, and communicate with simulated coworkers in a self-contained workplace.

Paper ↗

Video-MME

Video understanding • Multimodal reasoning

Evaluates perception and reasoning over videos across duration, domain, and modality settings.

CVPR paper ↗

EVMbench

Security agents • Smart contracts

Tests whether agents can detect, patch, and exploit curated high-severity smart-contract vulnerabilities in controlled environments.

Benchmark ↗

June 2026 snapshot: which models posted strong results?

There is no universal pass mark. This compact view reproduces selected published figures from Google DeepMind's 2026 Gemini comparison page. Model names, tool settings, reasoning levels, and harnesses matter. A high score means strong performance on that specific setup, not that the underlying capability is completely solved.

Benchmark (setting)	Gemini 3.1 Pro	Claude Sonnet 4.6	Claude Opus 4.6	GPT published comparison	Museum reading
Humanity's Last Exam (no tools)	44.4%	33.2%	40.0%	GPT-5.2: 34.5%	Partial; no model is near complete coverage
ARC-AGI-2 (ARC Prize verified)	77.1%	58.3%	68.8%	GPT-5.2: 52.9%	Strong progress, still setup-specific
GPQA Diamond (no tools)	94.3%	89.9%	91.3%	GPT-5.2: 92.4%	Very high published scores
Terminal-Bench 2.0 (Terminus-2 harness)	68.5%	59.1%	65.4%	GPT-5.3-Codex: 64.7%	Capable, but many tasks remain
SWE-Bench Pro Public (single attempt)	54.2%	—	—	GPT-5.3-Codex: 56.8%	Partial on harder coding work
APEX-Agents	33.5%	—	29.8%	GPT-5.2: 23.0%	Long-horizon professional work remains open
τ²-Bench (retail / telecom)	90.8% / 99.3%	91.7% / 97.9%	91.9% / 99.3%	GPT-5.2: 82.0% / 98.7%	Strong in these defined domains
MCP-Atlas	69.2%	61.3%	59.5%	GPT-5.2: 60.6%	Useful tool skill, not universal reliability
BrowseComp (search + Python + browse)	85.9%	74.7%	84.0%	GPT-5.2: 65.8%	Strong published browsing results
MMMU-Pro (no tools)	80.5%	74.5%	73.9%	GPT-5.2: 79.5%	Strong, with meaningful remaining errors

Source and methodology: Google DeepMind's Gemini 3 performance table. Accessed June 5, 2026. Cross-provider figures are shown as reproduced by that source; this museum did not independently rerun them.
