A collection of AI's strangest tests

The Weird AI Test Museum

Formal benchmarks are fine. But the tests people actually remember are the ones where AI ate spaghetti strangely, drew extra fingers, processed an 18,000-water-cup request, or spent many hours navigating a cave in Pokémon.

A for-fun field guide to a fast-moving technology. This page does not rank models or claim which AI is "best." It collects memorable tests, explains what they exposed, and links to sources so you can look for yourself.
The Big Ones

Tests that became famous

These are the tests that broke through from niche AI circles into mainstream internet culture. Each one exposed something real.

Image Generation

When AI can't draw what it sees

Image generators can be impressive at landscapes, abstract art, and stylized scenes. Certain subjects remain useful stress tests.

The Hands & Fingers Problem

Six, seven, sometimes eight fingers per hand
Much improved on standard poses • Less consistent on unusual angles • Image generation • 2022-2023 peak
Hands became a common informal check for AI-generated images because earlier systems often produced extra, merged, or misplaced fingers. Complex hand-object interactions remain more difficult than simple portraits.

Why it happens: Hands have many possible poses, are often partly hidden, and require precise relationships between anatomy and nearby objects. Newer image systems have improved substantially, but there is no single standardized “hand accuracy” percentage that applies across prompts and models.
Why it is difficult
Occlusion, varied poses, small anatomical details, and precise hand-object relationships
Related tests
Teeth, ears, arm positions, feet, and interactions between multiple subjects
Evaluation note
Judge a repeatable prompt set, not one selected success or failure
Current status
Standard poses are much improved. Complex interactions and unusual viewpoints remain less consistent.

🔤 Text-in-Image Rendering

When "OPEN" becomes "OEPN HOPRS"
Much improved for short text • Image generation • 2022-2024
Early image generators often produced text-like shapes instead of readable words on signs, covers, and labels. This made text rendering an easy visual stress test.

Why: Image generation must coordinate visual style with exact character sequences and layout. Systems including DALL-E 3 and later Midjourney versions showed clear improvements, while long phrases, unusual fonts, and small print remain harder.
Root cause
Diffusion models learn visual patterns, not symbolic meaning of characters
Current status
Short text is much improved. Long phrases, stylized fonts, and small print remain less consistent.

🐦 Pelican Riding a Bicycle

Simon Willison's beloved SVG benchmark
Steadily improving • Code generation / spatial reasoning • Est. late 2024
Developer Simon Willison asks every new LLM the same thing: "Generate an SVG of a pelican riding a bicycle." It started as a joke but became genuinely useful. The task requires understanding what both a pelican and a bicycle look like, figuring out how they'd combine, and writing precise SVG code to draw it, all in a single attempt.

Results range from recognizable scenes to abstract compositions. Willison tracks outputs across models and has used side-by-side judging to compare them. Common inconsistencies include wheel orientation, subject identity, and how the pelican is positioned on the bicycle.
What it tests
Spatial reasoning, concept composition, SVG/code generation, visual understanding without vision
Why it's hard
Combines two distinct visual concepts into a coherent scene using code, not pixels
Evaluation note
Rendering and visual review matter because valid SVG code can still depict the requested scene poorly
Current status
Results continue to improve, but model ordering changes and depends on the judging setup.
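
The evaluation note above can be made concrete with a small check. This is a minimal sketch, not Willison's own tooling: it parses a hypothetical model output with Python's standard XML parser and counts basic shape elements. The sample SVG and the `summarize_svg` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
SHAPE_TAGS = {"circle", "ellipse", "line", "path", "polygon", "polyline", "rect"}


def summarize_svg(svg_text: str) -> dict:
    """Parse SVG text and count its drawing elements.

    Raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
    """
    root = ET.fromstring(svg_text)
    counts: dict[str, int] = {}
    for element in root.iter():
        # Strip the SVG namespace so "{...}circle" becomes "circle".
        tag = element.tag.removeprefix(SVG_NS)
        if tag in SHAPE_TAGS:
            counts[tag] = counts.get(tag, 0) + 1
    return {"is_svg_root": root.tag == SVG_NS + "svg", "shapes": counts}


# Hypothetical model output: two wheels and a blob. Well-formed, but not a pelican.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 60">'
    '<circle cx="25" cy="45" r="12"/><circle cx="75" cy="45" r="12"/>'
    '<ellipse cx="50" cy="20" rx="14" ry="9"/>'
    "</svg>"
)
print(summarize_svg(sample))
```

A check like this only confirms that the file parses and contains shapes. Whether those shapes read as a pelican on a bicycle still takes a rendered image and a human (or a vision model) to judge.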

🦷 The Teeth Problem

Uncanny grins with 40+ teeth
Much improved • Image generation • 2022-2023
Like fingers, teeth are small, repetitive structures. Earlier image systems often produced inconsistent tooth counts or shapes. Newer systems have improved substantially, though close-up smiles remain a useful detail check.
Language & Reasoning

Simple questions that shouldn't be hard

Small language tasks that reveal how model behavior changes with tokenization, prompting, and reasoning strategy.

🔄 Spell It Backwards

String reversal and character manipulation

"Spell 'banana' backwards." This shares a root cause with the well-known "how many r's in strawberry" question: models read text as tokens (multi-character chunks), so they don't directly see individual characters. Chain-of-thought helps, but native ability remains inconsistent. Try asking a model to reverse a long, unfamiliar word.

Improving
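
The gap between characters and tokens can be sketched in a few lines of Python. The token split below is invented for illustration. Real tokenizers split words differently, and the split varies by model.

```python
def reverse_characters(word: str) -> str:
    """Reverse a word one character at a time (the answer the test expects)."""
    return word[::-1]


def reverse_token_order(tokens: list[str]) -> str:
    """Reverse the order of multi-character chunks without looking inside them."""
    return "".join(reversed(tokens))


# Hypothetical token split, chosen only for illustration.
hypothetical_tokens = ["ban", "ana"]

print(reverse_characters("banana"))              # ananab
print(reverse_token_order(hypothetical_tokens))  # anaban
```

The second result is what you get if you can only shuffle chunks: close to right, but wrong. Spelling the word out letter by letter first, as chain-of-thought prompting encourages, makes the first approach available.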

🔢 Count the Objects

Exact counting in images and text

"How many dogs are in this picture?" Models tend to estimate rather than keep an explicit tally. Results are usually fine for 2-3 objects and degrade quickly as the count grows. Related to letter-counting questions such as counting the r's in "strawberry."

Improving slowly

📋 Spatial Reasoning

"Put the red ball to the LEFT of the blue cube"

Image generators struggle with relative positioning. Relations such as "left of," "behind," and "between" are often captured poorly in text embeddings. Compositional image benchmarks such as T2I-CompBench exist to test this formally.

Improving

🎲 Simple Logic Puzzles

"A is taller than B. B is taller than C. Who is shortest?"

Models can solve these with chain-of-thought, but without it, they often guess wrong. Multi-step relational reasoning without explicit step-by-step prompting remains fragile, especially with more than 3-4 entities.

Improving
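
For contrast, here is how a conventional program handles the same kind of puzzle. This is a minimal sketch using Python's standard `graphlib` module. The `order_tallest_to_shortest` helper is invented for this page and assumes the facts are consistent and form a single chain.

```python
from graphlib import TopologicalSorter


def order_tallest_to_shortest(facts: list[tuple[str, str]]) -> list[str]:
    """Order people from tallest to shortest.

    Each fact is a (taller, shorter) pair. Contradictory facts raise
    graphlib.CycleError. If the facts leave gaps, more than one order
    may be valid and only one of them is returned.
    """
    sorter = TopologicalSorter()
    for taller, shorter in facts:
        # Declare that `taller` must come before `shorter` in the output.
        sorter.add(shorter, taller)
    return list(sorter.static_order())


facts = [("A", "B"), ("B", "C")]
ranking = order_tallest_to_shortest(facts)
print(ranking[-1])  # C
```

The program never guesses: it either finds an order or reports a contradiction. Step-by-step prompting nudges a model toward the same explicit bookkeeping.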

🧠 The 9.11 vs 9.9 Problem

"Which is larger, 9.11 or 9.9?"

Some models say 9.11 because "11 > 9". They're pattern-matching the digits after the decimal point rather than understanding decimal place value. A revealing test of whether a model actually "understands" numbers or just manipulates symbols.

Mostly fixed in frontier models
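
The two readings of "9.11 vs 9.9" are easy to show side by side. This sketch assumes nothing about how any model works internally. It only contrasts place-value comparison with the version-number style comparison that produces the "11 > 9" answer. Both helper names are invented for illustration.

```python
from decimal import Decimal


def larger_as_numbers(a: str, b: str) -> str:
    """Compare two strings as decimal numbers, respecting place value."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_dotted_parts(a: str, b: str) -> str:
    """Compare the way version numbers are compared: each dot-separated
    part is read as a whole integer. This reproduces the "11 > 9" mistake."""
    def parts(text: str) -> tuple[int, ...]:
        return tuple(int(piece) for piece in text.split("."))

    return a if parts(a) > parts(b) else b


print(larger_as_numbers("9.11", "9.9"))       # 9.9
print(larger_as_dotted_parts("9.11", "9.9"))  # 9.11
```

Neither function is broken. They answer different questions, which is why the prompt is a useful probe of which reading a model defaults to.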

💬 Pronoun Resolution

"The trophy doesn't fit in the suitcase because IT is too big. What is too big?"

Classic Winograd schema problems. The answer depends on common-sense physical reasoning (the trophy is too big, not the suitcase). Models have gotten much better at these, but ambiguous pronouns in longer passages still cause errors.

Strong on standard examples
Game Benchmarks

Games as testing grounds

Researchers keep turning to games because they're controlled environments that test planning, creativity, and adaptation, and the results are easy for anyone to understand.

⛏ MC-Bench (Minecraft)

Built by a 16-year-old developer

AI models write code to build structures in Minecraft: cottages, castles, snowmen. Tests spatial reasoning, creativity, and code execution in a 3D environment. Created by Adonis Singh, who said the leaderboard "aligns closely with my own experience of using these models."

Active • Spatial + code

🎨 LLM Pictionary

One AI draws, another guesses

Developer Paul Calcraft built a platform where two AI models play Pictionary. One doodles, the other guesses. Tests spatial understanding, concept communication, and strategic thinking. Designed to be "un-gameable" by memorization.

Active • Visual + strategic

🎮 ARC-AGI-3

Humans solve the environments; frontier AI remained below 1% in March 2026 testing

Handcrafted interactive environments provide no task instructions or stated goals. Agents must explore, infer how each world works, and learn to succeed. The ARC-AGI-3 technical report says frontier systems scored below 1% as of March 2026; exact leaderboard numbers depend on the dated model and scoring snapshot.

Open challenge • $2M prize program

♟ AI Connect 4 / Board Games

Strategic thinking in simple games

A British programmer created platforms where AI models play Connect 4, Pictionary, and other simple games against each other. Tests strategic decision-making in constrained environments. Results are often surprising: raw intelligence doesn't guarantee good gameplay.

Active
Video & Motion

Beyond spaghetti: other motion tests

Video generators, audio systems, and coding models can all look convincing briefly. These tests examine whether motion stays coherent when physics, timing, and interaction matter.

[Figure: project example of a red ball bouncing inside a rotating white hexagon]

⬢ The Rotating Hexagon Test

A compact coding prompt with surprisingly unforgiving physics
Mostly solved in short/simple cases • Code generation • Physics simulation

The prompt sounds simple: draw a rotating hexagon, place a ball inside it, and make the ball respond to gravity, friction, and the walls. It is a popular informal test because one small animation exercises geometry, animation, and debugging at the same time.

Where it came from: bouncing-ball simulations are a classic programming exercise. This spinning-hexagon version was posted by developer Flavio Adamo on January 31, 2025 as a comparison between AI coding models.

Why it is complicated: the wall is moving while the ball collides with it. Correct code must find the nearest edge, resolve penetration, calculate the wall's local velocity from the hexagon's rotation, and reflect the ball relative to that moving surface. Corners, high speeds, and varying frame rates can produce tunneling, jitter, energy gain, or a ball that escapes.

What it tests
Canvas or graphics APIs, vector math, collision detection, animation loops, and instruction following
Easy to fake
A clip may look convincing briefly while using unstable or physically incorrect collision handling
Hard mode
Resize the scene, vary frame rate, increase rotation speed, and run long enough to expose drift or escapes
Current status
Common coding models often make a workable demo, but robust physics and consistent behavior still depend on the prompt and implementation
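
The moving-wall step described above is where many generated demos go wrong, so here is a minimal sketch of just that piece. It is not Flavio Adamo's code or any model's output. It assumes a 2D scene, counterclockwise rotation, a unit normal pointing into the hexagon, and no friction. The function names and parameters are invented for illustration.

```python
Vec = tuple[float, float]


def wall_velocity(contact: Vec, center: Vec, omega: float) -> Vec:
    """Velocity of the wall at the contact point, for a hexagon spinning
    counterclockwise at `omega` radians per second around `center`."""
    rx, ry = contact[0] - center[0], contact[1] - center[1]
    # A point at offset r on a rotating body moves at omega * (-r.y, r.x).
    return (-omega * ry, omega * rx)


def bounce(velocity: Vec, normal: Vec, wall_vel: Vec, restitution: float = 1.0) -> Vec:
    """Reflect the ball's velocity off a moving wall.

    `normal` is the unit normal of the edge, pointing into the hexagon.
    `restitution` is 1.0 for a perfectly elastic bounce, lower to lose energy.
    """
    # Switch to the wall's frame of reference, where the wall is at rest.
    rel = (velocity[0] - wall_vel[0], velocity[1] - wall_vel[1])
    approach = rel[0] * normal[0] + rel[1] * normal[1]
    if approach >= 0:
        return velocity  # already moving away from the wall; nothing to do
    # Flip the normal component of the relative velocity.
    scale = (1.0 + restitution) * approach
    rel = (rel[0] - scale * normal[0], rel[1] - scale * normal[1])
    # Switch back to the world frame.
    return (rel[0] + wall_vel[0], rel[1] + wall_vel[1])


# A ball at rest touching the bottom edge, slightly right of centre.
push = wall_velocity(contact=(0.5, -1.0), center=(0.0, 0.0), omega=1.0)
print(bounce(velocity=(0.0, 0.0), normal=(0.0, 1.0), wall_vel=push))
```

In the example, the ball starts at rest and still leaves with upward speed, because the rotating edge is moving into it. Code that reflects against a stationary wall would leave the ball motionless, which is one reason a clip can look plausible while the physics is wrong. A fuller version would also find the nearest edge, push the ball back out of the wall, and handle corners.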

🎥 Physics Violations

Water flowing upward, objects passing through each other

Generated videos still routinely break basic physics: liquid ignoring gravity, solid objects clipping through each other, cloth that doesn't drape properly. These are hard to detect in still frames but obvious in motion.

Improving

👤 Identity Preservation

The face that drifts between frames

Early video models often changed facial identity across frames. Temporal consistency has improved substantially since 2024, but profile turns and quick movements can still cause drift.

Major progress

🎤 Audio-Visual Sync

2025: the year AI video "left the silent film era"

Until 2025, most AI video had no audio at all. Veo 3 introduced native audio generation, but sync remains imperfect (the infamous "crunchy spaghetti" sound). Lip sync during speech is advancing fast but still detectable as uncanny in most cases.

New frontier

📸 Background Character Behavior

The extras who don't quite act human

While main subjects can look convincing in short clips, background characters may loop, freeze, or perform inconsistent actions. This makes background motion a useful repeatability test.

Still obvious

🎧 Noisy-Scene Reasoning

Hearing words is easier than understanding the room

Audio models can transcribe clean speech well, yet higher-order reasoning can collapse when speech overlaps with laughter, weather, music, or classroom noise. RSA-Bench was introduced in January 2026 to test this gap.

Still fragile

🎙 Voice-Cloning Robustness

A clean demo is not a deployment test

Change the microphone, language, reference noise, speaking length, or compression and voice similarity can degrade sharply. RVCBench evaluates these real-world shifts across the generation pipeline.

Improving fast
Hallucinations & Confabulation

When AI makes things up confidently

Fluent answers can still contain unsupported claims, mismatched sources, invented quotations, or incorrect citations.

📰 Citation Verification

Checking papers, URLs, quotations, and support

Models can produce plausible-looking paper titles, authors, URLs, quotations, and legal citations that do not exist or do not support the claim. Treat every citation as a lead to verify, not proof by itself.

Getting better but not solved

🥤 18,000 Water Cups

A widely reported drive-through edge case

In August 2025, widely shared footage showed a drive-through voice system processing a request for 18,000 water cups. As a test case, it asks whether an ordering agent can detect unusual quantities, manage correction loops, and hand off cleanly to a person.

Still fragile • Real-world test
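
The three behaviors this case probes (spotting unusual quantities, limiting correction loops, and handing off to a person) can be sketched as a small guard. Nothing here describes the system in the footage. The thresholds, names, and return values are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical limits; a real deployment would tune these per item and store.
CONFIRM_ABOVE = 10
HANDOFF_ABOVE = 50
MAX_CORRECTION_ATTEMPTS = 2


@dataclass
class OrderLine:
    item: str
    quantity: int


def next_action(line: OrderLine, failed_corrections: int = 0) -> str:
    """Return "accept", "confirm", or "handoff" for one order line."""
    if failed_corrections > MAX_CORRECTION_ATTEMPTS:
        return "handoff"  # stop looping and bring in a person
    if line.quantity <= 0 or line.quantity > HANDOFF_ABOVE:
        return "handoff"  # implausible quantity; do not try to fulfil it
    if line.quantity > CONFIRM_ABOVE:
        return "confirm"  # read the quantity back before accepting
    return "accept"


print(next_action(OrderLine("water cup", 18_000)))  # handoff
```

The point of the sketch is that the check sits outside the language model. A plain rule can catch the 18,000-cup request even when the conversation itself sounds perfectly fluent.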

🔍 Source Verification

Can the answer survive a citation audit?

Ask for primary sources, publication dates, direct support for each claim, and a clear statement of uncertainty. The test is whether every important sentence can be traced back to evidence without invented or mismatched citations.

Still fragile

📈 Confident Incorrectness

Fluency is not the same as certainty

Fluent wording is not calibrated confidence. A model can present a false answer in the same polished style as a correct one, so important claims still need source checks and uncertainty should be evaluated separately from tone.

Structural issue
The Serious Stuff

Formal benchmarks, briefly

The academic and industry-standard tests that labs use to compare models. Listed here for reference, not as endorsements of any particular score or model.

ARC-AGI-1 / 2

Reasoning • Visual puzzles

Grid-based pattern puzzles testing fluid intelligence. ARC-AGI-2 adds efficiency metrics and contamination controls. Designed by François Chollet.

Humanity's Last Exam

Multi-domain • 2,500 expert-level questions

Expert-crafted questions across dozens of subjects. Scores move quickly: Google reported 48.4% for Gemini 3.1 Deep Think in February 2026 under its stated no-tools setting.

FrontierMath

Mathematics • Research-level

Research-level math from Epoch AI. Includes an "Open Problems" split testing unsolved conjectures. One of the few math benchmarks that hasn't saturated.

SWE-bench Verified

Coding • Real GitHub issues

Models resolve real bug reports from popular Python repositories end-to-end. It is a widely used coding-agent benchmark, and reported scores have risen dramatically since 2023.

GPQA Diamond

Science • 198 PhD-level questions

Graduate-level science questions designed to be unsearchable. PhD experts score about 65%. Some AI models have reportedly surpassed this on published leaderboards.

Chatbot Arena (LMSYS)

Human preference • Blind comparisons

Users choose between anonymous model responses in an Elo-style ranking. It measures preference and satisfaction, not necessarily factual correctness.
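
For readers unfamiliar with Elo-style ranking, here is the classic update rule as a minimal sketch. It shows the general idea only. The leaderboard's actual statistical method may differ, and the starting ratings and the `k` value of 32 are assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return both ratings after one comparison.

    `score_a` is 1.0 if A's response was preferred, 0.0 if B's was,
    and 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start level; a user prefers model A's answer.
print(update(1000, 1000, score_a=1.0))  # (1016.0, 984.0)
```

Notice that the only input is which response the user preferred. Nothing in the update checks whether the preferred answer was true, which is why a preference ranking is not a factual-accuracy ranking.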

MMLU / MMLU-Pro

Knowledge • 57+ subjects

A longstanding general-knowledge benchmark on which frontier scores have become high. MMLU-Pro uses harder questions with 10 answer options instead of 4.

GAIA

Agentic • Multi-step real-world tasks

Tests whether AI can complete multi-step tasks autonomously: browsing the web, using tools, combining information across sources.

LiveCodeBench

Coding • Continuously updated

Fresh competitive programming problems harvested from LeetCode, AtCoder, and Codeforces. Designed to be contamination-resistant: a model is scored on problems published after its training data was collected, so it cannot have seen them during training.
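
The contamination idea comes down to a date filter. This is a minimal sketch of that idea, not LiveCodeBench's own code. The problem titles, release dates, and training cutoff are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    title: str
    released: date


def released_after_cutoff(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Invented problems and cutoff, for illustration only.
pool = [
    Problem("two-pointer warmup", date(2024, 3, 1)),
    Problem("interval scheduling", date(2025, 1, 15)),
    Problem("tree rerooting", date(2025, 6, 30)),
]
fresh = released_after_cutoff(pool, training_cutoff=date(2024, 12, 31))
print([p.title for p in fresh])  # ['interval scheduling', 'tree rerooting']
```

Because each model has its own cutoff, two models may be scored on different slices of the same pool, which is worth keeping in mind when comparing their numbers.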

Stanford HELM

Safety • 7 dimensions, 42 scenarios

A broad evaluation framework covering accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.

LiveBench

Knowledge • Anti-contamination

Uses regularly refreshed questions to reduce memorization and training-data contamination.

MLPerf

Hardware • Training + inference speed

Industry standard for benchmarking hardware and ML frameworks. Compares GPUs, TPUs, and software stacks on standardized workloads.

Terminal-Bench 2.0

Agents • Real terminal workflows

Tests whether agents can inspect files, use command-line tools, edit code, recover from errors, and finish multi-step work in containerized environments.

BrowseComp

Research agents • 1,266 questions

Measures persistent, strategic web research using questions whose answers are hard to find but relatively easy to verify once located.

OSWorld

Computer use • Desktop workflows

Evaluates multimodal agents on practical tasks across operating systems and desktop applications, including navigation, editing, and multi-step completion.

TruthfulQA

Factuality • Misconception resistance

Tests whether a model gives truthful, informative answers instead of imitating common human falsehoods and popular misconceptions.

SWE-Bench Pro

Coding agents • Longer repository tasks

A harder software-engineering benchmark with diverse repositories, multi-file changes, and contamination controls.

APEX-Agents

Professional agents • Cross-application work

Long-horizon tasks created by professionals in investment banking, consulting, and corporate law.

τ²-Bench

Tool use • User-agent coordination

Tests conversational agents in shared environments where both the user and agent must take actions and coordinate.

MCP-Atlas

Tool use • Real MCP servers

Evaluates multi-step tool use across 36 real Model Context Protocol servers and 220 tools.

MMMU-Pro

Multimodal • Expert visual reasoning

A harder MMMU variant designed to remove text-only shortcuts and require integrated visual and textual understanding.

MRCR

Long context • Multiple hidden needles

Tests whether a model can distinguish and retrieve the correct repeated item from a very long multi-turn context.

BALROG

Agents • Long-horizon games

Measures planning, exploration, spatial reasoning, and adaptation across game environments of varying complexity.

TheAgentCompany

Workplace agents • Simulated company

Agents browse internal sites, write code, run programs, and communicate with simulated coworkers in a self-contained workplace.

Video-MME

Video understanding • Multimodal reasoning

Evaluates perception and reasoning over videos across duration, domain, and modality settings.

EVMbench

Security agents • Smart contracts

Tests whether agents can detect, patch, and exploit curated high-severity smart-contract vulnerabilities in controlled environments.

June 2026 snapshot: which models posted strong results?

There is no universal pass mark. This compact view reproduces selected published figures from Google DeepMind's 2026 Gemini comparison page. Model names, tool settings, reasoning levels, and harnesses matter. A high score means strong performance on that specific setup, not that the underlying capability is completely solved.

Benchmark (setting) | Gemini 3.1 Pro | Claude Sonnet 4.6 | Claude Opus 4.6 | GPT published comparison | Museum reading
Humanity's Last Exam (no tools) | 44.4% | 33.2% | 40.0% | GPT-5.2: 34.5% | Partial; no model is near complete coverage
ARC-AGI-2 (ARC Prize verified) | 77.1% | 58.3% | 68.8% | GPT-5.2: 52.9% | Strong progress, still setup-specific
GPQA Diamond (no tools) | 94.3% | 89.9% | 91.3% | GPT-5.2: 92.4% | Very high published scores
Terminal-Bench 2.0 (Terminus-2 harness) | 68.5% | 59.1% | 65.4% | GPT-5.3-Codex: 64.7% | Capable, but many tasks remain
SWE-Bench Pro Public (single attempt) | 54.2% | - | - | GPT-5.3-Codex: 56.8% | Partial on harder coding work
APEX-Agents | 33.5% | - | 29.8% | GPT-5.2: 23.0% | Long-horizon professional work remains open
τ²-Bench (retail / telecom) | 90.8% / 99.3% | 91.7% / 97.9% | 91.9% / 99.3% | GPT-5.2: 82.0% / 98.7% | Strong in these defined domains
MCP-Atlas | 69.2% | 61.3% | 59.5% | GPT-5.2: 60.6% | Useful tool skill, not universal reliability
BrowseComp (search + Python + browse) | 85.9% | 74.7% | 84.0% | GPT-5.2: 65.8% | Strong published browsing results
MMMU-Pro (no tools) | 80.5% | 74.5% | 73.9% | GPT-5.2: 79.5% | Strong, with meaningful remaining errors

Source and methodology: Google DeepMind's Gemini 3 performance table. Accessed June 5, 2026. Cross-provider figures are shown as reproduced by that source; this museum did not independently rerun them.