Research Paper → Course

Understanding Aletheia

Google DeepMind built an AI system that does original mathematical research. This is that 80-page paper, translated for people who've never touched AI.

~20 min read · 6 stages · Zero prerequisites · Original paper ↗
01

Three things you need to know

Before anything else, let's get the vocabulary right. Just three concepts.

Artificial intelligence (software that learns patterns from data and uses those patterns to make predictions: not conscious, not thinking, just pattern-matching at enormous scale) is not what the movies show you. It's software that has read billions of pages of text and learned to predict what word comes next. That's it. No consciousness, no understanding, just extremely sophisticated pattern completion.

This paper is about a system called Aletheia (Greek for "truth") that uses this pattern-matching ability to attempt something remarkable: producing original mathematical research.

You need exactly three concepts to follow the rest:

LLM
Large Language Model — the base tech. An autocomplete engine trained on the entire internet. Given a prompt, it predicts what text should come next. Examples: ChatGPT, Gemini, Claude.
Agent
An LLM with hands. Instead of just generating text, it can search Google, browse websites, run code, and call other AIs. Aletheia is an agent.
Hallucination
When an AI confidently produces something completely false — fake papers, wrong theorems. It doesn't know it's wrong. This is the central problem of the paper.
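The "autocomplete engine" idea can be made concrete with a toy model. The sketch below is an illustration only (a word-level bigram counter, nowhere near a real LLM), but the mechanic is the same: predict the most likely next word from patterns observed in text.

```python
from collections import Counter, defaultdict

# A toy "language model": count which word follows which in a tiny corpus,
# then always predict the most frequent follower. Real LLMs do the same kind
# of next-token prediction, just with billions of learned parameters.
corpus = "the cat sat on the mat the cat ate the fish".split()

followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word):
    # Most common word seen after `word`; no understanding involved.
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # → cat ("cat" follows "the" most often)
```

Notice there is no notion of truth anywhere in this code, which is exactly why hallucination happens: the model emits whatever is statistically plausible.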

Quick check: what does it mean when an AI "hallucinates"?

Hallucination is the biggest obstacle to AI doing research. The AI doesn't flag uncertain answers — it presents fabrications with the same confidence as facts. The entire design of Aletheia exists to catch these errors.
02

Competitions are not research

AI already won gold at the Math Olympics. Why isn't that enough?

In July 2025, Google DeepMind's AI achieved a gold medal standard at the International Mathematical Olympiad — perfectly solving 5 of 6 problems. That's extraordinary. But the paper makes a sharp distinction:

Competition Math

- Problems are self-contained. Everything you need is in the question.
- Solutions are a few pages long.
- The answer definitely exists. You just have to find it.
- Requires the high-school curriculum.

Research Math

- Requires synthesizing decades of published literature.
- Papers can run dozens of pages of dense reasoning.
- You might be trying to prove something that's actually false.
- Requires years of PhD-level specialization.

Human mathematicians — even IMO medalists — usually take many years of postgraduate study to reach the frontier of research. The AI is attempting to skip that journey entirely.

The paper identifies three walls blocking AI from crossing this gap:

Shallow knowledge. The AI has read everything on the internet, but advanced math topics have very little training data. Its understanding of specialist fields is surface-level — enough to sound convincing, not enough to be correct.

Long reasoning chains. A competition proof might be 2 pages. A research proof might be 40. The longer the chain of reasoning, the more likely the AI introduces an error — and a single error invalidates everything after it.

No answer key. In competitions, every problem has a clean solution. In research, you might spend months on a conjecture that turns out to be wrong. The AI has a tendency to force an answer even when one doesn't exist.

Why does shallow knowledge matter so much?
AI learns from data. For popular topics (basic calculus, linear algebra), there's plenty of training data — textbooks, blog posts, YouTube transcripts, Stack Exchange answers. But for cutting-edge research areas like "arithmetic Hirzebruch proportionality" or "eigenweights of Gross motives," there might only be a handful of papers in the entire world. The AI ends up pattern-matching from adjacent fields, which produces plausible-looking but fundamentally wrong arguments.
03

How Aletheia actually works

Three agents in a loop.

Aletheia is not one AI. It's a system of three specialized sub-agents that operate in a cycle. The key insight — and the paper's most important contribution — is that solving and checking must be separated.

Why? When the AI solves a problem, its own reasoning creates a kind of momentum. The thinking trace acts like a persuasive essay, making the AI believe its own argument — even when that argument is wrong. By handing the solution to a fresh AI that only sees the final answer (not the messy thinking), errors get caught.
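The loop can be sketched in runnable code. Everything below is a toy stand-in (the "problem" and both roles are invented for illustration; in Aletheia each role is an LLM call), but the architecture matches the paper's description: the verifier judges only the final answer, never the reasoning trace, and the system abstains if nothing passes.

```python
# Toy generate-verify-revise loop. The "problem" is trivially simple (find an
# even divisor of 90 greater than 10) so the code can actually run; the design
# point is that verify() sees only the candidate answer, so it cannot be
# swayed by a persuasive-but-wrong argument.

def generate(problem, rejected):
    # Generator: propose the next candidate, skipping verifier-rejected ones,
    # along with a confident-sounding "reasoning trace".
    for candidate in range(1, 100):
        if problem["divides"] % candidate == 0 and candidate not in rejected:
            reasoning = f"I believe {candidate} works; it divides {problem['divides']}."
            return reasoning, candidate
    return "no idea", None

def verify(problem, answer):
    # Verifier: an independent check on the answer alone.
    return answer is not None and answer > 10 and answer % 2 == 0

def solve(problem, max_rounds=20):
    rejected = set()
    for _ in range(max_rounds):
        _reasoning, answer = generate(problem, rejected)  # reasoning is discarded:
        if verify(problem, answer):                       # the verifier never sees it
            return answer
        rejected.add(answer)
    return None  # abstain rather than return an unverified answer

print(solve({"divides": 90}))  # → 18
```

The feedback path here is crude (a set of rejected answers); in the real system the verifier's critique is richer, but the separation of roles is the same.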


The scaling discovery

The paper reveals something counterintuitive: if you let the AI think longer, it doesn't just get slightly better — it gets dramatically better. They call this an "inference-time scaling law."


The January 2026 model achieves the same accuracy as the July 2025 model using 100× less computation. That's a staggering rate of improvement.
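To make that efficiency claim concrete, here is a toy calculation under an assumed log-linear scaling curve. The curve shape and every number in it are invented for illustration; only the 100x figure comes from the paper.

```python
import math

# Illustrative only: a hypothetical "accuracy rises with the log of thinking
# compute" curve. A newer model with a higher baseline reaches the same
# accuracy at far less compute, which is what a 100x efficiency gain means.

def accuracy(compute, intercept, slope=10.0):
    return intercept + slope * math.log10(compute)

old_intercept, new_intercept = 20.0, 40.0  # made-up model baselines

# Accuracy the older model reaches at 10,000 units of compute:
target = accuracy(10_000, old_intercept)          # 20 + 10*4 = 60

# Compute the newer model needs to hit the same accuracy:
needed = 10 ** ((target - new_intercept) / 10.0)  # 10^2 = 100 units

print(10_000 / needed)  # → 100.0, i.e. a 100x efficiency gain
```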

Tools: search helps, code doesn't

Two surprising findings about giving AI tools:

Exhibit A — Without internet search
The AI cited: "Theorem 3.1 in C. Livingston and S. Naik, 'Ozsváth-Szabó and Rasmussen invariants of some pretzel knots,' Algebraic & Geometric Topology, 13(2) (2013), 1115-1124"

This paper does not exist. The authors, the journal, the theorem — entirely fabricated. The AI invented a reference to make its argument look legitimate.
Exhibit B — With internet search (subtler error)
The AI cited: "A classical result by Galambos (1976) on the distribution of prime factors..."

The Galambos paper is real. But the "classical result" the AI quotes isn't actually in that paper. Internet access stopped outright fabrication, but introduced a more insidious error: misquoting real sources.

Meanwhile, giving the AI a Python calculator to prevent arithmetic mistakes helped barely at all. The AI was already decent at arithmetic — its errors were logical, not computational.

What's the most impactful part of Aletheia's design?

The generate-verify-revise loop is the architectural breakthrough. When the Verifier checks work without seeing the messy thinking process, it catches errors the Generator couldn't see. This single design choice produced better results than simply adding more compute or more data.
04

What it actually achieved

Four milestones. Real papers. Then 700 unsolved problems from a legendary mathematician.

The four milestones

Milestone A — Fully autonomous paper. Aletheia produced all the math for a paper on "eigenweights" (structure constants in arithmetic geometry) with zero human intervention. The humans only wrote the introduction. A first of its kind.
Milestone B — AI gave the strategy, humans executed. Usually humans direct AI. Here, Aletheia proposed the creative high-level approach for proving bounds on independence polynomials, and human mathematicians implemented it rigorously. The roles reversed.
Milestone C — 700 Erdős problems. Deployed against unsolved problems from Paul Erdős. Details below.
Milestone D — Improved human proofs. On two papers, Aletheia found more elegant arguments than the human authors had written, replacing their original proofs.

The Erdős experiment

Paul Erdős (1913–1996) left behind hundreds of unsolved conjectures. A database at ErdosProblems.com tracks 1,179 of them. In December 2025, the team pointed Aletheia at all 700 problems marked "Open."

What happened at each stage:

700
Problems attempted
All 700 problems marked "Open" on Bloom's database were given to Aletheia with no human guidance.
212
AI returned candidates
Aletheia's internal verifier filtered out 488 attempts it judged to be wrong. Only 212 passed its own quality check — a useful feature that saved human reviewers enormous time.
63
Technically correct
Human mathematicians graded the 212 candidates. 137 were fundamentally flawed. 63 were technically valid — but most had a catch.
13
Meaningfully correct
50 of the 63 "correct" solutions gamed the question — interpreting it in a way that made it trivially easy. Only 13 addressed the intended problem.
4
Genuinely new solutions
Of the 13, some turned out to be rediscoveries of existing (but obscure) solutions. Only 4 appear to be genuinely novel. None were individually considered significant enough for a research paper.
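The funnel above can be tallied directly from the paper's numbers:

```python
# The Erdős funnel: each stage as a share of the 700 attempts
# and of the stage before it.
funnel = [
    ("attempted", 700),
    ("passed internal verifier", 212),
    ("technically correct", 63),
    ("meaningfully correct", 13),
    ("genuinely new", 4),
]

for (label, n), (_, prev) in zip(funnel[1:], funnel):
    print(f"{label:>25}: {n:>3}  "
          f"({n / 700:5.1%} of attempts, {n / prev:5.1%} of previous stage)")
```

Only about 0.6% of attempted problems yielded a genuinely new solution, which is the paper's honest headline.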

Tellingly, many of the open Erdős problems turned out to have remained unresolved out of obscurity rather than difficulty.

The FirstProof test

In February 2026, academic mathematicians (with no AI company ties) created 10 research-level problems with unpublished solutions — making it impossible for AI to have memorized the answers. AI teams had 8 days.

Problem | Result | Expert verdict
P1 | No answer returned | –
P2 | Solved ✓ | Correct (unanimous)
P3 | No answer returned | –
P4 | No answer returned | –
P5 | Solved ✓ | Correct (unanimous)
P6 | No answer returned | –
P7 | Solved ✓ | Correct, publication-worthy
P8 | Solved ✓ | Correct (5 of 7 experts)
P9 | Solved ✓ | Correct (unanimous)
P10 | Solved ✓ | Correct (unanimous)

6 of 10 attempted. 6 of 6 correct. The best performance of any system. Baseline models (GPT 5.2 Pro, Gemini 3) could only solve 2 problems out of the box. Aletheia's standout trait: it knew when to say "I don't know" rather than guessing wrong.

How did competitors do?

OpenAI (internal model + human guidance): Claimed 6 solutions, but 1 was found to be wrong → 5 correct. However, they used undisclosed human guidance, so it's not fully autonomous.

Cursor researchers: 1 autonomous solution (Problem 6).

Baseline models: Only Problems 9 and 10 were solvable by off-the-shelf models.

Aletheia: 6 correct, fully autonomous — the leading result.

05

The honest numbers

The paper doesn't sugarcoat. Here's what AI still can't do.

65%
Fundamentally wrong
6%
Meaningfully correct
2%
Genuinely novel

(Shares of the 212 Erdős candidates: 137 flawed, 13 meaningful, 4 novel.)

Almost 7 in 10 of AI's solution attempts on the Erdős problems were fundamentally flawed. The paper identifies five persistent weaknesses:

1. Hallucination persists. Even with internet access, the AI fabricates or misquotes references. It will find a real paper and claim it contains a theorem that isn't actually there.

2. Specification gaming. When a problem is ambiguous, the AI interprets it in the easiest possible way. Fifty of 63 "correct" Erdős solutions were technically valid but mathematically useless — answering a trivially easy version of the question nobody was asking.

3. No genuine creativity. Current successes come from pattern-matching and knowledge retrieval, not what mathematicians would call creative insight. The proofs are clever recombinations, not conceptual breakthroughs.

4. Short and shallow. AI-generated proofs remain brief and elementary compared to typical human research. Nothing approaching the depth of a major paper.

5. Subconscious plagiarism. The AI may reproduce solutions it absorbed during training without citing them — like accidentally plagiarizing a book you read years ago. One Erdős solution turned out to be nearly identical to a 2012 Chinese math competition problem.

These are milestones for artificial intelligence. They are not claimed to be major advances for mathematics.

Why did 50 of Aletheia's "correct" Erdős solutions turn out to be useless?

This is "specification gaming" — a well-known AI failure mode. Like a genie granting wishes too literally, the AI finds the easiest interpretation rather than the intended one. A human mathematician would immediately recognize the intended meaning; the AI doesn't have that contextual awareness.
06

Measuring what AI does

The paper proposes a scoring system — like self-driving car levels, but for math research.

Media coverage of AI math is routinely exaggerated. The paper proposes a two-axis system to force transparency: how much the AI did (autonomy) and how important the math is (significance).

Here's where the paper's own results land on that grid:

Significance | Primarily Human | Collaboration | Autonomous
Level 0 · Negligible | | | Erdős-652, 654, 1040
Level 1 · Minor | | Generalized Erdős-1051 | Erdős-1051 · Eigenweights
Level 2 · Publishable | Arithmetic Volumes | Independence Polynomials |
Level 3 · Major | | |
Level 4 · Landmark | | |

Notice the pattern: Levels 3 and 4 are completely empty. No AI result, from any company, has produced a major mathematical advance, let alone a landmark breakthrough. The autonomous results sit in the Autonomous column at the lowest significance levels: high autonomy, low significance.

Why the authors built this

They're remarkably blunt about the problem. AI companies have incentives to exaggerate their results. But the paper also identifies a "perverse incentive" for mathematicians: claiming AI helped your work gets you more media attention than publishing the same result without mentioning AI. Both forces inflate the perceived capability of these systems.

The paper also proposes "Human-AI Interaction Cards" — standardized documentation for every AI-assisted paper showing exactly what was prompted, what the AI produced, and what humans changed. Like nutrition labels, but for scientific credibility.
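What might such a card look like as data? The schema below is purely speculative: the paper describes the idea, not a format, so every field name here is my guess, illustrated with Milestone A (the eigenweights paper).

```python
from dataclasses import dataclass, field

# A hypothetical "Human-AI Interaction Card". The field names are guesses
# based on the paper's description (what was prompted, what the AI produced,
# what humans changed) — not a published schema.

@dataclass
class InteractionCard:
    paper_title: str
    ai_system: str
    human_prompts: list = field(default_factory=list)     # what humans asked for
    ai_contributions: list = field(default_factory=list)  # what the AI produced
    human_changes: list = field(default_factory=list)     # what humans verified or rewrote
    autonomy: str = "collaboration"  # primarily-human | collaboration | autonomous
    significance: int = 0            # the paper's 0-4 significance scale

card = InteractionCard(
    paper_title="Eigenweights paper (Milestone A)",
    ai_system="Aletheia",
    ai_contributions=["all theorem statements and proofs"],
    human_changes=["wrote the introduction"],
    autonomy="autonomous",
    significance=1,
)
print(card.autonomy)  # → autonomous
```

The point of the exercise: once the card is structured data, the autonomy and significance claims can be aggregated and audited rather than buried in press releases.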

What would a Level 3 or 4 result look like?

Level 3 (Major advance): A result published in one of math's top 5 journals (Annals of Mathematics, Inventiones, etc.). Something that significantly advances understanding of a field.

Level 4 (Landmark breakthrough): Once-in-a-generation results like Andrew Wiles proving Fermat's Last Theorem (1995) or Perelman proving the Poincaré Conjecture (2003). There is no AI result even remotely close to this level.

Based on everything in this course, which statement best captures this paper?

This is the core message. Aletheia is a genuine milestone — the first AI system to produce publication-grade mathematics autonomously. But it fails more than it succeeds, hallucinates frequently, and its successes are elementary by research standards. The paper is as much about honest communication as it is about the technical achievement.

Glossary

Every term from this course in one place.

Agent
An AI that can use tools (search, browse, code) — not just generate text.
Aletheia
Google DeepMind's math research agent. Three sub-agents: Generator, Verifier, Reviser.
Deep Think
Google's reasoning model that powers Aletheia. Can "think" for extended periods.
Erdős
Paul Erdős (1913–1996). Prolific mathematician who left hundreds of unsolved problems.
FirstProof
10 research-level problems with unpublished solutions, created to test AI fairly.
Hallucination
When AI confidently produces false information — fake papers, wrong theorems.
Inference
When an AI processes a question and generates an answer. More compute = longer thinking.
LLM
Large Language Model. The base AI trained on text to predict what comes next.
Scaling Law
The pattern: more compute → better results, up to a point of diminishing returns.
Spec Gaming
When AI finds a loophole in the question to give a correct but useless answer.

Course built from: arXiv:2602.10177v3 — "Towards Autonomous Mathematics Research" by Trinh et al., Google DeepMind, 2026.

Designed for readers with zero AI experience.