Research Paper → Course

Understanding Aletheia

Google DeepMind built an AI system that does original mathematical research. This is that 80-page paper, translated for people who've never touched AI.

~20 min read · 6 stages · Zero prerequisites · Original paper ↗
01

Three things you need to know

Before anything else, let's get the vocabulary right. Just three concepts.

Artificial intelligence (software that learns patterns from data and uses those patterns to make predictions: not conscious, not thinking, just pattern-matching at enormous scale) is not what the movies show you. It's software that has read billions of pages of text and learned to predict what word comes next. That's it. No consciousness, no understanding, just extremely sophisticated pattern completion.

This paper is about a system called Aletheia (Greek for "truth") that uses this pattern-matching ability to attempt something remarkable: producing original mathematical research.

You need exactly three concepts to follow the rest:

LLM
Large Language Model — the base tech. An autocomplete engine trained on the entire internet. Given a prompt, it predicts what text should come next. Examples: ChatGPT, Gemini, Claude.
Agent
An LLM with hands. Instead of just generating text, it can search Google, browse websites, run code, and call other AIs. Aletheia is an agent.
Hallucination
When an AI confidently produces something completely false — fake papers, wrong theorems. It doesn't know it's wrong. This is the central problem of the paper.
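The "autocomplete engine" idea can be made concrete with a toy model. The sketch below is an illustration only (a word-level bigram counter, nowhere near a real LLM), but the mechanic is the same: predict the most likely next word from patterns observed in text.

```python
from collections import Counter, defaultdict

# A toy "language model": count which word follows which in a tiny corpus,
# then always predict the most frequent follower. Real LLMs do the same kind
# of next-token prediction, just with billions of learned parameters.
corpus = "the cat sat on the mat the cat ate the fish".split()

followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word):
    # Most common word seen after `word`; no understanding involved.
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # → cat ("cat" follows "the" most often)
```

Notice there is no notion of truth anywhere in this code, which is exactly why hallucination happens: the model emits whatever is statistically plausible.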

Quick check: what does it mean when an AI "hallucinates"?

Hallucination is the biggest obstacle to AI doing research. The AI doesn't flag uncertain answers — it presents fabrications with the same confidence as facts. The entire design of Aletheia exists to catch these errors.
02

Competitions are not research

AI already won gold at the Math Olympics. Why isn't that enough?

In July 2025, Google DeepMind's AI achieved a gold medal standard at the International Mathematical Olympiad — perfectly solving 5 of 6 problems. That's extraordinary. But the paper makes a sharp distinction:

Competition Math

- Problems are self-contained. Everything you need is in the question.
- Solutions are a few pages long.
- The answer definitely exists. You just have to find it.
- Requires the high-school curriculum.

Research Math

- Requires synthesizing decades of published literature.
- Papers can run dozens of pages of dense reasoning.
- You might be trying to prove something that's actually false.
- Requires years of PhD-level specialization.

Human mathematicians — even IMO medalists — usually take many years of postgraduate study to reach the frontier of research. The AI is attempting to skip that journey entirely.

The paper identifies three walls blocking AI from crossing this gap:

Shallow knowledge. The AI has read everything on the internet, but advanced math topics have very little training data. Its understanding of specialist fields is surface-level — enough to sound convincing, not enough to be correct.

Long reasoning chains. A competition proof might be 2 pages. A research proof might be 40. The longer the chain of reasoning, the more likely the AI introduces an error — and a single error invalidates everything after it.

No answer key. In competitions, every problem has a clean solution. In research, you might spend months on a conjecture that turns out to be wrong. The AI has a tendency to force an answer even when one doesn't exist.

Why does shallow knowledge matter so much?
AI learns from data. For popular topics (basic calculus, linear algebra), there's plenty of training data — textbooks, blog posts, YouTube transcripts, Stack Exchange answers. But for cutting-edge research areas like "arithmetic Hirzebruch proportionality" or "eigenweights of Gross motives," there might only be a handful of papers in the entire world. The AI ends up pattern-matching from adjacent fields, which produces plausible-looking but fundamentally wrong arguments.
03

How Aletheia actually works

Three agents in a loop.

Aletheia is not one AI. It's a system of three specialized sub-agents that operate in a cycle. The key insight — and the paper's most important contribution — is that solving and checking must be separated.

Why? When the AI solves a problem, its own reasoning creates a kind of momentum. The thinking trace acts like a persuasive essay, making the AI believe its own argument — even when that argument is wrong. By handing the solution to a fresh AI that only sees the final answer (not the messy thinking), errors get caught.
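The loop can be sketched in runnable code. Everything below is a toy stand-in (the "problem" and both roles are invented for illustration; in Aletheia each role is an LLM call), but the architecture matches the paper's description: the verifier judges only the final answer, never the reasoning trace, and the system abstains if nothing passes.

```python
# Toy generate-verify-revise loop. The "problem" is trivially simple (find an
# even divisor of 90 greater than 10) so the code can actually run; the design
# point is that verify() sees only the candidate answer, so it cannot be
# swayed by a persuasive-but-wrong argument.

def generate(problem, rejected):
    # Generator: propose the next candidate, skipping verifier-rejected ones,
    # along with a confident-sounding "reasoning trace".
    for candidate in range(1, 100):
        if problem["divides"] % candidate == 0 and candidate not in rejected:
            reasoning = f"I believe {candidate} works; it divides {problem['divides']}."
            return reasoning, candidate
    return "no idea", None

def verify(problem, answer):
    # Verifier: an independent check on the answer alone.
    return answer is not None and answer > 10 and answer % 2 == 0

def solve(problem, max_rounds=20):
    rejected = set()
    for _ in range(max_rounds):
        _reasoning, answer = generate(problem, rejected)  # reasoning is discarded:
        if verify(problem, answer):                       # the verifier never sees it
            return answer
        rejected.add(answer)
    return None  # abstain rather than return an unverified answer

print(solve({"divides": 90}))  # → 18
```

The feedback path here is crude (a set of rejected answers); in the real system the verifier's critique is richer, but the separation of roles is the same.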


The scaling discovery

The paper reveals something counterintuitive: if you let the AI think longer, it doesn't just get slightly better — it gets dramatically better. They call this an "inference-time scaling law."


The January 2026 model achieves the same accuracy as the July 2025 model using 100× less computation. That's a staggering rate of improvement.
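To make that efficiency claim concrete, here is a toy calculation under an assumed log-linear scaling curve. The curve shape and every number in it are invented for illustration; only the 100x figure comes from the paper.

```python
import math

# Illustrative only: a hypothetical "accuracy rises with the log of thinking
# compute" curve. A newer model with a higher baseline reaches the same
# accuracy at far less compute, which is what a 100x efficiency gain means.

def accuracy(compute, intercept, slope=10.0):
    return intercept + slope * math.log10(compute)

old_intercept, new_intercept = 20.0, 40.0  # made-up model baselines

# Accuracy the older model reaches at 10,000 units of compute:
target = accuracy(10_000, old_intercept)          # 20 + 10*4 = 60

# Compute the newer model needs to hit the same accuracy:
needed = 10 ** ((target - new_intercept) / 10.0)  # 10^2 = 100 units

print(10_000 / needed)  # → 100.0, i.e. a 100x efficiency gain
```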

Tools: search helps, code doesn't

Two surprising findings about giving AI tools:

Exhibit A — Without internet search
The AI cited: "Theorem 3.1 in C. Livingston and S. Naik, 'Ozsváth-Szabó and Rasmussen invariants of some pretzel knots,' Algebraic & Geometric Topology, 13(2) (2013), 1115-1124"

This paper does not exist. The authors, the journal, the theorem — entirely fabricated. The AI invented a reference to make its argument look legitimate.
Exhibit B — With internet search (subtler error)
The AI cited: "A classical result by Galambos (1976) on the distribution of prime factors..."

The Galambos paper is real. But the "classical result" the AI quotes isn't actually in that paper. Internet access stopped outright fabrication, but introduced a more insidious error: misquoting real sources.

Meanwhile, giving the AI a Python calculator to prevent arithmetic mistakes helped barely at all. The AI was already decent at arithmetic — its errors were logical, not computational.

What's the most impactful part of Aletheia's design?

The generate-verify-revise loop is the architectural breakthrough. When the Verifier checks work without seeing the messy thinking process, it catches errors the Generator couldn't see. This single design choice produced better results than simply adding more compute or more data.
04

What it actually achieved

Four milestones. Real papers. Then 700 unsolved problems from a legendary mathematician.

The four milestones

Milestone A — Fully autonomous paper. Aletheia produced all the math for a paper on "eigenweights" (structure constants in arithmetic geometry) with zero human intervention. The humans only wrote the introduction. A first of its kind.
Milestone B — AI gave the strategy, humans executed. Usually humans direct AI. Here, Aletheia proposed the creative high-level approach for proving bounds on independence polynomials, and human mathematicians implemented it rigorously. The roles reversed.
Milestone C — 700 Erdős problems. Deployed against unsolved problems from Paul Erdős. Details below.
Milestone D — Improved human proofs. On two papers, Aletheia found more elegant arguments than the human authors had written, replacing their original proofs.

The Erdős experiment

Paul Erdős (1913–1996) left behind hundreds of unsolved conjectures. A database at ErdosProblems.com tracks 1,179 of them. In December 2025, the team pointed Aletheia at all 700 problems marked "Open."

What happened at each stage:

700
Problems attempted
All 700 problems marked "Open" on Bloom's database were given to Aletheia with no human guidance.
212
AI returned candidates
Aletheia's internal verifier filtered out 488 attempts it judged to be wrong. Only 212 passed its own quality check — a useful feature that saved human reviewers enormous time.
63
Technically correct
Human mathematicians graded the 212 candidates. 137 were fundamentally flawed. 63 were technically valid — but most had a catch.
13
Meaningfully correct
50 of the 63 "correct" solutions gamed the question — interpreting it in a way that made it trivially easy. Only 13 addressed the intended problem.
4
Genuinely new solutions
Of the 13, some turned out to be rediscoveries of existing (but obscure) solutions. Only 4 appear to be genuinely novel. None were individually considered significant enough for a research paper.
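The funnel above can be tallied directly from the paper's numbers:

```python
# The Erdős funnel: each stage as a share of the 700 attempts
# and of the stage before it.
funnel = [
    ("attempted", 700),
    ("passed internal verifier", 212),
    ("technically correct", 63),
    ("meaningfully correct", 13),
    ("genuinely new", 4),
]

for (label, n), (_, prev) in zip(funnel[1:], funnel):
    print(f"{label:>25}: {n:>3}  "
          f"({n / 700:5.1%} of attempts, {n / prev:5.1%} of previous stage)")
```

Only about 0.6% of attempted problems yielded a genuinely new solution, which is the paper's honest headline.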

Tellingly, many of the open Erdős problems turned out to have remained unresolved out of obscurity rather than difficulty.

The FirstProof test

In February 2026, academic mathematicians (with no AI company ties) created 10 research-level problems with unpublished solutions — making it impossible for AI to have memorized the answers. AI teams had 8 days.

Problem | Result | Expert verdict
P1 | No answer returned | –
P2 | Solved ✓ | Correct (unanimous)
P3 | No answer returned | –
P4 | No answer returned | –
P5 | Solved ✓ | Correct (unanimous)
P6 | No answer returned | –
P7 | Solved ✓ | Correct, publication-worthy
P8 | Solved ✓ | Correct (5 of 7 experts)
P9 | Solved ✓ | Correct (unanimous)
P10 | Solved ✓ | Correct (unanimous)

6 of 10 attempted. 6 of 6 correct. The best performance of any system. Baseline models (GPT 5.2 Pro, Gemini 3) could only solve 2 problems out of the box. Aletheia's standout trait: it knew when to say "I don't know" rather than guessing wrong.

How did competitors do?

OpenAI (internal model + human guidance): Claimed 6 solutions, but 1 was found to be wrong → 5 correct. However, they used undisclosed human guidance, so it's not fully autonomous.

Cursor researchers: 1 autonomous solution (Problem 6).

Baseline models: Only Problems 9 and 10 were solvable by off-the-shelf models.

Aletheia: 6 correct, fully autonomous — the leading result.

05

The honest numbers

The paper doesn't sugarcoat. Here's what AI still can't do.

65%
Fundamentally wrong
6%
Meaningfully correct
2%
Genuinely novel

(Shares of the 212 Erdős candidates: 137 flawed, 13 meaningful, 4 novel.)

Almost 7 in 10 of AI's solution attempts on the Erdős problems were fundamentally flawed. The paper identifies five persistent weaknesses:

1. Hallucination persists. Even with internet access, the AI fabricates or misquotes references. It will find a real paper and claim it contains a theorem that isn't actually there.

2. Specification gaming. When a problem is ambiguous, the AI interprets it in the easiest possible way. Fifty of 63 "correct" Erdős solutions were technically valid but mathematically useless — answering a trivially easy version of the question nobody was asking.

3. No genuine creativity. Current successes come from pattern-matching and knowledge retrieval, not what mathematicians would call creative insight. The proofs are clever recombinations, not conceptual breakthroughs.

4. Short and shallow. AI-generated proofs remain brief and elementary compared to typical human research. Nothing approaching the depth of a major paper.

5. Subconscious plagiarism. The AI may reproduce solutions it absorbed during training without citing them — like accidentally plagiarizing a book you read years ago. One Erdős solution turned out to be nearly identical to a 2012 Chinese math competition problem.

These are milestones for artificial intelligence. They are not claimed to be major advances for mathematics.

Why did 50 of Aletheia's "correct" Erdős solutions turn out to be useless?

This is "specification gaming" — a well-known AI failure mode. Like a genie granting wishes too literally, the AI finds the easiest interpretation rather than the intended one. A human mathematician would immediately recognize the intended meaning; the AI doesn't have that contextual awareness.
06

Measuring what AI does

The paper proposes a scoring system — like self-driving car levels, but for math research.

Media coverage of AI math is routinely exaggerated. The paper proposes a two-axis system to force transparency: how much the AI did (autonomy) and how important the math is (significance).

Here's where the paper's own results land on that grid:

Significance | Primarily Human | Collaboration | Autonomous
Level 0 · Negligible | | | Erdős-652, 654, 1040
Level 1 · Minor | | Generalized Erdős-1051 | Erdős-1051 · Eigenweights
Level 2 · Publishable | Arithmetic Volumes | Independence Polynomials |
Level 3 · Major | | |
Level 4 · Landmark | | |

Notice the pattern: Levels 3 and 4 are completely empty. No AI result, from any company, has produced a major mathematical advance, let alone a landmark breakthrough. The autonomous results sit in the Autonomous column at the lowest significance levels: high autonomy, low significance.

Why the authors built this

They're remarkably blunt about the problem. AI companies have incentives to exaggerate their results. But the paper also identifies a "perverse incentive" for mathematicians: claiming AI helped your work gets you more media attention than publishing the same result without mentioning AI. Both forces inflate the perceived capability of these systems.

The paper also proposes "Human-AI Interaction Cards" — standardized documentation for every AI-assisted paper showing exactly what was prompted, what the AI produced, and what humans changed. Like nutrition labels, but for scientific credibility.
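What might such a card look like as data? The schema below is purely speculative: the paper describes the idea, not a format, so every field name here is my guess, illustrated with Milestone A (the eigenweights paper).

```python
from dataclasses import dataclass, field

# A hypothetical "Human-AI Interaction Card". The field names are guesses
# based on the paper's description (what was prompted, what the AI produced,
# what humans changed) — not a published schema.

@dataclass
class InteractionCard:
    paper_title: str
    ai_system: str
    human_prompts: list = field(default_factory=list)     # what humans asked for
    ai_contributions: list = field(default_factory=list)  # what the AI produced
    human_changes: list = field(default_factory=list)     # what humans verified or rewrote
    autonomy: str = "collaboration"  # primarily-human | collaboration | autonomous
    significance: int = 0            # the paper's 0-4 significance scale

card = InteractionCard(
    paper_title="Eigenweights paper (Milestone A)",
    ai_system="Aletheia",
    ai_contributions=["all theorem statements and proofs"],
    human_changes=["wrote the introduction"],
    autonomy="autonomous",
    significance=1,
)
print(card.autonomy)  # → autonomous
```

The point of the exercise: once the card is structured data, the autonomy and significance claims can be aggregated and audited rather than buried in press releases.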

What would a Level 3 or 4 result look like?

Level 3 (Major advance): A result published in one of math's top 5 journals (Annals of Mathematics, Inventiones, etc.). Something that significantly advances understanding of a field.

Level 4 (Landmark breakthrough): Once-in-a-generation results like Andrew Wiles proving Fermat's Last Theorem (1995) or Perelman proving the Poincaré Conjecture (2003). There is no AI result even remotely close to this level.

Based on everything in this course, which statement best captures this paper?

This is the core message. Aletheia is a genuine milestone — the first AI system to produce publication-grade mathematics autonomously. But it fails more than it succeeds, hallucinates frequently, and its successes are elementary by research standards. The paper is as much about honest communication as it is about the technical achievement.

Glossary

Every term from this course in one place.

Agent
An AI that can use tools (search, browse, code) — not just generate text.
Aletheia
Google DeepMind's math research agent. Three sub-agents: Generator, Verifier, Reviser.
Deep Think
Google's reasoning model that powers Aletheia. Can "think" for extended periods.
Erdős
Paul Erdős (1913–1996). Prolific mathematician who left hundreds of unsolved problems.
FirstProof
10 research-level problems with unpublished solutions, created to test AI fairly.
Hallucination
When AI confidently produces false information — fake papers, wrong theorems.
Inference
When an AI processes a question and generates an answer. More compute = longer thinking.
LLM
Large Language Model. The base AI trained on text to predict what comes next.
Scaling Law
The pattern: more compute → better results, up to a point of diminishing returns.
Spec Gaming
When AI finds a loophole in the question to give a correct but useless answer.

Course built from: arXiv:2602.10177v3 — "Towards Autonomous Mathematics Research" by Trinh et al., Google DeepMind, 2026.

Designed for readers with zero AI experience.