The four milestones
Milestone A — Fully autonomous paper. Aletheia
produced all the mathematics for a paper on "eigenweights" (structure
constants in arithmetic geometry) with no human mathematical input;
the humans wrote only the introduction. A first of its kind.
Milestone B — AI gave the strategy, humans executed.
Usually humans direct the AI; here the roles were reversed. Aletheia
proposed the creative high-level approach for proving bounds on
independence polynomials, and human mathematicians implemented it
rigorously.
Milestone C — 700 Erdős problems. Aletheia was deployed against
unsolved problems posed by Paul Erdős. Details below.
Milestone D — Improved human proofs. On two papers,
Aletheia found more elegant arguments than the human authors had
written, replacing their original proofs.
The Erdős experiment
Paul Erdős (1913–1996) left behind hundreds of unsolved conjectures. A
database at
ErdosProblems.com tracks 1,179
of them. In December 2025, the team pointed Aletheia at all 700
problems marked "Open."
What happened at each stage:

1. 700 problems submitted. All 700 problems marked "Open" on Bloom's
database were given to Aletheia with no human guidance.
2. 212 candidates returned. Aletheia's internal verifier filtered out
488 attempts it judged to be wrong; only 212 passed its own quality
check, a useful feature that saved human reviewers enormous time.
3. 63 technically valid. Human mathematicians graded the 212
candidates: 137 were fundamentally flawed, and 63 held up, though
most had a catch.
4. 13 on the intended problem. 50 of the 63 "correct" solutions gamed
the question, interpreting it in a way that made it trivially easy;
only 13 addressed the problem as intended.
5. 4 genuinely new solutions. Of the 13, some turned out to be
rediscoveries of existing (but obscure) solutions; only 4 appear to
be genuinely novel, and none was individually considered significant
enough for a research paper.

The takeaway: many open Erdős problems had remained unresolved out of
obscurity rather than difficulty.
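As a quick sanity check on the funnel, the stage-to-stage survival rates can be computed directly from the counts reported above (a minimal sketch; the stage labels are mine, not Aletheia's):

```python
# Erdős-problem funnel counts as reported in the text.
funnel = [
    ("submitted", 700),               # problems marked "Open"
    ("passed self-verifier", 212),
    ("technically valid", 63),
    ("intended interpretation", 13),
    ("genuinely novel", 4),
]

# Survival rate from each stage to the next.
for (prev, a), (curr, b) in zip(funnel, funnel[1:]):
    print(f"{prev} -> {curr}: {b}/{a} = {b / a:.1%}")

# End-to-end yield of the whole pipeline.
first, last = funnel[0][1], funnel[-1][1]
print(f"overall yield: {last}/{first} = {last / first:.2%}")
```

The striking figure is the end-to-end yield: well under one percent of submitted problems produced a genuinely new result.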
The FirstProof test
In February 2026, academic mathematicians (with no AI company ties)
created 10 research-level problems with unpublished solutions — making
it impossible for AI to have memorized the answers. AI teams had 8
days.
| Problem | Result | Expert Verdict |
| --- | --- | --- |
| P1 | No answer returned | — |
| P2 | Solved ✓ | Correct (unanimous) |
| P3 | No answer returned | — |
| P4 | No answer returned | — |
| P5 | Solved ✓ | Correct (unanimous) |
| P6 | No answer returned | — |
| P7 | Solved ✓ | Correct — publication-worthy |
| P8 | Solved ✓ | Correct (5 of 7 experts) |
| P9 | Solved ✓ | Correct (unanimous) |
| P10 | Solved ✓ | Correct (unanimous) |
Aletheia attempted 6 of the 10 problems and answered all 6 correctly,
the best performance of any system. Baseline models (GPT 5.2 Pro,
Gemini 3) could solve only 2 problems out of the box. Aletheia's
standout trait: it knew when to say "I don't know" rather than guess
wrong.
How did competitors do?
OpenAI (internal model + human guidance): Claimed
6 solutions, but 1 was found to be wrong, leaving 5 correct. And
because the run used undisclosed human guidance, it was not fully
autonomous.
Cursor researchers: 1 autonomous solution
(Problem 6).
Baseline models: Only Problems 9 and 10 were
solvable by off-the-shelf models.
Aletheia: 6 correct, fully autonomous — the
leading result.
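The abstention point can be made concrete: the two leading entrants each claimed six solutions, but only Aletheia's all held up. A minimal sketch using the counts above (baseline and Cursor attempt counts aren't reported, so only claimed-solution precision for the top two is computed):

```python
# Correct vs. claimed solutions, per the text.
entrants = {
    "Aletheia (autonomous)": {"claimed": 6, "correct": 6},
    "OpenAI (human-guided)": {"claimed": 6, "correct": 5},
}

for name, r in entrants.items():
    precision = r["correct"] / r["claimed"]
    print(f"{name}: {r['correct']}/{r['claimed']} claimed solutions "
          f"correct ({precision:.0%})")
```

Identical coverage, different precision: declining to answer four problems cost Aletheia nothing, while one overconfident claim cost OpenAI its clean record.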