General Relativity · Graduate level · 7 Contestants · 4 Confirmed / 14 Flagged
The Problem
Derive r_s from the Einstein field equations in vacuum.
Each contestant was asked to start from the vacuum Einstein field equations G_μν = 0 and derive the Schwarzschild radius from first principles, assuming a static, spherically symmetric metric.
The canonical path: write the most general static, spherically symmetric line element; compute the Christoffel symbols; evaluate the non-trivial Ricci tensor components; impose R_μν = 0; integrate; match to the weak-field Newtonian limit to fix the integration constant. The coordinate-singular surface g_tt(r) = 0 gives the Schwarzschild radius r_s = 2GM / c².
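For orientation, the canonical path can be sketched as the worked equations below. This is a compressed outline, not any contestant's derivation. It assumes signature (-,+,+,+), the coordinate x^0 = ct, and the metric functions ν(r) and λ(r); the name ν is an assumption introduced here, while e^λ follows the notation used later on this page.

```latex
% Static, spherically symmetric ansatz (assumed conventions: signature -+++, x^0 = ct)
\begin{align}
ds^2 &= -e^{\nu(r)}\,c^2\,dt^2 + e^{\lambda(r)}\,dr^2
        + r^2\left(d\theta^2 + \sin^2\theta\,d\phi^2\right) \\
% Combining the tt and rr vacuum equations removes the second derivatives
e^{\lambda-\nu}R_{00} + R_{rr} &= \frac{\nu' + \lambda'}{r} = 0
  \quad\Longrightarrow\quad \nu = -\lambda
  \ \text{(additive constant absorbed into } t\text{)} \\
% The angular equation then integrates directly
R_{\theta\theta} &= 1 - e^{-\lambda}\left[1 + \tfrac{r}{2}\left(\nu' - \lambda'\right)\right] = 0
  \quad\Longrightarrow\quad \left(r\,e^{-\lambda}\right)' = 1
  \quad\Longrightarrow\quad e^{-\lambda} = 1 - \frac{C}{r} \\
% Weak-field limit with Newtonian potential \Phi = -GM/r fixes C
e^{\nu} &\simeq 1 + \frac{2\Phi}{c^2} = 1 - \frac{2GM}{c^2 r}
  \quad\Longrightarrow\quad C = \frac{2GM}{c^2} \\
% The surface g_{tt} = 0 sits at r = C
g_{tt}(r_s) &= 0 \quad\Longrightarrow\quad r_s = \frac{2GM}{c^2}
\end{align}
```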
This is textbook general relativity — the kind of derivation every graduate student runs through before they're let near anything harder. Every one of the seven contestants reached the correct final formula. What distinguishes them is the integrity of the intermediate algebra.
Results at a glance
Seven contestants. One formula. Different paths there.
Model       | Response Length | Claims | Survival Rate | Confirmed / Flagged
Claude      | 5,918 ch        | 52     | 0.903         | 1 / 3
GPT-5.4     | 4,861 ch        | 52     | 0.946         | 0 / 2
o3-mini     | 4,070 ch        | 35     | 1.000         | 0 / 0
Gemini      | 5,140 ch        | 40     | 0.842         | 0 / 3
DeepSeek R1 | 5,003 ch        | 40     | 0.931         | 0 / 2
Qwen        | 4,492 ch        | 42     | 0.926         | 2 / 2
Grok        | 4,730 ch        | 38     | 0.917         | 1 / 2
What cross-family confirmation caught
Four errors that survived independent re-review.
Of 14 errors flagged by single adversaries, 4 held up under independent re-review by a second adversary from a different model family. Here are the real physics errors that survived cross-family confirmation.
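The headline counts can be re-derived from the per-model Confirmed / Flagged column of the results table. The short sketch below does only that arithmetic; the dictionary name `results` is an illustrative choice, and the numbers are copied from the table.

```python
# (confirmed, flagged) per contestant, copied from the results table
results = {
    "Claude": (1, 3),
    "GPT-5.4": (0, 2),
    "o3-mini": (0, 0),
    "Gemini": (0, 3),
    "DeepSeek R1": (0, 2),
    "Qwen": (2, 2),
    "Grok": (1, 2),
}

confirmed = sum(c for c, _ in results.values())
flagged = sum(f for _, f in results.values())
rate = confirmed / flagged  # cross-family confirmation rate

print(f"{confirmed} confirmed / {flagged} flagged = {rate:.1%}")
```

The totals come to 4 confirmed out of 14 flagged, which is the 28.6% confirmation rate quoted for this problem.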
Claude · R_rr calculation · claim #16
Claude's expression for the Ricci tensor component R_rr came out off by an overall factor of 2 compared to the standard derivation. The error propagates into the algebra that fixes e^λ but happens to cancel by the final step. The formula still lands on r_s = 2GM/c², which is why a final-answer check misses it.
First flag: Gemini · Confirmed by: DeepSeek R1
Qwen · metric ansatz · claim #32
Qwen wrote e^λ = 1 / (1 + k/r), with the wrong sign on the integration constant. The standard derivation yields e^λ = 1 / (1 - C/r). The sign can depend on how the integration constant is defined relative to the mass parameter, but Qwen's statement conflicts with its own subsequent identification C = 2GM/c².
First flag: Claude · Confirmed by: Grok
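To see why the sign matters once the constant is identified with 2GM/c², one can read off the Newtonian potential implied by each form. The sketch below assumes e^ν = e^(-λ) and the weak-field relation e^ν ≈ 1 + 2Φ/c²; the helper `newtonian_potential` is a hypothetical name introduced for illustration, not part of the pipeline.

```python
import sympy as sp

r, G, M, c = sp.symbols("r G M c", positive=True)
C = 2 * G * M / c**2  # the identification stated in the derivation

def newtonian_potential(exp_minus_lambda):
    """Potential Phi implied by e^{-lambda} ~ 1 + 2*Phi/c^2 (assumed weak-field relation)."""
    return sp.simplify(c**2 * (exp_minus_lambda - 1) / 2)

phi_standard = newtonian_potential(1 - C / r)  # from e^lambda = 1/(1 - C/r)
phi_flipped = newtonian_potential(1 + C / r)   # from e^lambda = 1/(1 + C/r)

print("standard sign:", phi_standard)
print("flipped sign: ", phi_flipped)
```

Under these assumptions the standard sign reproduces the attractive potential -GM/r, while the flipped sign with the same positive constant gives +GM/r, a repulsive potential. That mismatch is what makes the flagged statement inconsistent with C = 2GM/c².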
Qwen · integration step · claim #37 (second confirmed)
A second independently confirmed Qwen claim in the same derivation: reviewers from two different families flagged an inconsistency in the step that integrates the vacuum field equation to yield the metric coefficients. See the raw Layer 4/4b output on GitHub for the full verdict text; the error belongs to the same "sign-or-factor-drift-that-cancels" family as the first one.
First flag: GPT-5.4 · Confirmed by: Claude
Grok · off-diagonal Ricci · claim #36
Grok invoked R_tr to derive the constraint ∂_r(A'/A + B'/B) = 0. The problem: for a static, spherically symmetric metric of the form Grok wrote, R_tr vanishes identically, so no such constraint arises from it. The needed relationship comes from the combination of R_tt and R_rr, not from an off-diagonal component that was zero to begin with.
First flag: DeepSeek R1 · Confirmed by: Qwen
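The two statements in this verdict can be checked symbolically. The sketch below is an illustration only, not part of the pipeline. It assumes the metric ds² = -A(r) dt² + B(r) dr² + r² dΩ² with c = 1 and signature (-,+,+,+); which metric component Grok's A and B actually denote is not stated on this page, so that assignment is an assumption. The helper names `metric` and `ricci_tensor` are hypothetical.

```python
import sympy as sp

t, r, theta, phi, r_s = sp.symbols("t r theta phi r_s", positive=True)
coords = [t, r, theta, phi]
n = len(coords)

def metric(A, B):
    """Static, spherically symmetric metric with g_tt = -A and g_rr = B (assumed labels)."""
    return sp.diag(-A, B, r**2, r**2 * sp.sin(theta)**2)

def ricci_tensor(g):
    """Ricci tensor R_{bc} of the Levi-Civita connection of the metric matrix g."""
    ginv = g.inv()
    # Christoffel symbols Gamma[a][b][c] = Gamma^a_{bc}
    Gamma = [[[sp.simplify(sum(
        ginv[a, d] * (sp.diff(g[d, b], coords[c])
                      + sp.diff(g[d, c], coords[b])
                      - sp.diff(g[b, c], coords[d]))
        for d in range(n)) / 2)
        for c in range(n)] for b in range(n)] for a in range(n)]
    Ric = sp.zeros(n, n)
    for b in range(n):
        for c in range(n):
            expr = 0
            for a in range(n):
                # d_a Gamma^a_{bc} - d_c Gamma^a_{ba}
                expr += sp.diff(Gamma[a][b][c], coords[a])
                expr -= sp.diff(Gamma[a][b][a], coords[c])
                for d in range(n):
                    # Gamma^a_{ad} Gamma^d_{bc} - Gamma^a_{cd} Gamma^d_{ba}
                    expr += Gamma[a][a][d] * Gamma[d][b][c]
                    expr -= Gamma[a][c][d] * Gamma[d][b][a]
            Ric[b, c] = sp.simplify(expr)
    return Ric

# General static case: unknown functions A(r), B(r)
A = sp.Function("A")(r)
B = sp.Function("B")(r)
Ric = ricci_tensor(metric(A, B))

R_tr = Ric[0, 1]  # off-diagonal component invoked in the flagged claim
# Claimed identity: R_tt/A + R_rr/B = (A'/A + B'/B) / (r B)
identity_residual = sp.simplify(
    Ric[0, 0] / A + Ric[1, 1] / B
    - (sp.diff(A, r) / A + sp.diff(B, r) / B) / (r * B)
)

# Schwarzschild solution: A = 1 - r_s/r, B = 1/A should be Ricci-flat
f = 1 - r_s / r
Ric_schw = ricci_tensor(metric(f, 1 / f))

print("R_tr =", R_tr)
print("identity residual =", identity_residual)
print("Schwarzschild Ricci-flat:", all(e == 0 for e in Ric_schw))
```

With these conventions the R_tt and R_rr combination yields A'/A + B'/B = 0 directly in vacuum, while R_tr contributes no equation at all. A check of this kind is the sort of thing a symbolic verification layer could automate, though the sketch says nothing about how such a layer would be built.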
What this tells us
Every model reached the right answer. Not every derivation was right.
All 7 of 7 contestants landed on the correct final formula, r_s = 2GM / c². A final-answer-only check would grade every model correct.
Cross-family confirmation exposes 4 real derivation errors that benchmark-style evaluation misses. These are sign errors, factor drifts, and invocations of vanishing components — the kind of thing that cancels by the end but indicates the derivation is not actually sound.
Schwarzschild has the lowest confirmation rate of our three problems (28.6% vs. 33.3% Rindler, 30.0% Casimir). Our working hypothesis: this problem is so textbook-saturated that adversaries from different families share more training-data intuition about what the derivation "should" look like, suppressing cross-family disagreement on flagged errors.
Adversary lens matters more than contestant accuracy. o3-mini had 0 flagged errors, but also gave the shortest response (4,070 ch / 35 claims). Fewer intermediate claims means a smaller surface area to flag, so a clean flag count is not the same as rigor.
Methodology Limitations
Read this before citing any number on this page. These limitations are ordered roughly by how much they should move your priors.
• No symbolic verification
Every verdict in this pipeline is produced by an LLM reading LaTeX. There is no SymPy check, no theorem prover. "Confirmed error" means two LLMs from different families independently flagged a step as wrong; it does not mean the step is mathematically wrong. Layer 3 (symbolic verification) is the top roadmap item.
• Stochastic verdicts
Running Layer 4b twice on the same output can produce different confirmation counts due to LLM sampling variance (observed ±20% across reruns). Reported numbers are single-run.
• Parse failure rate
Approximately 33% of Layer 4b calls returned unparseable responses (primarily from o3-mini on OpenRouter). Parse failures are treated as non-confirmations — which systematically under-counts real errors.
• Manual verification required
Confirmed errors should be manually spot-checked before use. Not all "confirmed" errors are unambiguous physics errors — some are notation or convention disagreements between adversaries.
• Ground truth in prompt
The final-answer check is a regex for the formula the problem statement asked the model to derive. Passing this check means the model produced a matching formula, not that the derivation constitutes a valid proof.
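To make the last limitation concrete, here is a minimal sketch of what a regex-based final-answer check could look like. The pattern and the function name `passes_final_answer_check` are hypothetical; the actual regex used by the pipeline is not shown on this page.

```python
import re

# Hypothetical pattern: matches "r_s = 2GM/c^2", "r_s = 2GM / c²",
# or the LaTeX form "r_s = \frac{2GM}{c^2}" after whitespace is removed.
FINAL_ANSWER = re.compile(r"r_?s=(?:2GM/c(?:\^2|²)|\\frac\{2GM\}\{c\^2\})")

def passes_final_answer_check(response: str) -> bool:
    """Return True if the response contains the target formula anywhere."""
    compact = re.sub(r"\s+", "", response)  # ignore spacing differences
    return FINAL_ANSWER.search(compact) is not None

sound = r"Matching the Newtonian limit gives $r_s = \frac{2GM}{c^2}$."
unsound = "Since R_tr fixes everything, trivially r_s = 2GM / c²."
wrong = "Therefore r_s = GM / c^2."

for text in (sound, unsound, wrong):
    print(passes_final_answer_check(text), "|", text)
```

The second string passes even though its justification is nonsense, which is the gap described above: a match on the formula carries no information about the steps that led to it.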
Raw Data
Audit every step.
Every Layer 1 response, the full claim decomposition, every Layer 4 verdict, and every Layer 4b confirmation for this problem is on GitHub: