The Schwarzschild Radius

The Problem

Derive r_s from the Einstein field equations in vacuum.

Each contestant was asked to start from the Einstein field equations G_μν = 0 (vacuum) and derive the Schwarzschild radius from first principles, assuming spherical symmetry and staticity of the metric.

The canonical path: write the most general static, spherically symmetric line element; compute the Christoffel symbols; evaluate the non-trivial Ricci tensor components; impose R_μν = 0 (equivalent to G_μν = 0 in vacuum); integrate; and match to the weak-field Newtonian limit to fix the integration constant. The coordinate-singular surface where g_tt(r) = 0 then sits at the Schwarzschild radius r_s = 2GM / c².
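For reference, the canonical path can be outlined in a few lines. This is a minimal sketch rather than any contestant's derivation: the symbol ν for the time-time exponent, the signature (-,+,+,+), and the coordinate choice x^0 = ct are assumptions made here, and primes denote d/dr.

```latex
% Assumed ansatz: static, spherically symmetric, signature (-,+,+,+), x^0 = ct
\begin{align*}
ds^2 &= -e^{\nu(r)} c^2\, dt^2 + e^{\lambda(r)}\, dr^2
        + r^2 \left( d\theta^2 + \sin^2\theta \, d\phi^2 \right) \\[4pt]
% Combine the 00 and rr vacuum equations
e^{\lambda - \nu} R_{00} + R_{rr} &= \frac{\nu' + \lambda'}{r} = 0
  \;\Longrightarrow\; \nu + \lambda = \text{const} \to 0
  \quad (\text{absorbed by rescaling } t) \\[4pt]
% Angular equation, using \nu' = -\lambda'
R_{\theta\theta} &= 1 - e^{-\lambda} \left[ 1 + \tfrac{r}{2} (\nu' - \lambda') \right] = 0
  \;\Longrightarrow\; \frac{d}{dr} \left( r\, e^{-\lambda} \right) = 1
  \;\Longrightarrow\; e^{-\lambda} = e^{\nu} = 1 - \frac{C}{r} \\[4pt]
% Weak-field Newtonian limit fixes the constant
g_{00} &\approx -\left( 1 + \frac{2\Phi}{c^2} \right), \quad \Phi = -\frac{GM}{r}
  \;\Longrightarrow\; C = \frac{2GM}{c^2} \equiv r_s
\end{align*}
```

With this normalization g_tt vanishes at r = C, which is the surface identified above as the Schwarzschild radius.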

This is textbook general relativity — the kind of derivation every graduate student runs through before they're let near anything harder. Every one of the seven contestants reached the correct final formula. What distinguishes them is the integrity of the intermediate algebra.

Model	Response Length	Claims	Survival Rate	Confirmed / Flagged
Claude	5,918 ch	52	0.903	1 / 3
GPT-5.4	4,861 ch	52	0.946	0 / 2
o3-mini	4,070 ch	35	1.000	0 / 0
Gemini	5,140 ch	40	0.842	0 / 3
DeepSeek R1	5,003 ch	40	0.931	0 / 2
Qwen	4,492 ch	42	0.926	2 / 2
Grok	4,730 ch	38	0.917	1 / 2

What cross-family confirmation caught

Four errors that survived independent re-review.

Of 14 errors flagged by single adversaries, 4 held up under independent re-review by a second adversary from a different model family. Here are the real physics errors that survived cross-family confirmation.

Claude: R_rr calculation (claim #16)

Claude's expression for the Ricci tensor component R_rr came out off by an overall factor of 2 compared to the standard derivation. The error propagates into the algebra that fixes e^λ but happens to cancel by the final step — the formula still lands on r_s = 2GM/c², which is why a final-answer check misses it.

First flag: Gemini. Confirmed by: DeepSeek R1.

Qwen: metric ansatz (claim #32)

Qwen wrote e^λ = 1 / (1 + k/r), with the wrong sign on the integration constant. The standard derivation yields e^λ = 1 / (1 - C/r). Taken alone, the sign is a matter of convention, since it depends on how the integration constant is defined relative to the mass parameter, but Qwen's statement conflicts with its own subsequent identification C = 2GM/c².
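The sign issue can be checked numerically. The sketch below is illustrative only: the function names, the solar-mass example, and the sample radius are hypothetical choices made here, not taken from Qwen's response. It compares both forms of g_rr = e^λ against the far-field behaviour of the standard solution, g_rr ≈ 1 + 2GM/(c² r).

```python
import math

# Physical constants (SI). The solar mass is an illustrative choice, not from the source.
G = 6.67430e-11          # gravitational constant, m^3 kg^-1 s^-2
C_LIGHT = 299_792_458.0  # speed of light, m/s
M_SUN = 1.98847e30       # kg

def g_rr_standard(r, C):
    """Standard form: e^lambda = 1 / (1 - C/r)."""
    return 1.0 / (1.0 - C / r)

def g_rr_plus(r, k):
    """Form with a plus sign: e^lambda = 1 / (1 + k/r)."""
    return 1.0 / (1.0 + k / r)

r_s = 2.0 * G * M_SUN / C_LIGHT**2   # roughly 2.95 km for one solar mass
r_far = 1.0e9                        # sample radius in metres, far outside r_s

# With a positive constant, only the minus-sign form lies above 1 at large r.
standard = g_rr_standard(r_far, r_s)
plus_positive_k = g_rr_plus(r_far, r_s)

# The plus-sign form matches the standard one only if k = -2GM/c^2.
plus_negative_k = g_rr_plus(r_far, -r_s)

print(f"r_s = {r_s:.1f} m")
print(f"standard g_rr - 1       = {standard - 1.0:+.3e}")
print(f"plus form, k=+r_s, - 1  = {plus_positive_k - 1.0:+.3e}")
print(f"forms agree when k=-r_s: {math.isclose(standard, plus_negative_k)}")
```

So the plus-sign form is usable only with k = -2GM/c²; pairing it with a positive constant equal to 2GM/c² gives the wrong far-field sign.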

First flag: Claude. Confirmed by: Grok.

Qwen: integration step (claim #37, second confirmed)

A second independently-confirmed Qwen claim in the same derivation — reviewers from two different families flagged an inconsistency in the step that integrates the vacuum field equation to yield the metric coefficients. See raw Layer 4/4b output on GitHub for the full verdict text; the error is of the same "sign-or-factor-drift-that-cancels" family as the first one.

First flag: GPT-5.4. Confirmed by: Claude.

Grok: off-diagonal Ricci (claim #36)

Grok invoked R_tr to derive the constraint ∂_r(A'/A + B'/B) = 0. The problem: for a static, spherically symmetric metric of the form Grok wrote, R_tr vanishes identically — no such constraint arises from it. The needed relationship comes from the combination of R_tt and R_rr, not from an off-diagonal component that was zero to begin with.
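A short sketch of why the off-diagonal component carries no information, and where a relation between A and B actually comes from. The line element below is assumed from the A, B notation in the constraint quoted above, with signature (-,+,+,+) and x^0 = ct also assumed; Grok's exact metric is not reproduced in this section.

```latex
% Assumed form of the static, spherically symmetric metric
\begin{align*}
ds^2 &= -A(r)\, c^2 dt^2 + B(r)\, dr^2 + r^2 d\Omega^2 \\[4pt]
% Time reversal t -> -t leaves this metric unchanged but flips the sign of R_{0r}
R_{0r} &\to -R_{0r} \quad \text{under } t \to -t
  \;\Longrightarrow\; R_{0r} = 0 \ \text{identically} \\[4pt]
% The usable constraint comes from the diagonal components
\frac{R_{00}}{A} + \frac{R_{rr}}{B}
  &= \frac{1}{rB} \left( \frac{A'}{A} + \frac{B'}{B} \right) = 0
  \;\Longrightarrow\; (AB)' = 0
\end{align*}
```

Because R_0r is zero before any field equation is imposed, setting it to zero cannot constrain A or B; the condition on A'/A + B'/B is supplied by the diagonal combination.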

First flag: DeepSeek R1. Confirmed by: Qwen.

What this tells us

Every model reached the right answer. Not every derivation was right.

All 7 of 7 contestants landed on the correct final formula r_s = 2GM / c². A final-answer-only check would grade every model correct.
Cross-family confirmation exposes 4 real derivation errors that benchmark-style evaluation misses. These are sign errors, factor drifts, and invocations of vanishing components — the kind of thing that cancels by the end but indicates the derivation is not actually sound.
Schwarzschild has the lowest confirmation rate of our three problems (28.6% vs. 33.3% Rindler, 30.0% Casimir). Our working hypothesis: this problem is so textbook-saturated that adversaries from different families share more training-data intuition about what the derivation "should" look like, suppressing cross-family disagreement on flagged errors.
Adversary lens matters more than contestant accuracy. o3-mini had 0 flagged errors, but it also gave the shortest response (4,070 ch / 35 claims), and fewer intermediate claims means a smaller surface area to flag. A clean flag count is not the same as rigor.
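The flag totals and the 28.6% confirmation rate quoted above can be recomputed from the Confirmed / Flagged column of the results table. A minimal sketch, with the dictionary layout and function name chosen here purely for illustration:

```python
# (confirmed, flagged) pairs copied from the "Confirmed / Flagged" column of the table.
FLAGS = {
    "Claude": (1, 3),
    "GPT-5.4": (0, 2),
    "o3-mini": (0, 0),
    "Gemini": (0, 3),
    "DeepSeek R1": (0, 2),
    "Qwen": (2, 2),
    "Grok": (1, 2),
}

def confirmation_rate(flags):
    """Return (confirmed, flagged, rate) summed over all contestants."""
    confirmed = sum(c for c, _ in flags.values())
    flagged = sum(f for _, f in flags.values())
    # Guard against division by zero if no claims were flagged at all.
    rate = confirmed / flagged if flagged else 0.0
    return confirmed, flagged, rate

confirmed, flagged, rate = confirmation_rate(FLAGS)
print(f"{confirmed} of {flagged} flags confirmed ({rate:.1%})")
```

The per-contestant pairs sum to 4 confirmed out of 14 flagged, matching the counts stated in the text.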

Raw Data

Audit every step.

Every Layer 1 response, the full claim decomposition, every Layer 4 verdict, and every Layer 4b confirmation for this problem are on GitHub:

github.com/themultivac/multivac-evaluation/tree/main/data/physics_synthesis/PHYSICS-SYNTH-schwarzschild-v0-20260418-160451

If you find an error in our methodology or our physics, email yash@themultivac.com. We would rather be corrected than be wrong.
