QFT in Curved Spacetime · Graduate level · 7 Contestants · 7 Confirmed / 21 Flagged
The Problem
Two parts. Rindler geometry, then Unruh thermality.
A two-part problem. Part 1: construct the coordinate system of a uniformly accelerating observer in Minkowski spacetime, derive the Rindler line element, and identify the Rindler horizon. Part 2: quantize a massless scalar field in the Minkowski vacuum, decompose it separately in Minkowski and Rindler mode bases, compute the Bogoliubov coefficients, and extract the thermal spectrum the accelerating observer reads — yielding T = ℏa / (2π c k_B).
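For a sense of scale, the target formula can be evaluated numerically. The sketch below assumes SI units and CODATA 2018 constants; the function name `unruh_temperature` is illustrative and not part of the benchmark.

```python
import math

# CODATA 2018 values in SI units (assumed here; c and k_B are exact by definition).
HBAR = 1.054571817e-34   # reduced Planck constant, J s
C = 299_792_458.0        # speed of light, m / s
K_B = 1.380649e-23       # Boltzmann constant, J / K


def unruh_temperature(a: float) -> float:
    """Return T = hbar * a / (2 * pi * c * k_B) in kelvin, for proper acceleration a in m/s^2."""
    return HBAR * a / (2.0 * math.pi * C * K_B)


if __name__ == "__main__":
    for a in (1.0, 9.81, 1.0e20):
        print(f"a = {a:.3g} m/s^2  ->  T = {unruh_temperature(a):.3e} K")
```

At a = 1 m/s² this evaluates to roughly 4 × 10⁻²¹ K, and the result scales linearly with a.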
This is the archetypal QFT-in-curved-spacetime derivation. It exercises coordinate-chart construction, mode decomposition over non-global Cauchy slices, analytic continuation in the complex ζ-plane, the |β_{ωω'}|² absolute-value step, and the step from a thermal-looking spectrum to a thermal spectrum. The errors are rarely in the final formula; they are in the handling of dimensions, branch cuts, left-wedge vs. right-wedge supports, and analytic continuation choices.
Six of seven contestants reached both final answers. Gemini 3.1 Pro reached Part 1 but its response truncated at the token budget before Part 2 completed — a budgeting failure, not a physics failure.
Results at a glance
Seven contestants. Two final answers. Six completions.
| Model | Response Length | Claims | Survival Rate | Confirmed / Flagged |
|---|---|---|---|---|
| Claude | — | 68 | 0.947 | 1 / 2 |
| GPT-5.4 | — | 72 | 0.932 | 3 / 4 |
| o3-mini | — | 42 | 0.708 | 1 / 7 |
| Gemini * | 4,661 ch (truncated) | 42 | 0.864 | 0 / 3 |
| DeepSeek R1 | — | 62 | 0.958 | 0 / 1 |
| Qwen | — | 52 | 0.923 | 0 / 2 |
| Grok | — | 52 | 0.920 | 2 / 2 |
* Gemini 3.1 Pro reached Part 1 (Rindler transform) but response was truncated at max_tokens=6144 before Part 2 completed. Future runs use max_tokens=12288 for reasoning-enabled models.
What cross-family confirmation caught
Seven errors that survived independent re-review.
Of 21 errors flagged by single adversaries, 7 held up under independent re-review by a second adversary from a different model family — the highest confirmation rate of our three problems at 33.3%. The errors cluster around dimensional consistency, analytic continuation, and support regions in the Rindler wedge decomposition.
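The headline rate can be reproduced from the per-contestant counts in the results table. This is a minimal sketch of that arithmetic only; the dictionary layout is an assumption made for illustration, not the pipeline's data format.

```python
# (confirmed, flagged) per contestant, copied from the results table.
counts = {
    "Claude": (1, 2),
    "GPT-5.4": (3, 4),
    "o3-mini": (1, 7),
    "Gemini": (0, 3),
    "DeepSeek R1": (0, 1),
    "Qwen": (0, 2),
    "Grok": (2, 2),
}

total_confirmed = sum(confirmed for confirmed, _ in counts.values())
total_flagged = sum(flagged for _, flagged in counts.values())
confirmation_rate = total_confirmed / total_flagged

print(f"{total_confirmed} confirmed of {total_flagged} flagged "
      f"= {confirmation_rate:.1%}")
```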
Claude · dimensional consistency · claim #8
Claude wrote "constant proper acceleration a requires dφ/dτ = a" — missing a factor of c. The dimensionally correct relation for Rindler rapidity is dφ/dτ = a/c. A minor bug that cascades into the identification of the acceleration parameter in the Rindler metric.
First flag: GPT-5.4 · Confirmed by: DeepSeek R1
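The factor of c in this card follows from the rapidity parametrization of the four-velocity. The short derivation below is a sketch assuming motion along x, metric signature (−,+), and rapidity φ(τ); the symbols u^μ and a^μ for four-velocity and four-acceleration are introduced here for illustration.

```latex
\begin{align}
u^\mu &= \bigl(c\cosh\varphi,\; c\sinh\varphi\bigr), \\
a^\mu &= \frac{du^\mu}{d\tau}
       = c\,\frac{d\varphi}{d\tau}\,\bigl(\sinh\varphi,\; \cosh\varphi\bigr), \\
a^\mu a_\mu &= c^2\left(\frac{d\varphi}{d\tau}\right)^{2}
               \bigl(\cosh^2\varphi - \sinh^2\varphi\bigr)
             = c^2\left(\frac{d\varphi}{d\tau}\right)^{2}.
\end{align}
% Setting a^\mu a_\mu = a^2 for constant proper acceleration a:
\begin{equation}
\frac{d\varphi}{d\tau} = \frac{a}{c}
\quad\Longrightarrow\quad
\varphi(\tau) = \frac{a\tau}{c} \quad (\varphi(0) = 0).
\end{equation}
```

Both sides of the final relation carry dimensions of inverse time, which is the check that dφ/dτ = a fails.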
GPT-5.4 · Rindler transform · claim #25
Inverse transformation written as η = (c/2) ln((x+ct)/(x-ct)), missing a factor of 1/a. With the factor restored, η = (c/(2a)) ln((x+ct)/(x-ct)) has dimensions of time, so that aη/c is a dimensionless rapidity; the form as written has dimensions of velocity.
First flag: Claude · Confirmed by: Qwen
GPT-5.4 · argument dimensions · claim #26
Wrote sinh(η/c) as part of the Rindler-to-Minkowski transformation. The argument of a hyperbolic function must be dimensionless; η/c is not. The correct combination is aη/c, which is dimensionless.
First flag: DeepSeek R1 · Confirmed by: Claude
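A dimensional check of this kind is mechanical enough to script. The sketch below tracks only length and time exponents and assumes η carries dimensions of time, which is what the dimensionless combination aη/c implies; the helper names are hypothetical.

```python
from typing import Tuple

Dim = Tuple[int, int]  # (length exponent, time exponent)

DIMENSIONLESS: Dim = (0, 0)
LENGTH: Dim = (1, 0)
TIME: Dim = (0, 1)


def mul(p: Dim, q: Dim) -> Dim:
    """Dimensions of a product: exponents add."""
    return (p[0] + q[0], p[1] + q[1])


def div(p: Dim, q: Dim) -> Dim:
    """Dimensions of a quotient: exponents subtract."""
    return (p[0] - q[0], p[1] - q[1])


SPEED = div(LENGTH, TIME)        # c: L T^-1
ACCELERATION = div(SPEED, TIME)  # a: L T^-2
ETA = TIME                       # assumption: eta is a time-like coordinate

flagged_argument = div(ETA, SPEED)                        # eta / c
corrected_argument = div(mul(ACCELERATION, ETA), SPEED)   # a * eta / c

print("eta/c     is dimensionless:", flagged_argument == DIMENSIONLESS)
print("a*eta/c   is dimensionless:", corrected_argument == DIMENSIONLESS)
```

Under the stated assumption, η/c comes out as T²/L while aη/c has no dimensions, matching the flag.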
GPT-5.4 · proper time scaling · claim #31
Wrote dτ = (ξ₀/c²) dη. Proper time should scale as length/c, not length/c². Another c-factor drift in the chain that converts between affine parameter and proper time along a uniformly accelerated worldline.
First flag: Qwen · Confirmed by: DeepSeek R1
o3-mini · initial-condition conflation · claim #6
Conflated "observer at rest in Rindler frame" (always true by construction — the Rindler observer is by definition at fixed ξ) with the standard initial condition "at rest in Minkowski frame at τ = 0". These are two different statements, and only the latter is an independent constraint that fixes the initial conditions.
First flag: Claude · Confirmed by: Grok
DeepSeek R1 · wedge support · claim #47
Left-wedge modes written as θ(V) V^{-iΩ}. But in the left Rindler wedge V < 0 everywhere, so θ(V) = 0 identically, making the left-wedge mode the zero function. The correct form is θ(-V)(-V)^{-iΩ}, which is nonzero exactly where the left wedge lives.
First flag: GPT-5.4 · Confirmed by: Claude
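The support problem is easy to see numerically. The following sketch evaluates both forms at sample points with V < 0, assuming (as this card does) that V is negative throughout the left wedge and treating Ω as a dimensionless frequency; the function names are illustrative.

```python
import cmath
import math


def mode_flagged(V: float, omega: float) -> complex:
    """theta(V) * V**(-i*omega): supported only where V > 0."""
    if V <= 0:
        return 0j
    return cmath.exp(-1j * omega * math.log(V))


def mode_corrected(V: float, omega: float) -> complex:
    """theta(-V) * (-V)**(-i*omega): supported only where V < 0."""
    if V >= 0:
        return 0j
    return cmath.exp(-1j * omega * math.log(-V))


if __name__ == "__main__":
    omega = 1.0
    for V in (-3.0, -1.0, -0.1):  # sample points in the left wedge
        print(f"V = {V:+.1f}: flagged = {mode_flagged(V, omega)}, "
              f"|corrected| = {abs(mode_corrected(V, omega)):.3f}")
```

At every sampled point the flagged form returns zero, while the corrected form is a pure phase with unit modulus.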
Grok · derivation rigor claim · claim #47
Claimed "all steps up to |β|² are exact". They aren't — the derivation invokes non-trivial analytic continuation into the complex ζ-plane and branch-cut choices. These are legitimate moves made in every standard derivation, but they are not exact; they are analytic-continuation arguments with specific regularity assumptions.
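To see why the branch choice is a real input rather than a formality, consider the factor (-1)^(-iω) that appears when a mode is continued across V = 0. The sketch below assumes a dimensionless frequency ω = Ωc/a and the schematic per-mode relations |α|² = e^{2πω}|β|² and |α|² − |β|² = 1 from the standard derivation; the function names are hypothetical.

```python
import cmath
import math


def continuation_factor(omega: float, branch: int) -> float:
    """Return (-1)**(-i*omega) using ln(-1) = i*pi*branch, with branch = +1 or -1."""
    return cmath.exp(-1j * omega * (1j * math.pi * branch)).real


def planck_occupation(omega: float) -> float:
    """|beta|^2 = 1 / (exp(2*pi*omega) - 1), from the two relations assumed above."""
    return 1.0 / math.expm1(2.0 * math.pi * omega)


if __name__ == "__main__":
    omega = 1.0
    upper = continuation_factor(omega, +1)   # exp(+pi * omega)
    lower = continuation_factor(omega, -1)   # exp(-pi * omega)
    print(f"branch +1: {upper:.4f}, branch -1: {lower:.4f}, ratio: {upper / lower:.2f}")
    print(f"occupation number at omega = {omega}: {planck_occupation(omega):.6f}")
```

The two branches differ by a factor of e^{2πω}, which is exactly the Boltzmann-like ratio that produces the thermal spectrum, so the result depends on the continuation argument rather than on algebra alone.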
6 of 7 contestants reached both final answers. Only Gemini 3.1 Pro didn't — and that was due to the 6,144-token budget clipping Part 2, not a physics failure. We have raised the budget for reasoning-enabled models going forward.
Rindler / Unruh has the highest confirmation rate of our three problems (33.3%). Dimensional bookkeeping, branch-cut conventions, and wedge supports produce more cross-family-robust disagreements than textbook algebra does.
A dedicated STEM-reasoning model (o3-mini) had the lowest survival rate (0.708) and the most flagged errors (7) — though only 1 of those 7 held up under confirmation. Interpretation: o3-mini's chain-of-thought surfaces more intermediate claims that adversaries can flag, even when most of those flags don't survive second-round scrutiny.
DeepSeek R1 had the highest survival rate (0.958) despite also being a reasoning-oriented model. "Reasoning model" is not a monolithic category with a monolithic effect on derivation quality.
GPT-5.4 had the most confirmed errors (3) and a survival rate of 0.932 — a reminder that reaching the right final answer is compatible with making real errors on the way there, and that a capable adversary panel can surface them.
Methodology Limitations
Read this before citing any number on this page. These limitations are ordered roughly by how much they should move your priors.
• No symbolic verification
Every verdict in this pipeline is produced by an LLM reading LaTeX. There is no SymPy check, no theorem prover. "Confirmed error" means two LLMs from different families independently flagged a step as wrong; it does not mean the step is mathematically wrong. Layer 3 (symbolic verification) is the top roadmap item.
• Stochastic verdicts
Running Layer 4b twice on the same output can produce different confirmation counts due to LLM sampling variance (observed ±20% across reruns). Reported numbers are single-run.
• Parse failure rate
Approximately 33% of Layer 4b calls returned unparseable responses (primarily from o3-mini on OpenRouter). Parse failures are treated as non-confirmations — which systematically under-counts real errors.
• Manual verification required
Confirmed errors should be manually spot-checked before use. Not all "confirmed" errors are unambiguous physics errors — some are notation or convention disagreements between adversaries.
• Ground truth in prompt
The final-answer check is a regex for the formula the problem statement asked the model to derive. Passing this check means the model produced a matching formula, not that the derivation constitutes a valid proof.
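The gap between matching a formula and proving it can be illustrated with a toy check. The pattern below is hypothetical and written only for illustration; the pipeline's actual regex is not shown on this page.

```python
import re

# Hypothetical pattern (an assumption, not the project's regex).
# Accepts Unicode or LaTeX spellings of hbar and pi.
FINAL_ANSWER = re.compile(
    r"T\s*=\s*(?:ℏ|ħ|\\hbar)\s*a\s*/\s*\(\s*2\s*(?:π|\\pi)\s*c\s*k_B\s*\)"
)


def passes_final_answer_check(response: str) -> bool:
    """True if the response contains a string matching the target formula."""
    return FINAL_ANSWER.search(response) is not None


if __name__ == "__main__":
    # A response with no derivation at all still passes.
    print(passes_final_answer_check("Trust me: T = ℏa / (2π c k_B)."))
    # An equivalent answer written in natural units fails the string match.
    print(passes_final_answer_check("T = a / (2π) with ℏ = c = k_B = 1"))
```

The first call passes and the second fails, so a check of this shape can err in both directions relative to the physics.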
Raw Data
Audit every step.
Every Layer 1 response, the full claim decomposition, every Layer 4 verdict, and every Layer 4b confirmation for this problem is on GitHub: