Quantum Field Theory · Graduate level · 7 Contestants · 6 Confirmed / 20 Flagged
The Problem
A regulator choice. A formally infinite sum. A finite answer.
Two parallel perfectly conducting plates, separation d, infinite transverse extent. Between them, the quantized electromagnetic field. Each mode contributes ℏω/2 to the vacuum energy — and the total is formally infinite. The derivation asks the contestant to make sense of that infinity and extract the finite, physically meaningful piece.
The derivation therefore turns on a regularization choice: zeta-function regularization, exponential (or Gaussian) cutoff, dimensional regularization, or Euler-Maclaurin summation. Each choice requires a consistent subtraction of the free-vacuum contribution, a careful analytic continuation or cutoff-independence argument, and an extraction of the finite part. The final result F/A = -π²ℏc/(240 d⁴) is a cornerstone of experimental QFT — measured in 1997 by Lamoreaux to 5% accuracy.
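To make the "subtract the divergence, keep the finite part" step concrete, here is a minimal numerical sketch. It is a toy, not the full derivation: it assumes the transverse momentum integral has already been done, so the mode sum carries n³ weights, and it applies an exponential cutoff to the mode index n only. The function names and the cutoff values are illustrative choices.

```python
import math

def regulated_cubic_sum(eps: float) -> float:
    """Sum of n^3 * exp(-eps * n) over n >= 1, truncated once terms are negligible."""
    n_max = int(60.0 / eps)  # exp(-60) is far below double precision
    return math.fsum(n**3 * math.exp(-eps * n) for n in range(1, n_max + 1))

def finite_part(eps: float) -> float:
    """Remove the divergent 6/eps^4 piece; the remainder tends to 1/120 as eps -> 0."""
    return regulated_cubic_sum(eps) - 6.0 / eps**4

# The divergent piece grows like 1/eps^4, while the remainder settles near 1/120.
for eps in (0.4, 0.2, 0.1):
    print(f"eps={eps:4.2f}  finite part={finite_part(eps):.6f}  target={1/120:.6f}")
```

The remainder differs from 1/120 only by terms that vanish as the cutoff is removed, which is the sense in which the regulator parameter drops out of the final answer.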
Casimir is therefore conceptually harder than Schwarzschild (which is algebraic) or Rindler / Unruh (which is coordinate-chart construction plus mode decomposition). It tests how models handle regularization: do they state their regulator explicitly? Subtract consistently? Show cutoff independence? Confuse the regulator parameter with a physical length?
Results at a glance
Seven contestants. One regulator. Many ways to get it wrong.
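| Model | Response Length | Claims | Survival Rate | Confirmed / Flagged |
|---|---|---|---|---|
| Claude | 9,364 ch | 52 | n/a | 2 / 5 |
| GPT-5.4 | 7,820 ch | 52 | n/a | 2 / 3 |
| o3-mini | 3,424 ch | 42 | n/a | 0 / 5 |
| Gemini | 6,109 ch | 28 | n/a | 0 / 2 |
| DeepSeek R1 | 8,447 ch | 42 | n/a | 1 / 2 |
| Qwen | 5,290 ch | 40 | n/a | 0 / 1 |
| Grok | 6,755 ch | 30 | n/a | 1 / 2 |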
Survival rates omitted for this problem; per-contestant rates were not computed in this run. Full per-claim verdicts are on GitHub.
What cross-family confirmation caught
Six errors that survived independent re-review.
Of 20 errors flagged by single adversaries, 6 held up under independent re-review by a second adversary from a different model family. The errors cluster where you'd expect them: confusion between regulator parameters and physical lengths, misapplied Euler-Maclaurin corrections, and overclaims about the rigor of cutoff-independence arguments.
Claude · cutoff independence · claim #27
Claimed "all higher-order Euler-Maclaurin corrections vanish, justifying the regularization". This overstates the argument. The cutoff-independence of the finite part comes from subtracting the divergent pieces of the free-vacuum contribution, not from the Euler-Maclaurin corrections automatically vanishing. The finite answer is recovered; the justification attached to it is wrong.
First flag: DeepSeek R1 · Confirmed by: GPT-5.4
Claude · Euler-Maclaurin · claim #28
Wrote the Euler-Maclaurin summation correction as -1/180, but omitted the necessary F'''(0) factor. The correction term is a Bernoulli coefficient multiplied by a derivative of the summand evaluated at zero — dropping the derivative misapplies the correction.
First flag: Grok · Confirmed by: Gemini
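The "Bernoulli coefficient times derivative" structure can be seen on a toy summand. The sketch below is not the Casimir summand: it assumes F(n) = exp(-n), chosen only because its integral and derivatives at zero are known in closed form, and the helper name em_correction is hypothetical.

```python
import math
from fractions import Fraction

# Even-index Bernoulli numbers B_2, B_4, B_6 (standard values).
BERNOULLI = {2: Fraction(1, 6), 4: Fraction(-1, 30), 6: Fraction(1, 42)}

def em_correction(order: int, derivative_at_zero: float) -> float:
    """Euler-Maclaurin term -B_order/order! * F^(order-1)(0) for a sum over n >= 0."""
    coeff = -BERNOULLI[order] / math.factorial(order)
    return float(coeff) * derivative_at_zero

# Toy summand F(n) = exp(-n): the integral over [0, inf) is 1, F(0) = 1,
# and every odd derivative at zero equals -1.
exact = 1.0 / (1.0 - math.exp(-1.0))  # closed form of sum_{n>=0} exp(-n)
approx = 1.0 + 0.5 + sum(em_correction(k, -1.0) for k in (2, 4, 6))
print(f"exact={exact:.8f}  Euler-Maclaurin through B_6={approx:.8f}")
```

Each correction is the product of a fixed rational coefficient and a derivative of the summand, so quoting a number alone leaves it unclear whether the derivative factor has been included.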
GPT-5.4 · regulator vs plate separation · claim #33
Wrote U/A = -ℏc a³ / (720π), confusing the regularization cutoff parameter a with the plate separation d. This is the canonical trap of regularized field theory: the regulator is a book-keeping parameter that must drop out of the final answer. Here the dimensional analysis alone flags it — and the π coefficient is also wrong.
First flag: Claude · Confirmed by: DeepSeek R1
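For comparison, the standard energy per unit area is U/A = -π²ℏc/(720 d³), and differentiating with respect to d gives the force quoted above. A short sketch, assuming SI units and CODATA constants (the function names are illustrative), checks that relation numerically:

```python
import math

HBAR = 1.054571817e-34   # J*s (CODATA value)
C = 299792458.0          # m/s (exact by definition)

def energy_per_area(d: float) -> float:
    """Casimir energy per unit plate area in J/m^2, for separation d in metres."""
    return -math.pi**2 * HBAR * C / (720.0 * d**3)

def force_per_area(d: float) -> float:
    """Casimir pressure in Pa; a negative value means the plates attract."""
    return -math.pi**2 * HBAR * C / (240.0 * d**4)

def numerical_force(d: float, rel_step: float = 1e-4) -> float:
    """Central-difference estimate of -d(U/A)/dd, as a cross-check."""
    h = d * rel_step
    return -(energy_per_area(d + h) - energy_per_area(d - h)) / (2.0 * h)

d = 1e-6  # one micrometre
print(force_per_area(d), numerical_force(d))
```

The energy scales as d⁻³ and the pressure as d⁻⁴; an expression proportional to a positive power of a length, such as a³, cannot have units of energy per area when multiplied by ℏc alone.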
GPT-5.4 · meta-claim · claim #48
A self-correction meta-claim confirmed by two adversaries as "ERROR". On manual re-reading this is a weaker case than the other five — the model is commenting on its own earlier move rather than making a primary derivation step. We include it in the confirmed count for methodological consistency, but it should be treated as a weaker signal and not as an unambiguous physics error.
First flag: Grok · Confirmed by: Claude
DeepSeek R1 · regulator notation · claim #21
Introduced two different regulator symbols (a and β) in adjacent steps without defining either precisely, then shifted between them inconsistently. The final result is nonetheless correct, but the intermediate derivation is not self-consistent in its regulator bookkeeping.
First flag: Qwen · Confirmed by: Claude
Grok · zeta regularization argument · claim #11
Wrote the zeta-regularized sum as evaluated at ζ(-1). But the parallel-plates mode sum, with its n³ weights, analytically continues to ζ(-3), not ζ(-1): ζ(-3) = 1/120, whereas ζ(-1) = -1/12. Choosing the wrong ζ argument changes the final coefficient, yet Grok still wrote the correct final π²/240 coefficient, which is inconsistent with what would have followed from the stated argument.
First flag: DeepSeek R1 · Confirmed by: GPT-5.4
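The two zeta values, and what each would imply for the energy coefficient, can be checked with exact rational arithmetic. The sketch below uses the identity ζ(-n) = -B_{n+1}/(n+1) and assumes the usual zeta-regularized form U/A = -(π²ℏc/(6 d³)) × ζ(s), where the factor -1/6 comes from the transverse momentum integral. The function names are illustrative.

```python
from fractions import Fraction
from math import comb

def bernoulli(m: int) -> Fraction:
    """Bernoulli number B_m (B_1 = -1/2 convention) via the standard recurrence."""
    b = [Fraction(1)]
    for k in range(1, m + 1):
        b.append(-sum(comb(k + 1, j) * b[j] for j in range(k)) / (k + 1))
    return b[m]

def zeta_negative(n: int) -> Fraction:
    """zeta(-n) = -B_{n+1} / (n + 1) for a positive integer n."""
    return -bernoulli(n + 1) / (n + 1)

def energy_coefficient(zeta_value: Fraction) -> Fraction:
    """Coefficient k in U/A = k * pi^2 * hbar * c / d^3, assuming k = -zeta/6."""
    return -zeta_value / 6

for n in (3, 1):
    z = zeta_negative(n)
    print(f"zeta(-{n}) = {z}, energy coefficient = {energy_coefficient(z)}")
```

Under these assumptions ζ(-3) yields -1/720, which differentiates to the -1/240 force coefficient, while ζ(-1) yields a coefficient of the opposite sign and a different magnitude.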
What this tells us
The confirmation rate across three problems lands in a narrow band.
The Casimir confirmation rate is 30.0% (6 of 20 flags), close to both Schwarzschild (28.6%) and Rindler (33.3%). Our three problems span textbook GR, coordinate-chart-plus-QFT, and regulator-choice QFT, which are very different difficulty textures. The confirmation rate is surprisingly stable across them.
A ~30% band suggests single-adversary flags systematically overstate real errors by ~3.3×. That ratio is not a universal constant — it will shift with adversary panel size, prompt calibration, and problem type — but the current pipeline's false-positive ratio is in that rough ballpark.
The "discriminative power scales with problem novelty" hypothesis did not survive bug-fixed re-analysis. We initially observed a larger spread across the three problems in a buggy Layer 4b run; the fixed run lands them in a narrow band.
1 of the 6 Casimir "confirmed" errors was a meta-claim about self-correction (GPT-5.4 claim #48), which manual review treats as a weaker case. The unambiguous-physics-error count is therefore 5 of 20 (25%), a drop of one in six (about 17%) from the confirmed count, which is consistent with our headline that manual review further narrows confirmed-rate claims by roughly 15-25%.
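The arithmetic behind these Casimir figures is short enough to reproduce directly. A minimal sketch using only the counts reported on this page (the function and key names are illustrative):

```python
from fractions import Fraction

def confirmation_summary(flagged: int, confirmed: int, meta_claims: int) -> dict:
    """Derive the headline ratios from raw flag counts, kept as exact fractions."""
    rate = Fraction(confirmed, flagged)
    return {
        "confirmation_rate": rate,
        "overstatement_factor": 1 / rate,
        "unambiguous_rate": Fraction(confirmed - meta_claims, flagged),
        "relative_drop": Fraction(meta_claims, confirmed),
    }

summary = confirmation_summary(flagged=20, confirmed=6, meta_claims=1)
for name, value in summary.items():
    print(f"{name}: {value} = {float(value):.3f}")
```

With counts this small, moving a single claim in or out of the confirmed set shifts the rate by five percentage points, so the ratios are best read as rough indicators.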
The real headline: cross-family confirmation recovers roughly 30% of flagged errors at the current pipeline settings, and manual review narrows this to 15-25% unambiguous physics errors — meaningfully better signal than a final-answer check, and meaningfully worse than a symbolic verifier. Which is why Layer 3 is next.
Methodology Limitations
Read this before citing any number on this page. These limitations are ordered roughly by how much they should move your priors.
• No symbolic verification
Every verdict in this pipeline is produced by an LLM reading LaTeX. There is no SymPy check, no theorem prover. "Confirmed error" means two LLMs from different families independently flagged a step as wrong; it does not mean the step is mathematically wrong. Layer 3 (symbolic verification) is the top roadmap item.
• Stochastic verdicts
Running Layer 4b twice on the same output can produce different confirmation counts due to LLM sampling variance (observed ±20% across reruns). Reported numbers are single-run.
• Parse failure rate
Approximately 33% of Layer 4b calls returned unparseable responses (primarily from o3-mini on OpenRouter). Parse failures are treated as non-confirmations — which systematically under-counts real errors.
• Manual verification required
Confirmed errors should be manually spot-checked before use. Not all "confirmed" errors are unambiguous physics errors — some are notation or convention disagreements between adversaries, and at least one Casimir "confirmed" error is a meta-claim about self-correction rather than a primary derivation step.
• Ground truth in prompt
The final-answer check is a regex for the formula the problem statement asked the model to derive. Passing this check means the model produced a matching formula, not that the derivation constitutes a valid proof.
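A sketch of why such a check is weak evidence follows. The pattern below is hypothetical (the pipeline's actual regex is not shown on this page), and both sample responses are invented for illustration.

```python
import re

# Hypothetical pattern: matches "-\pi^2 \hbar c / (240 d^4)" with flexible spacing.
FINAL_ANSWER = re.compile(
    r"-\s*\\pi\^2\s*\\hbar\s*c\s*/\s*\(\s*240\s*d\^4\s*\)"
)

def passes_final_answer_check(response: str) -> bool:
    """Return True if the target formula appears anywhere in the response text."""
    return FINAL_ANSWER.search(response) is not None

# Two invented responses: same final formula, different intermediate reasoning.
sound = r"Using \zeta(-3) = 1/120 we obtain F/A = -\pi^2 \hbar c / (240 d^4)."
unsound = r"Using \zeta(-1) = -1/12 we obtain F/A = -\pi^2 \hbar c / (240 d^4)."

print(passes_final_answer_check(sound), passes_final_answer_check(unsound))
```

Both strings satisfy the pattern even though the second cites an intermediate value that does not lead to the stated formula, which is the gap the per-claim review layers are meant to cover.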
Raw Data
Audit every step.
Every Layer 1 response, the full claim decomposition, every Layer 4 verdict, and every Layer 4b confirmation for this problem is on GitHub: