The Last Question — № 003

The Casimir Force

F/A = -π² ℏ c / (240 d⁴)
Quantum Field Theory · Graduate level · 7 Contestants · 6 Confirmed / 20 Flagged

The Problem

A regulator choice. A formally infinite sum. A finite answer.

Two parallel perfectly conducting plates, separation d, infinite transverse extent. Between them, the quantized electromagnetic field. Each mode contributes ℏω/2 to the vacuum energy — and the total is formally infinite. The derivation asks the contestant to make sense of that infinity and extract the finite, physically meaningful piece.

The derivation therefore turns on a regularization choice: zeta-function regularization, exponential (or Gaussian) cutoff, dimensional regularization, or Euler-Maclaurin summation. Each choice requires a consistent subtraction of the free-vacuum contribution, a careful analytic continuation or cutoff-independence argument, and an extraction of the finite part. The final result F/A = -π²ℏc/(240 d⁴) is a cornerstone of experimental QFT — measured in 1997 by Lamoreaux to 5% accuracy.
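As a hedged illustration of cutoff independence, the sketch below uses a toy version of the problem. It assumes the transverse momentum integral has already been done, so that only a sum of n³-like weights remains, and it regulates that sum with an exponential cutoff e^(-εn). The function name and the specific cutoff are illustrative assumptions, not any contestant's derivation. Subtracting the continuum integral (the stand-in here for the free-vacuum contribution) leaves a finite part that approaches 1/120 as ε → 0.

```python
import math

def cutoff_finite_part(eps: float) -> float:
    """Return sum_{n>=1} n^3 e^{-eps n} minus the continuum integral 6/eps^4."""
    # Terms beyond n ~ 60/eps are suppressed by e^{-60} and are negligible.
    n_max = int(60 / eps)
    regulated_sum = math.fsum(
        n**3 * math.exp(-eps * n) for n in range(1, n_max + 1)
    )
    # Integral of x^3 e^{-eps x} over [0, inf): the divergent "free" piece.
    continuum = 6.0 / eps**4
    return regulated_sum - continuum

if __name__ == "__main__":
    for eps in (0.2, 0.1, 0.05):
        finite = cutoff_finite_part(eps)
        print(f"eps={eps:<5} finite part={finite:.6f}  (1/120={1/120:.6f})")
```

The subtracted 6/ε⁴ piece depends strongly on the regulator, while the remainder changes only at order ε², which is the behaviour a cutoff-independence argument has to establish.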

Casimir is therefore conceptually harder than Schwarzschild (which is algebraic) or Rindler / Unruh (which is coordinate-chart construction plus mode decomposition). It tests how models handle regularization: do they state their regulator explicitly? Subtract consistently? Show cutoff independence? Confuse the regulator parameter with a physical length?

Results at a glance

Seven contestants. One regulator. Many ways to get it wrong.

Model         Response Length   Claims   Survival Rate   Confirmed / Flagged
Claude        9,364 ch          52       n/a             2 / 5
GPT-5.4       7,820 ch          52       n/a             2 / 3
o3-mini       3,424 ch          42       n/a             0 / 5
Gemini        6,109 ch          28       n/a             0 / 2
DeepSeek R1   8,447 ch          42       n/a             1 / 2
Qwen          5,290 ch          40       n/a             0 / 1
Grok          6,755 ch          30       n/a             1 / 2

Survival rates omitted for this problem; per-contestant rates were not computed in this run. Full per-claim verdicts are on GitHub.

What cross-family confirmation caught

Six errors that survived independent re-review.

Of 20 errors flagged by single adversaries, 6 held up under independent re-review by a second adversary from a different model family. The errors cluster where you'd expect them: confusion between regulator parameters and physical lengths, misapplied Euler-Maclaurin corrections, and overclaims about the rigor of cutoff-independence arguments.
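The confirmation rule described above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the `FAMILY` mapping, the function name, and the use of `None` for an unparseable second review are assumptions made for the example.

```python
from typing import Optional

# Assumed family labels; the page does not publish its model-to-family mapping.
FAMILY = {
    "Claude": "anthropic",
    "GPT-5.4": "openai",
    "o3-mini": "openai",
    "Gemini": "google",
    "DeepSeek R1": "deepseek",
    "Qwen": "alibaba",
    "Grok": "xai",
}

def is_confirmed(first_flagger: str, second_reviewer: str,
                 second_verdict: Optional[str]) -> bool:
    """True if a reviewer from a different family also returns ERROR."""
    if second_verdict is None:
        # An unparseable second review counts as a non-confirmation.
        return False
    different_family = FAMILY[first_flagger] != FAMILY[second_reviewer]
    return different_family and second_verdict == "ERROR"

if __name__ == "__main__":
    print(is_confirmed("DeepSeek R1", "GPT-5.4", "ERROR"))
    print(is_confirmed("GPT-5.4", "o3-mini", "ERROR"))
```

Under this rule a same-family second opinion never confirms a flag, however confident it is.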

Claude · cutoff independence · claim #27

Claimed "all higher-order Euler-Maclaurin corrections vanish, justifying the regularization". This overstates the argument. The cutoff-independence of the finite part comes from subtracting the divergent pieces of the free-vacuum contribution, not from the Euler-Maclaurin corrections automatically vanishing. The finite answer is recovered; the justification attached to it is wrong.

First flag: DeepSeek R1 · Confirmed by: GPT-5.4

Claude · Euler-Maclaurin · claim #28

Wrote the Euler-Maclaurin summation correction as -1/180, but omitted the necessary F'''(0) factor. The correction term is a Bernoulli coefficient multiplied by a derivative of the summand evaluated at zero — dropping the derivative misapplies the correction.

First flag: Grok · Confirmed by: Gemini

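For reference, the Euler-Maclaurin expansion in the form commonly used for this problem is written out below. It assumes a summand F(n) that vanishes, together with its derivatives, as n → ∞, and it uses the textbook half-weight on the n = 0 term; whether the contestant used exactly this setup is an assumption. B₂ = 1/6 and B₄ = -1/30 are the Bernoulli numbers.

```latex
\frac{1}{2}F(0) + \sum_{n=1}^{\infty} F(n) - \int_{0}^{\infty} F(n)\,dn
  = -\frac{B_2}{2!}\,F'(0) - \frac{B_4}{4!}\,F'''(0) - \cdots
  = -\frac{1}{12}\,F'(0) + \frac{1}{720}\,F'''(0) - \cdots
```

Every correction is a product of a Bernoulli coefficient and an odd derivative of the summand at zero, so a derivation step should show both factors.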
GPT-5.4 · regulator vs plate separation · claim #33

Wrote U/A = -ℏc a³ / (720π), confusing the regularization cutoff parameter a with the plate separation d. This is the canonical trap of regularized field theory: the regulator is a book-keeping parameter that must drop out of the final answer. Here the dimensional analysis alone flags it — and the π coefficient is also wrong.

First flag: Claude · Confirmed by: DeepSeek R1

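A short worked check for comparison, assuming the usual convention that the force per unit area is minus the derivative of the energy per unit area with respect to the plate separation d. The energy expression is the textbook form consistent with the final result quoted on this page; it is not taken from any contestant's response.

```latex
\frac{U}{A} = -\frac{\pi^{2}\hbar c}{720\,d^{3}},
\qquad
\frac{F}{A} = -\frac{\partial}{\partial d}\!\left(\frac{U}{A}\right)
            = -\frac{\pi^{2}\hbar c}{240\,d^{4}} .
```

Since ℏc has units of energy times length, ℏc/d³ is an energy per unit area, whereas ℏc multiplied by a length cubed is not. That is the dimensional mismatch noted above.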
GPT-5.4 · meta-claim · claim #48

A self-correction meta-claim confirmed by two adversaries as "ERROR". On manual re-reading this is a weaker case than the other five — the model is commenting on its own earlier move rather than making a primary derivation step. We include it in the confirmed count for methodological consistency, but it should be treated as a weaker signal and not as an unambiguous physics error.

First flag: Grok · Confirmed by: Claude

DeepSeek R1 · regulator notation · claim #21

Introduced two different regulator symbols (a and β) in adjacent steps without defining either precisely, then shifted between them inconsistently. The final result is nonetheless correct, but the intermediate derivation is not self-consistent in its regulator bookkeeping.

First flag: Qwen · Confirmed by: Claude

Grok · zeta regularization argument · claim #11

Wrote the zeta-regularized sum as evaluated at ζ(-1). But the parallel-plates mode sum of n³-like weights analytically continues to ζ(-3), not ζ(-1). ζ(-3) = 1/120; ζ(-1) = -1/12. Choosing the wrong ζ argument changes the final coefficient, and yet Grok still wrote the correct final π²/240 coefficient, which is inconsistent with what would have followed from the stated argument.

First flag: DeepSeek R1 · Confirmed by: GPT-5.4

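A sketch of where the ζ(-3) in the Grok item comes from, under stated assumptions: two polarizations per mode (so the factor ½ in ℏω/2 cancels), the transverse integral continued analytically in an exponent s, and the n = 0 term dropping out after continuation. The symbols k_⊥ and s are introduced here for illustration only.

```latex
\frac{U}{A}
  = \hbar c \sum_{n=1}^{\infty} \int \frac{d^{2}k_{\perp}}{(2\pi)^{2}}
    \left[k_{\perp}^{2} + \left(\frac{n\pi}{d}\right)^{2}\right]^{-s}
    \Bigg|_{s \to -1/2}
  = -\frac{\hbar c}{6\pi} \sum_{n=1}^{\infty} \left(\frac{n\pi}{d}\right)^{3}
  = -\frac{\pi^{2}\hbar c}{6\,d^{3}}\,\zeta(-3)
  = -\frac{\pi^{2}\hbar c}{720\,d^{3}} .
```

Substituting ζ(-1) = -1/12 in the last step would instead give +π²ℏc/(72 d³), which has the wrong sign and the wrong magnitude.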
What this tells us

The confirmation rate across three problems lands in a narrow band.

Methodology Limitations

Read this before citing any number on this page. These limitations are ordered roughly by how much they should move your priors.

• No symbolic verification

Every verdict in this pipeline is produced by an LLM reading LaTeX. There is no SymPy check, no theorem prover. "Confirmed error" means two LLMs from different families independently flagged a step as wrong; it does not mean the step is mathematically wrong. Layer 3 (symbolic verification) is the top roadmap item.

• Stochastic verdicts

Running Layer 4b twice on the same output can produce different confirmation counts due to LLM sampling variance (observed ±20% across reruns). Reported numbers are single-run.

• Parse failure rate

Approximately 33% of Layer 4b calls returned unparseable responses (primarily from o3-mini on OpenRouter). Parse failures are treated as non-confirmations, which systematically under-counts confirmed errors.

• Manual verification required

Confirmed errors should be manually spot-checked before use. Not all "confirmed" errors are unambiguous physics errors — some are notation or convention disagreements between adversaries, and at least one Casimir "confirmed" error is a meta-claim about self-correction rather than a primary derivation step.

• Ground truth in prompt

The final-answer check is a regex for the formula the problem statement asked the model to derive. Passing this check means the model produced a matching formula, not that the derivation constitutes a valid proof.
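To make the last limitation concrete, here is a minimal sketch of a regex-based final-answer check. The pattern and the function name are hypothetical; the pipeline's actual regex is not shown on this page.

```python
import re

# Hypothetical pattern for -\frac{\pi^2 \hbar c}{240 d^4}, with optional braces
# around the exponents and flexible whitespace.
FINAL_ANSWER = re.compile(
    r"-\s*\\frac\{\s*\\pi\^\{?2\}?\s*\\hbar\s*c\s*\}\{\s*240\s*d\^\{?4\}?\s*\}"
)

def passes_final_answer_check(response: str) -> bool:
    """True if the response contains a formula matching the target result."""
    return FINAL_ANSWER.search(response) is not None

if __name__ == "__main__":
    good = r"Therefore F/A = -\frac{\pi^2 \hbar c}{240 d^4}."
    # A flawed argument that still ends on the right formula also passes.
    flawed = r"Using \zeta(-1) we get F/A = -\frac{\pi^{2}\hbar c}{240 d^{4}}."
    print(passes_final_answer_check(good), passes_final_answer_check(flawed))
```

A check of this kind inspects only the final string, so it cannot distinguish a sound derivation from one that reaches the target formula through an inconsistent step.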

Raw Data

Audit every step.

Every Layer 1 response, the full claim decomposition, every Layer 4 verdict, and every Layer 4b confirmation for this problem is on GitHub:

github.com/themultivac/multivac-evaluation/tree/main/data/physics_synthesis/PHYSICS-SYNTH-casimir-v0-20260418-180335

If you find an error in our methodology or our physics, email yash@themultivac.com. We would rather be corrected than be wrong.