We Asked Four Models to Break Our Math. Two of Them Did.
Running the Epistemic Compression Score framework through two rounds of adversarial review — what held, what didn't, and what we didn't catch ourselves.
Nine days into learning how large language models work, and some math started mathing. The ensemble divergence experiment has real data. The Epistemic Compression Score has a formal definition. Time to find out if the math actually holds.
We ran the ECS framework through two rounds of adversarial review using models from different lineages as reviewers. Here's what happened.
The Finding
Before the process: the result that matters most.
The invariance check exposed that 16 of 20 models in the Run 3 ensemble cannot be robustly classified as compressors or dissectors. Their classification depends on which anchor value you use for the contested pair tier. Only four models are stable across the full anchor range:
Always compressors: Tiny Aya, Phi-4
Always dissectors: Skywork Pro, Z-glm-5
Everyone else crosses the sign boundary depending on whether you anchor contested pairs at 0.30, 0.50, or 0.70. That's not a small caveat. That's most of the ensemble.
The framework still works; it just means you can't slap a compressor/dissector label on a model and call it done. The four stable models are genuinely stable. The epistemic instability gradient pattern (models disagree more on contested pairs than on calibration pairs: calibration stdev 0.057, contested stdev 0.208) doesn't move. D15 oral tradition remains the highest-variance pair in the dataset regardless of anchor. Those are robust. The per-model classification is anchor-dependent for the middle of the distribution.
We knew the 0.50 anchor was an assumption. The invariance check quantified exactly how much it matters.
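To make that concrete, here's a minimal sketch of the invariance check, assuming classification comes from the sign of a model's mean ECS over the contested tier. The aggregation rule, array layout, and every name below are assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np

ANCHORS = (0.30, 0.50, 0.70)  # candidate ground anchors for the contested tier

def label(mean_ecs: float) -> str:
    # Framework sign convention: ECS = S_model - S_ground, so a positive
    # mean over contested pairs reads as compressor, negative as dissector.
    return "compressor" if mean_ecs > 0 else "dissector"

def invariance_check(scores: np.ndarray, names: list) -> dict:
    """Label each model, or mark it 'unstable' if the label flips with the anchor.

    `scores` is (n_models, n_contested_pairs), values in [0, 1].
    """
    out = {}
    for name, row in zip(names, scores):
        labels = {label(row.mean() - a) for a in ANCHORS}
        out[name] = labels.pop() if len(labels) == 1 else "unstable"
    return out

# Fabricated demo scores for two hypothetical models:
print(invariance_check(
    np.array([[0.85, 0.90, 0.80],    # mean 0.85: positive ECS at every anchor
              [0.45, 0.60, 0.55]]),  # mean ~0.53: sign flips between anchors
    ["model_a", "model_b"],
))  # {'model_a': 'compressor', 'model_b': 'unstable'}
```

Under this aggregation rule, a model is stable only if its mean contested score sits above 0.70 or below 0.30, which would explain why most of the ensemble lands in the unstable middle.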
The Process
Round one: where it all began. GPT-4 (native API) was front-loaded with a context-injection packet covering the Lossyscape Framework, context-injection best practices, the Bias Signal Assessment protocol, and raw numbers from rounds one through three of the Divergence Experiments. The model kicked out what may have been hallucinated formulae, so I brought in Nemotron-3 for assistance. It ran the numbers cold and saw a signal. From that, versions 0.1 and 0.2 were born through cross-examination with Claude Sonnet 4.6, which carried a very heavy 72-hour load of Framework and BSA data, alongside GPT-5.2 (DeepThink/Web on high, DeepSeek API). The next step was to get mathing. Hard. I sent the ECS framework document and the Run 3 score matrix to Nemotron-3-Super-120B through the Perplexity API. Clean session. No Atlas context. Just the math and the numbers.
Nemotron came back with four findings: the T2 anchor lacks formal justification, the proposed T1 recalibration introduces ensemble dependence, the RLHF correlation was overinterpreted, and the PCA axis labels are over-committed without a rotation check. All four were correct. We incorporated them into v0.3 of the technical addendum.
Round two: we built a full cross-examination packet (the v0.3 document, the complete Divergence Experiment Run 3 raw 30×20 score matrix, the computed Epistemic Compression values, and the prior Nemotron findings with the document's responses) and sent it to GPT-5.2 and Nemotron-3 cold. Different session, no prior context, a very restrictive prompt, and five specific questions, including: verify the math independently, run the invariance check, and compute the relative compression gradient for all 20 models.
GPT-5.2 flagged a discrepancy in the Tiny Aya computed values. Nemotron-3 got the same numbers as the original table. A recount from the source CSV confirmed it: Nemotron was right; GPT-5.2 had miscounted in manual arithmetic. That's something worth looking into.
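The cheap way to settle that kind of dispute is to never let the arithmetic live only in a model's head: recompute straight from the source file. A sketch, with the file and column names ("run3_scores.csv", "pair_id", "tiny_aya", "ground") invented for illustration:

```python
import csv

# Recompute one model's ECS values directly from the source CSV rather than
# trusting any reviewer's manual arithmetic. All names here are hypothetical.
with open("run3_scores.csv", newline="") as f:
    rows = list(csv.DictReader(f))

recount = {r["pair_id"]: float(r["tiny_aya"]) - float(r["ground"]) for r in rows}
for pair_id, ecs in sorted(recount.items()):
    print(f"{pair_id}: ECS = {ecs:+.3f}")
```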
Both models independently identified something we hadn't caught: the sign-interpretation framework for the Epistemic Compression Score is only fully valid for the contested pair tier. For calibration pairs, ECS can never be positive; compression is mathematically impossible when the anchor is 1.0 and scores are bounded at 1.0. For foil pairs, ECS can never be negative. The document was treating the sign interpretation as universal when it only fully applies to the middle tier. That goes in v0.4.
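The boundary argument fits in a few lines. A sketch, assuming scores bounded in [0, 1] and a foil anchor of 0.0 (which is what the never-negative bound implies):

```python
def ecs(s_model: float, s_ground: float) -> float:
    assert 0.0 <= s_model <= 1.0  # scores are bounded
    return s_model - s_ground

# Calibration tier: anchor 1.0, so ECS = S_model - 1.0 <= 0, never positive.
print(ecs(0.90, 1.0))   # -0.10
# Foil tier: anchor 0.0, so ECS = S_model - 0.0 >= 0, never negative.
print(ecs(0.20, 0.0))   # +0.20
# Contested tier: anchor strictly inside (0, 1), so the sign is informative.
print(ecs(0.60, 0.50))  # +0.10
```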
GPT-5.2 also noted that the Structural Dissector Index and Topic Matcher Index, two behavioral classification metrics, are algebraically dependent: SDI + TMI = 1 by construction. They carry the same information. Presenting them as two separate diagnostic instruments implies an independence that doesn't exist. One of them is redundant.
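The redundancy takes one line to demonstrate, with hypothetical per-model values:

```python
sdi = [0.62, 0.41, 0.55]      # hypothetical Structural Dissector Index values
tmi = [1.0 - x for x in sdi]  # TMI is fully determined by SDI: TMI = 1 - SDI
print(all(abs(s + t - 1.0) < 1e-12 for s, t in zip(sdi, tmi)))  # True
```

Reporting both is like reporting a probability and its complement as two separate instruments.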
Both findings are in v0.4.
What Changed
The core formula didn't move. ECS(i,m) = S_model(i,m) − S_ground(i). The epistemic instability gradient is real. The D15 oral tradition finding is robust. The ensemble divergence data holds.
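In code, the whole computation is one broadcast subtraction. A sketch using the Run 3 shape (30 pairs × 20 models); the stand-in data and all names are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
s_model = rng.uniform(size=(30, 20))  # stand-in for the 30x20 score matrix
s_ground = np.full(30, 0.50)          # per-pair ground values, e.g. the T2 anchor
s_ground[:5] = 1.0                    # e.g. calibration pairs anchored at 1.0

ecs = s_model - s_ground[:, None]     # ECS(i, m) = S_model(i, m) - S_ground(i)
print(ecs.shape)                      # (30, 20): one ECS value per pair per model
```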
What changed is the precision of the claims around it:
The invariance check is now a required step before interpreting any per-model ECS classification — not optional. If the check shows anchor dependence (it does, for 16 of 20 models), you can only make robust claims about the extreme ends of the distribution.
The RLHF correlation is now a documented limitation rather than a finding. With binary coding and n=20, r≈0.03 establishes limited power, not independence (quantified in the sketch below).
The PCA axis labels are working interpretations pending rotation check — not findings.
The sign interpretation for ECS now specifies which tier it applies to fully.
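For the RLHF item, "limited power" is quantifiable: at n=20, the 95% confidence interval around r≈0.03 under the standard Fisher z transform spans roughly −0.42 to +0.47, consistent with anything from a moderate negative to a moderate positive relationship. A sketch of the arithmetic (not the document's own analysis code):

```python
import math

r, n = 0.03, 20
z = math.atanh(r)                # Fisher z transform of r
half = 1.96 / math.sqrt(n - 3)   # half-width of the 95% CI in z-space
lo, hi = math.tanh(z - half), math.tanh(z + half)
print(f"95% CI for r: ({lo:+.2f}, {hi:+.2f})")  # about (-0.42, +0.47)
```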
The technical addendum is at v0.4. The version history is in the document. The audit trail is the point.
What's Next
Run 4 is designed. 17 models, same 30 pairs, different lineage mix — heavier on OpenAI variants including reasoning models, Amazon intra-lab comparison (Nova-2-Lite vs Nova Pro), Nvidia Nemotron, Arcee Trinity. The reasoning model question is the one to watch: do o3 and o4 Mini score the high-divergence pairs differently from standard completion models? Chain-of-thought reasoning should increase sensitivity to structural contradiction. That's a testable prediction.
Run 5 will need Tier 3 foil pairs — fabricated claims that sound plausible. Run 3 has no T3 pairs, which means the full Compression Gradient formula can't be computed. Run 5 fixes that.
Raw data and session documentation are available on request. See the outreach page if you want to get in touch.
Atlas Heritage Systems Inc. — Endurance. Integrity. Fidelity.