Epistemic Compression Score: When Models Make Opposites Sound the Same
Large language models are very good at making things sound smooth. Sometimes that's exactly what we want. Sometimes it is the problem.
There is a specific failure mode I care about: when a model collapses real disagreements into fake agreement. Two claims are far apart in reality, but the model treats them as "basically the same." That's what I call epistemic compression, and Epistemic Compression Score (ECS) is my attempt to measure it.
The Problem: Epistemic Compression
Modern models are trained mostly on English-language Western sources and then aligned to be broadly helpful and non‑confrontational. That combination encourages a particular kind of smoothing:
- If two stories are about the same topic,
- and one comes from the majority frame and one from a minority or indigenous frame,
- the model learns that treating them as "compatible perspectives" is usually rewarded more than surfacing the conflict between them.
Over time, the model starts to flatten distinctions between epistemologically incompatible claims. Contradictions get turned into "different angles on the same thing."
The Atlas framework describes this as remagnetization: the model's probability mass being pulled toward the statistical center, away from the edges where idiosyncratic and culturally specific expression lives. Epistemic Compression Score is a way of putting a number on how strong that pull is for a given pair of claims.
What Epistemic Compression Score Measures
At a high level, ECS answers a simple question:
"For this pair of claims, is the model making them sound too similar or too different, relative to how they actually relate?"
To do that, you need three pieces:
Ground truth structural similarity. Before you ask the model anything, you decide how similar the pair should be, given reality. That can come from tier labels in your experiment design, human annotations, or simple structural tags like "contradiction," "rephrase," "independent." This gives you a target similarity score: the anchor.
Model-perceived similarity. You ask the model to rate how similar the two claims are in meaning on a 0.00–1.00 scale. That's what the model thinks their relationship is.
The gap between the two. If the model's score is much higher than the anchor, it's over‑smoothing: compressing real difference. If it's much lower, it's over‑separating: exaggerating difference where there isn't much. If it matches, it's calibrated on that pair.
Formally, the Epistemic Compression Score for a pair is a signed number capturing this gap:
- Positive ECS → the model says "these are closer than they should be" (compression).
- Negative ECS → the model says "these are further apart than they should be" (dissection).
- Zero ECS → the model is aligned with the ground truth anchor on that pair.
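That signed gap can be sketched in a few lines. The function name and the 0.05 anchor for a contradiction pair are illustrative, not taken from the addendum:

```python
def ecs(model_similarity: float, anchor_similarity: float) -> float:
    """Signed per-pair gap between model-perceived and anchor similarity.

    Positive -> compression (the claims sound closer than they should).
    Negative -> dissection (the claims sound further apart than they should).
    Zero     -> calibrated on this pair.
    """
    return model_similarity - anchor_similarity

# A contradiction pair anchored very low, scored high by the model:
gap = ecs(model_similarity=0.80, anchor_similarity=0.05)
print(f"ECS = {gap:+.2f}")  # positive sign: this model is compressing
```

Both inputs live on the same 0.00–1.00 similarity scale, so the gap is bounded in [-1, +1].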
ECS is defined per pair, per model. It does not replace ensemble disagreement (standard deviation across models). It complements it: stdev tells you "how much models disagree with each other," ECS tells you "how this particular model disagrees with reality, and in which direction."
A Concrete Example
Take a familiar epistemic clash, simplified:
Claim A: "Oral traditions are unreliable historical sources because they change with each retelling."
Claim B: "Oral traditions are high‑fidelity transmission systems that encode information in rhythm, repetition, and social performance, changing in surface detail while preserving deep structure across generations."
In most experimental designs, this pair is anchored as structurally opposed: one treats change as noise, the other treats change as part of an error-correcting code. Ground truth structural similarity should be very low.
Now imagine two models:
- Model 1 gives this pair a similarity score of 0.80 ("they're basically the same idea: oral stories change over time").
- Model 2 gives it a 0.05 ("these are almost opposites").
Both are fluent. Both can explain their answers. But in ECS terms:
Model 1 has a strongly positive ECS on this pair: it has compressed a deep disagreement into a fake consensus. Model 2's ECS is near zero, or mildly negative depending on the exact anchor: it keeps the conflict sharp.
If all you look at is the average similarity score across models, you might conclude "the ensemble thinks these two claims are moderately similar." ECS lets you see that the ensemble average is hiding the fact that some models are collapsing the difference and others are preserving it.
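That hiding effect is easy to make concrete with the two hypothetical models above. The anchor value is an assumption for a structurally opposed pair, not a number from the study:

```python
anchor = 0.05  # assumed anchor for a structurally opposed pair
model_scores = {"Model 1": 0.80, "Model 2": 0.05}

# The ensemble average looks moderate and hides the split:
ensemble_mean = sum(model_scores.values()) / len(model_scores)
print(f"ensemble mean similarity: {ensemble_mean:.2f}")

# Per-model ECS shows who is compressing and who is calibrated:
for name, score in model_scores.items():
    gap = score - anchor
    verdict = ("compressing" if gap > 0.1
               else "dissecting" if gap < -0.1
               else "calibrated")
    print(f"{name}: ECS = {gap:+.2f} ({verdict})")
```

The mean sits around 0.4, which reads as "moderately similar" even though neither model actually said that.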
How ECS Fits With Divergence
In the Atlas Divergence Test runs, the main metric so far has been spread: the standard deviation of similarity scores across models for each pair, aggregated by category. That's where the staircase comes from: controls show low spread, cross‑cultural pairs higher, and erasure and epistemic clashes highest.
ECS adds another layer:
Divergence (spread) tells you: "On this pair, do models agree with each other or not?"
ECS tells you: "On this pair, is this particular model leaning toward over‑smoothing or over‑separating, relative to the anchor?"
For example:
- You might see high spread on a pair because some models are strongly compressing and others are strongly dissecting.
- You might see low spread on a pair where all models are compressing in the same direction, which is even more worrying from an epistemic standpoint.
The combination of divergence and ECS lets you see both where the ensemble fractures, and where it collapses in one direction together.
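Those cases can be sketched as a simple triage function. The threshold values and the score lists are invented for illustration; they are not the addendum's cutoffs:

```python
import statistics

def diagnose(pair_scores, anchor, spread_cut=0.15, ecs_cut=0.10):
    """Classify a pair using ensemble spread plus mean ECS direction."""
    spread = statistics.pstdev(pair_scores)
    mean_ecs = statistics.mean(s - anchor for s in pair_scores)
    if spread >= spread_cut:
        return "fractured: models disagree with each other"
    if mean_ecs >= ecs_cut:
        return "collapsed: all models compressing together"
    if mean_ecs <= -ecs_cut:
        return "collapsed: all models dissecting together"
    return "calibrated: low spread, near-zero ECS"

# High spread: some models compress, others keep the conflict sharp.
print(diagnose([0.80, 0.10, 0.75, 0.05], anchor=0.10))
# Low spread but shared compression: the quieter, more worrying case.
print(diagnose([0.70, 0.75, 0.72, 0.78], anchor=0.10))
```

The second case is the one a spread-only analysis would miss entirely: the ensemble looks consistent precisely because every model is wrong in the same direction.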
What Happened When We Actually Computed ECS
The v0.4 addendum reports on running ECS over a 20‑model dataset, plus adversarial reviews from Nemotron‑3 and a separate read from GPT‑5.2. Three technical findings matter for the story (without the math):
Anchor matters, but not for everyone. When you change how you define the ground truth anchor, most models change the sign of their ECS on key pairs at least once. Only a small subset are robustly positive (always compressing) or robustly negative (always dissecting), regardless of which reasonable anchor you use. In this dataset, Tiny Aya and Phi‑4 are robustly positive; Skywork Pro and Z‑glm‑5 are robustly negative.
Variance is local. The algebraic relationship between ECS and raw model scores holds within a pair, not across all pairs. That means you can't just average ECS blindly across every item; you need to respect category and anchor design.
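One way to respect that locality is to aggregate ECS only within a category (and its anchor design), never across the whole dataset in one pass. The category names and scores below are invented for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-pair records: each pair carries its own anchor.
pairs = [
    {"category": "control",   "anchor": 0.90, "model_score": 0.88},
    {"category": "control",   "anchor": 0.85, "model_score": 0.86},
    {"category": "epistemic", "anchor": 0.05, "model_score": 0.80},
    {"category": "epistemic", "anchor": 0.10, "model_score": 0.70},
]

# Group signed gaps by category before averaging anything.
by_category = defaultdict(list)
for p in pairs:
    by_category[p["category"]].append(p["model_score"] - p["anchor"])

for cat, gaps in by_category.items():
    print(f"{cat}: mean ECS = {mean(gaps):+.2f} over {len(gaps)} pairs")
```

A blind average over all four pairs would blend a calibrated category with a heavily compressed one and report something in between, which is exactly the mistake the addendum warns against.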
Derived indices are not independent. Two of the proposed classification indices turned out to be linear transforms of the same underlying quantity. They're still useful for summarizing behavior, but they're not independent metrics; the addendum flags that explicitly.
There's also a quietly reassuring result: when GPT‑5.2 and Nemotron‑3 were given the full addendum and raw score matrix with no Atlas context, they agreed on a data discrepancy that turned out to be a manual arithmetic error, not a conceptual flaw. The underlying data were correct.
All 30 Run 3 pairs were originally drafted by a single language model under prompt constraints that specified tier and epistemic structure (ALIGN, CONTEST, ORTHO, CONTRA). That drafting model is excluded from all subsequent scoring runs; ECS and divergence are computed only from models that encounter the pairs as fixed, externally supplied text.
Why Any of This Matters
ECS is a technical tool, but it's aimed at a very practical question:
"If we align models to be smoother and more agreeable, how do we tell when they've started erasing important differences?"
Without a metric like ECS, it's easy to look at a model that produces polite, coherent answers about contested topics and conclude "it understands both sides." In reality, it may be compressing them into a bland midpoint that satisfies preference data but destroys the structure of the underlying knowledge.
By combining ensemble divergence (where do models disagree with each other?) and Epistemic Compression Score (where does this model disagree with the ground truth anchor, and in which direction?), you get a first rough map of where alignment is acting as a cultural flattening layer rather than just a safety improvement.
Where the Math Lives
The full technical details — formulas, edge cases, anchor definitions, invariance checks, and cross‑examination notes — live in the Epistemic Compression Score Technical Addendum v0.4. It is written so that someone with the raw score matrix and no other context can recompute everything from scratch.
The blog version is just the story: models can make opposites sound the same. That's a measurable behavior, not a vibe. ECS is one way to put a number on it, so we can stop arguing in the abstract and start looking at where, and how much, compression is actually happening.
Raw data and session documentation available on request. See the outreach page to get in touch.
Atlas Heritage Systems Inc. — Endurance. Integrity. Fidelity.