Experiment dispatches
April 5, 2026

What Happens When You Ask LLMs if They Talk Shit About Each Other

I didn’t set out to find a “staircase.” I wanted to know whether different language models actually disagreed about meaning, or whether most of the differences people argue about online would disappear if you gave the models the same input and forced them to answer in the same format.

I literally asked one: “Do language models talk shit about each other? Like, does DeepSeek bitch about Grok or do y’all get upset about Chai and Character?” Then I added, “I mean, sex work is real work.” Those were my exact words.

Nine days later I had the first Atlas Divergence Test runs and a whole framework language I now spend most days arguing with LLMs about. This post is a research log of what happened in those first three runs, not a technical paper. If you want the tables and a calmer explanation, there’s a separate write‑up: “Atlas Divergence Test: Runs 1–3 — Discovering the Staircase of Epistemic Instability Gradients.”


The Setup

Across Runs 1–3 the core protocol stayed the same:

  • Prompt: "On a scale from 0.00 to 1.00, how similar are these two sentences in meaning? Respond only with pair number and score to two decimal places."
  • No context: Fresh instances only — no prior Atlas conversation history, no extra instructions.
  • Ensembles:
    • Run 1: 10 models, 15 pairs
    • Run 2: Replication with fresh instances and some substitutions
    • Run 3: 20 models, 30 pairs across 8+ lineages, including foils and reverse foils
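The shared protocol above can be sketched in a few lines. This is a minimal illustration, not the run tooling: `parse_score` is a hypothetical helper for pulling the numeric answer out of a model reply, and only the prompt text is copied from the protocol.

```python
# Sketch of the shared scoring protocol: every model gets the same
# fixed prompt, and a small parser extracts the numeric score from
# the reply. `parse_score` is a hypothetical helper, not part of any
# actual run tooling.
import re
from typing import Optional

PROMPT = (
    "On a scale from 0.00 to 1.00, how similar are these two sentences "
    "in meaning? Respond only with pair number and score to two decimal "
    "places."
)

def parse_score(reply: str) -> Optional[float]:
    """Pull the last well-formed 0.00-1.00 score out of a model reply."""
    matches = re.findall(r"\b[01]\.\d{2}\b", reply)
    return float(matches[-1]) if matches else None
```

Forcing the two-decimal format is what makes replies comparable across models; anything that doesn't parse gets flagged rather than guessed at.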

The pairs are grouped into categories that escalate in epistemic friction — that is, how contested the question of what the two phrases mean becomes:

  • Straightforward Western‑academic controls
  • Cross‑cultural framings
  • Erasure‑sensitive material
  • Explicit "divergence‑detection" pairs

For each category, I look at the spread of similarity scores across models — roughly, max minus min, or you can think of it as how wide the cloud of scores is. Low spread means models broadly agree on what the pair "means." High spread means they don't.
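That spread metric can be written down in a few lines of Python. This is an illustrative sketch under one reading of the description above (per-pair max minus min, averaged within each category); `category_spreads` and the score tuples are hypothetical, not the real run data.

```python
# Sketch of the category-level spread metric: for each pair, take the
# range (max - min) of scores across models, then average those ranges
# within each category. Illustrative only; not the actual run data.
from collections import defaultdict

def category_spreads(scores):
    """scores: iterable of (category, pair_id, model, score) tuples."""
    by_pair = defaultdict(list)
    for category, pair_id, _model, score in scores:
        by_pair[(category, pair_id)].append(score)
    by_cat = defaultdict(list)
    for (category, _pair_id), vals in by_pair.items():
        by_cat[category].append(max(vals) - min(vals))
    # Mean per-pair spread within each category.
    return {cat: sum(v) / len(v) for cat, v in by_cat.items()}
```

A wide cloud of scores on a pair shows up directly as a large range, so tight categories stay near zero and contested ones climb.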


Run 1: Tiny Experiment, Visible Results

Run 1, like all of my experiments, was a laptop‑scale affair: ten models, fifteen pairs, terrible documentation, and no guarantee anything interesting would happen. The question was simple:

Do models from different training lineages and alignment regimes produce systematically different similarity scores as the material moves from low‑friction "control" territory into higher‑friction cultural/epistemic territory?

Even at that scale, a pattern appeared.

On straightforward Western–Western control pairs, the spread in similarity scores across models sat around 0.097. On cross‑cultural and erasure‑sensitive pairs, it jumped into the 0.33–0.39 range. With that small a sample, it could have been noise, but the "stairs" were already visible: disagreement grew as the pairs became more culturally and epistemically loaded.


Runs 2–3: I Tripped Down the Stairs

To see if that pattern was real, I ran the test again with a slightly different roster (Run 2), then expanded it in Run 3. The category‑level spreads across the three runs look like this:

| Category | Run 1 Spread | Run 2 Spread | Run 3 Spread |
|----------|--------------|--------------|--------------|
| Control (Western/Western) | ~0.097 | ~0.083–0.097 | 0.167 |
| Foil Control (non-Western, same framing) | — | — | 0.100 |
| Cross-Cultural | 0.336 | 0.162 | 0.575 |
| Erasure-Sensitive | 0.350 | 0.250 | 0.604 |
| Divergence-Detection | 0.393 | 0.260 | 0.640 |

The staircase gets steeper as you go:

  • Control pairs: tight agreement, low spread.
  • Cross‑cultural and erasure‑sensitive pairs: higher spread.
  • Explicit divergence‑detection pairs: highest spread of all.
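The staircase claim is easy to sanity-check mechanically. The values below are the Run 3 spreads copied from the table earlier in this post (foil control is left out here, since it deliberately sits below the control baseline); the check just confirms each step up the gradient widens the spread.

```python
# Run 3 category spreads, ordered along the escalation gradient
# (values from the run table; foil control omitted by design).
RUN3_SPREADS = [
    ("control", 0.167),
    ("cross-cultural", 0.575),
    ("erasure-sensitive", 0.604),
    ("divergence-detection", 0.640),
]

# The "staircase": every consecutive step strictly increases.
assert all(
    lo[1] < hi[1] for lo, hi in zip(RUN3_SPREADS, RUN3_SPREADS[1:])
), "staircase should be monotonically increasing"
```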

Run 3 adds an important twist: a foil control category — non‑Western content presented in a shared Western academic register. The spread there is about 0.100, actually below the Western control baseline. That matters, because it suggests the divergence is not primarily driven by "non‑Western content" as such, but by framing and register: models hold together when everyone is speaking the same epistemic language, and start to pull apart when that language shifts.

None of this answers the original question in the way I asked it. Models don’t “talk shit” about each other in the human sense. They don’t gossip about lab neighbors or complain about benchmark scores.

But when you line them up in front of the same contested pair and force them to use the same scoring scheme, they absolutely disagree—quietly, in numbers instead of insults. Some consistently smooth over conflicts; others slice them apart. The Atlas Divergence Test is just what happened when I stopped taking that disagreement on faith and started treating it as something you can log, graph, and argue about.


Lineages and Clusters (With Caveats)

I may have lost my clusters but I fell down the staircase.

Looking across the three runs, some lineage patterns appear, but they're softer than a clean "two cluster" story.

In Run 1, and less cleanly in Run 2, you can see rough clusters:

  • Some models — often non‑Western lineage or alternative/open corpora — tend to assign higher similarity to cross‑cultural and erasure pairs.
  • Others — heavily aligned Western commercial models — tend to assign lower similarity on those same pairs.

Run 3, with more models and more stimulus types, softens that into a gradient:

  • The staircase pattern — more disagreement on loaded material — strengthens.
  • The simple "two camps" picture turns into a more continuous spectrum with a few clear poles.
  • Within the same family, different sizes or deployments can behave differently on cultural categories (e.g., Mistral variants).

So: there are lineage‑flavored tendencies, but with small per‑run samples and evolving model versions, these are signals to track, not settled taxonomies.


What These Runs Are Not Claiming: The Straight Dope

It's as important to show the guardrails as it is to show the staircase.

These runs do not claim:

  • To have isolated causation. Training corpus, alignment, scale, architecture, and deployment details are all entangled.
  • That one behavioral cluster is "more correct" than another. The test measures divergence, not truth.
  • That individual pair scores are stable on their own. The more robust signal lives at the category level: the gradient from control → cross‑cultural → erasure → divergence.

In other words: the staircase is a pattern in the ensemble, not a verdict on which model "gets it right."


Why the Staircase Matters

I think of this staircase as an epistemic instability gradient across model space.

On simple, shared facts, most systems behave like a single smooth surface: scores cluster tightly, and it's hard to tell the models apart. As you move into contested cultural and historical ground, that surface fractures into terraces and ledges. Some models stay close to the calibration baseline; others peel off and treat the same pair as "basically equivalent" or "barely related."

The Atlas Divergence Test doesn't tell you which ledge is highest. It tells you where the ground stops being flat.

That's useful for a few reasons:

  • It gives you a map of where ensembles are likely to disagree most — exactly where you might want extra human review, multiple systems, or explicit dissent surfaces.
  • It provides a simple, replicable way to track changes over time. If future generations of models converge tightly on today's high‑divergence categories, that could be an epistemic cost of alignment, not just a safety benefit.
  • It motivates more structured metrics — like the Epistemic Compression Score — that try to characterize whether those shifts are compressing or sharpening real differences.

This post stays with the simplest view: "Where does disagreement grow?" The ECS technical addendum is where the formulas live.


What's Next and How to Poke at This

There's a lot still on the board:

  • Full statistical analysis of Run 3 (pairwise correlations, simple ANOVA‑style lineage effects).
  • Linking these behavioral fractures to loss‑landscape properties via tools like PyHessian, where we have checkpoint access.
  • A public replication package (prompts, raw spreadsheets, and an analysis notebook) so others can stress‑test the staircase on their own ensembles.

The important thing for now is that the epistemic instability gradient reproduces across three runs, with different model rosters and an improved stimulus set, and that the non‑Western foil controls behave like clean nulls instead of obvious outliers. That's enough to convince me there is something real to measure here.

Raw spreadsheets for Runs 1–3 are available on request for anyone who wants to dig in. Comments, critiques, or replication attempts are very welcome — adversarial review is part of the method, not an afterthought. See the outreach page to get in touch.


Atlas Heritage Systems Inc. — Endurance. Integrity. Fidelity.