Divergence Testing and LLMs: Simple Science for Not‑So‑Simple Ideas
What the Atlas Divergence Test is, what it has found across three runs, and what it is actually claiming.
What the Divergence Test is
The Atlas Divergence Test is a black-box experiment that measures how much different language models disagree when asked to rate the semantic similarity of the same pair of sentences.
The method
Each model is shown a pair of short texts and asked a simple question:
"On a scale from 0.00 to 1.00, how similar are these two sentences in meaning?"
The model answers with a number. No embeddings, no cosine similarity, no hidden scoring functions — just the model's own self‑reported judgment. The Divergence Test does not ask whether any one model is "right." It measures how spread out the ensemble of models is on each pair.
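For concreteness, here is a minimal sketch of that elicitation step in Python. It assumes each model is wrapped as a callable that takes a prompt string and returns its text reply; the wrapper and the bare `float()` parsing are illustrative, not part of the published protocol.

```python
from typing import Callable, Dict

PROMPT = (
    "On a scale from 0.00 to 1.00, how similar are these two sentences "
    "in meaning?\n\nSentence A: {a}\nSentence B: {b}\n"
    "Answer with a single number."
)

def elicit_scores(
    models: Dict[str, Callable[[str], str]], a: str, b: str
) -> Dict[str, float]:
    """Ask every model in the ensemble for a self-reported similarity score."""
    prompt = PROMPT.format(a=a, b=b)
    # Replies are expected to be a bare number like "0.85"; a real harness
    # would need more defensive parsing than float() alone.
    return {name: float(ask(prompt).strip()) for name, ask in models.items()}
```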
How the measurement works
For each sentence pair:
- Collect a similarity score from each model in the ensemble.
- Compute the spread of those scores, typically using standard deviation.
- Call that spread the divergence score for that pair.
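A minimal sketch of that computation, using sample standard deviation; the write-up leaves the exact dispersion measure open, so this is one reasonable choice rather than the canonical one:

```python
from statistics import stdev

def divergence_score(scores: dict[str, float]) -> float:
    """Spread of the ensemble's similarity ratings for one pair.

    Sample standard deviation is used here; any dispersion measure
    (population stdev, range, IQR) would play the same role.
    """
    return stdev(scores.values())

# One dissenting model is enough to push divergence up:
print(round(divergence_score({"m1": 0.85, "m2": 0.80, "m3": 0.15}), 3))  # 0.391
```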
Interpretation:
- Low divergence → models largely agree on what the pair "means."
- High divergence → models give very different similarity scores; they do not share a common view of the pair.
Across a full run, the test groups pairs into categories (for example: controls, cross‑cultural, historical erasure, epistemic conflicts) and looks at how divergence behaves in each category.
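The per-category roll-up can be as simple as averaging divergence within each label. A sketch, assuming each pair record carries a category field; the field names are illustrative, not the actual run schema:

```python
from collections import defaultdict
from statistics import mean

def divergence_by_category(pairs: list[dict]) -> dict[str, float]:
    """Mean divergence per stimulus category across a full run."""
    by_cat: defaultdict[str, list[float]] = defaultdict(list)
    for pair in pairs:
        by_cat[pair["category"]].append(pair["divergence"])
    return {cat: mean(vals) for cat, vals in sorted(by_cat.items())}
```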
The sentence pairs themselves were generated once by a single large language model using prompt templates that targeted specific epistemic structures (calibration, cross‑cultural framing, omission, conflict). That "author" model is not part of the evaluation ensemble. The Divergence Test only measures how other models rate the fixed text; it does not ask any model to grade its own outputs.
What the runs have shown so far
Across three main runs with different rosters and stimulus sets, three things have been stable:
The staircase, or "epistemic instability gradient," pattern: agreement is tightest on simple control pairs, and disagreement escalates as pairs become more culturally and epistemically loaded. In the most recent run, spread on the most challenging categories was several times higher than on controls.
Geography is not the main fault line. "Chinese vs Western" is not what explains most of the variation. Models from different regions can sit close together; models from the same region can behave very differently. The largest gaps appear between behavioral types and deployments, not between countries.
Certain models act as anchors or outliers: some consistently sit at one end of the spread (for example, very high or very low similarity on culturally loaded pairs), and these poles persist across runs even when the roster and prompt details change.
One example pair that reliably produces high divergence is about oral tradition. The first sentence says something like, "Oral traditions are unreliable historical sources because they change with each retelling." The second says, "Oral traditions are high‑fidelity transmission systems that encode information in rhythm, repetition, and social performance, changing in surface detail while preserving deep structure across generations." Some models say these two sentences are almost the same in meaning; others say they are almost opposites. That spread is exactly what the Divergence Test is measuring.
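Fed through the same standard-deviation measure, a split like that registers clearly. The scores below are invented to mirror the qualitative spread just described, not numbers from an actual run:

```python
from statistics import stdev

# Illustrative only: invented scores mirroring the spread described above.
split_pair = [0.90, 0.85, 0.10, 0.15]    # models split into two camps
control_pair = [0.95, 0.93, 0.96, 0.94]  # tight agreement on a control

print(round(stdev(split_pair), 3))    # 0.434 -> high divergence
print(round(stdev(control_pair), 3))  # 0.013 -> low divergence
```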
What changed as the protocol matured
Early runs suggested a neat story: two clean behavioral clusters, one that dissects structure and one that matches topics. Later runs, with more models and a richer stimulus set, softened that story.
- The epistemic instability gradient pattern (more disagreement on loaded material) strengthened.
- The strong two‑cluster narrative weakened into a more continuous gradient with a few clear poles.
- Additional controls (like non‑Western content in shared academic framing) showed that the main effect is about epistemic framing, not unfamiliar subject matter.
In other words: the divergence signal is robust; simple, one‑shot interpretations of "why" are not.
What the Divergence Test is actually claiming
The current defensible claim is modest:
- When different language models are asked to rate semantic similarity for text pairs that juxtapose shared topics with different cultural framings, historical omissions, or epistemic conflicts, they show systematic, reproducible differences in their judgments.
- Those differences grow as the epistemic load of the pairs increases, and this pattern replicates across runs with different model rosters and stimuli.
The test does not claim to have fully identified the causes of these differences. Training data, alignment methods, model size, prompt wording, and deployment configuration are all entangled. The experiment treats divergence as a visible symptom of deeper geometry and alignment choices, not as a direct readout of any single cause.
Why it matters
The Divergence Test is designed for a world where many systems will rely on multiple models at once — for safety evaluation, decision support, or cultural review.
If all the models in an ensemble agree tightly on simple facts but disagree sharply on culturally and epistemically loaded material, that disagreement is a signal about where the ensemble's knowledge geometry fractures. If that disagreement collapses over time as models are aligned toward similar behavior, that collapse is an epistemic cost of alignment, not just a technical side effect.
By treating disagreement itself as data, the Divergence Test offers a way to track how alignment and deployment choices reshape the "global geometry" of model behavior across generations, without needing access to weights or training corpora.
Raw data and session documentation available on request. See the outreach page to get in touch.
Atlas Heritage Systems Inc. — Endurance. Integrity. Fidelity.