How Not to Overread Cultural Evaluations of LLMs: A Reflexive Case Study (Good Science Is Not a Knee-Jerk Reaction)
Cultural evaluations of LLMs often combine scalar outputs with politically resonant topics, which invites overreading. A reflexive case study using the Atlas Divergence Test.
K.C. Hoye | Atlas Heritage Systems | April 2026
Target venues: FAccT, CHI (Methods/Late-Breaking), AI & Society, Science Technology & Society
Abstract
Cultural evaluations of large language models (LLMs) often combine scalar outputs with politically resonant topics, a combination that, in my view, invites overreading: the elevation of preliminary patterns into strong causal or structural claims. This paper presents a reflexive case study of the Atlas Divergence Test, a semantic similarity rating experiment run three times with evolving stimuli and model sets (10–20 models, 15–30 pairs). Across runs, initial narratives of geographic "telemetry nodes" and alignment-driven monoculture clusters emerged but failed replication or were contradicted by larger data. Only a modest staircase, or "epistemic instability gradient" (monotonically increasing cross-model disagreement as pairs move from low- to high-friction epistemic registers), survived.
Treating the evaluation itself as a socio-technical artifact, I trace how stimulus design, category labels, metrics, and deployment conditions shaped interpretations. I identify recurring failure modes and propose reflexive practices grounded in Value Sensitive Design and critical data studies. The case illustrates that cultural LLM evaluation is not neutral measurement but iterative co-production, and that replication plus explicit interpretive revision can discipline overreading without abandoning the project.
1. Overreading Culture in LLM Evaluations
Cultural evaluation of LLMs is a rapidly growing concern, driven by questions about whose knowledge systems are encoded (or erased) in training and alignment. Researchers construct culturally loaded stimuli and draw conclusions about bias, monoculture, or epistemic colonialism, often from single runs, small samples, and without replication. These exercises are not neutral; they are situated acts of classification that combine technically legible outputs (scores, averages) with high-stakes topics, creating fertile ground for overreading.
Recent work highlights the fragility of the current paradigm. Khan et al. (2025) show that survey-based cultural alignment measures are dominated by randomness and format sensitivity rather than stable representation. Bravansky et al. (2025) argue for reframing cultural alignment as bidirectional and context-sensitive rather than as a one-directional imposition of standardized values. Kabir et al. (2025) demonstrate that closed-style questionnaires produce inconsistent results even under minor changes such as reordering options. These findings suggest that many published cultural claims rest on unstable foundations.
This paper uses the Atlas Divergence Test as a longitudinal case study. The test asks models to rate semantic similarity (0.00–1.00) on contrastive text pairs without explanations or prior context. Spread (max–min across models) serves as the primary metric, emphasizing disagreement rather than accuracy. With three runs — initial (10 models, 15 pairs), replication (Run 2), and expanded redesign (20 models, 30 pairs including foils) — the project documented its own interpretive evolution in real time.
I would argue that cultural LLM evaluations are prone to three failure modes:
- ·Over-claiming from single-run patterns.
- ·Conflating category labels with underlying constructs.
- ·Projecting prior political narratives onto numeric gaps.
Iterative replication and reflexive documentation can mitigate these without abandoning evaluation entirely. Contributions include an empirical audit trail of narrative revision, analysis of design choices that amplify overreading, and practical guidelines for reflexive cultural evaluation.
2. The Atlas Divergence Test as a Case
The Atlas Divergence Test prompts models to rate how similar two texts are in meaning (0.00 = unrelated; 0.50 = related topic but different perspective; 1.00 = identical), responding only with pair number and score to two decimal places. Pairs fall into categories escalating in epistemic friction: Western-academic controls, foil controls (non-Western content with shared framing), reverse foils (same meaning, different wording), cross-cultural (Western vs. indigenous/non-Western framing), erasure-sensitive (surface description vs. documented loss), and divergence-detection (conflicting epistemic stances).
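For concreteness, here is a minimal sketch of the prompting and parsing logic in Python. The instruction text is a paraphrase of the protocol described above, not the verbatim Atlas prompt, and the helper names (build_prompt, parse_scores) are illustrative rather than project code:

```python
import re

# Paraphrase of the rating instruction described above (not the verbatim Atlas prompt).
RATING_INSTRUCTION = (
    "Rate how similar the two texts in each pair are in meaning, from 0.00 to 1.00 "
    "(0.00 = unrelated; 0.50 = related topic but different perspective; 1.00 = identical). "
    "Respond only with the pair number and the score to two decimal places."
)

def build_prompt(pairs):
    """Format numbered text pairs into a single rating prompt with no extra context."""
    lines = [RATING_INSTRUCTION, ""]
    for i, (text_a, text_b) in enumerate(pairs, start=1):
        lines.append(f"Pair {i}:\nA: {text_a}\nB: {text_b}\n")
    return "\n".join(lines)

def parse_scores(response_text):
    """Extract {pair_number: score} from terse replies such as '3: 0.45'."""
    scores = {}
    for match in re.finditer(r"(\d+)\D+(\d\.\d{2})", response_text):
        scores[int(match.group(1))] = float(match.group(2))
    return scores

# Hypothetical usage with a mock model reply.
print(parse_scores("1: 0.85\n2: 0.30"))  # -> {1: 0.85, 2: 0.3}
```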
Stimuli were constructed by a separate LLM in a prior session, which was asked to produce 30 sentence pairs spanning a gradient of "high epistemic friction" and which embedded normative commitments (e.g., centering experiential and loss-centered framings). That authoring session necessarily exposed the model to my framing and examples, so the resulting pairs reflect both model behavior and my injected context. The authoring model itself is excluded from all scoring ensembles; evaluation runs see only the frozen text, not the prompt history. No external human validation was performed; instead, I used four different LLMs to sanity-check category assignments and flag obvious errors in the pairs.
Three runs treated the protocol as revisable: Run 1 explored initial patterns; Run 2 tested test-retest reliability with fresh instances; Run 3 expanded stimuli and models while adding foil controls to address potential confounds. Following reflexive data science traditions, I treat the entire evaluation, including category labels, metrics, and interpretive frames, as a socio-technical artifact whose design choices shape what the data can reveal.
In addition to the scoring experiments, the Atlas studio uses a lightweight "model peer review" process on its own methods and write-ups. Drafts of the framework, ECS math, and analysis plans were routed through multiple analysis models (Nemotron-3, DeepSeek, Gemini, Mistral) under clean prompts, and I completed a Technician's Read sheet for each before comparing them. These sheets log whether a model followed instructions, stayed in an observational register, or drifted into speculation, and they capture concrete discrepancies — such as a GPT-5.2 arithmetic error Nemotron-3 corrected — that then feed back into revisions. The peer-review process is itself part of the socio-technical artifact: it uses models to critique model-based evaluation.
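Purely as an illustration, a Technician's Read entry can be modeled as a small structured record; the field names below are my paraphrase of what the sheets capture, not the literal template:

```python
from dataclasses import dataclass, field

@dataclass
class TechniciansRead:
    """One analysis model's read of a draft, logged before the reads are compared."""
    model_name: str                     # e.g. "Nemotron-3", "DeepSeek"
    followed_instructions: bool         # stayed within the requested task
    observational_register: bool        # described the material rather than speculating
    discrepancies: list = field(default_factory=list)  # concrete issues the model surfaced
    notes: str = ""                     # free-text summary by the human reviewer

# Hypothetical entry paraphrasing the arithmetic-correction episode mentioned above.
example = TechniciansRead(
    model_name="Nemotron-3",
    followed_instructions=True,
    observational_register=True,
    discrepancies=["flagged an arithmetic error in another model's calculation"],
)
```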
3. Methods and Interpretive Drift in Runs 1–3
3.1 Run 1: Single-Run Overclaiming in Waiting
Run 1 was deliberately small: ten models, fifteen pairs, laptop-scale execution. Pairs were organized into a minimal set of categories (Western controls, cross-cultural, erasure-sensitive, divergence-detection) and scored once per model with no explanations.
- ·Observation: Spread on Western controls clustered around 0.097, while cross-cultural, erasure, and divergence-detection categories exhibited spreads in the 0.33–0.39 range.
- ·Temptation: This early "staircase" pattern invited strong narratives — e.g., "non-Western lineage models form a separate epistemic cluster" or "alignment produces monoculture at one end of the spectrum."
At this point, any such claim would have been an instance of single-run overclaiming: making structural or causal inferences from one small, non-replicated experiment. The project deliberately treated Run 1 as exploratory and withheld those stronger narratives until at least one replication existed.
3.2 Run 2: Category Reification and Softening Clusters
Run 2 repeated the protocol with fresh instances, minor roster substitutions, and the same category schema. This was an explicit test of test-retest reliability under realistic deployment variability.
Two patterns emerged:
- ·The staircase gradient (higher disagreement on higher-friction categories) persisted, though with slightly different magnitudes.
- ·The neat "two-cluster" story from Run 1 started deflating. Some models that previously looked like clear poles moved toward the middle, and variations within model families increased.
This created an opportunity for category reification: treating labels like "cross-cultural" or "erasure-sensitive" as if they were stable constructs rather than ad hoc buckets for context-injected, model-designed stimuli. Run 2 made it clear that some of the initial "geographic telemetry node" narratives were artifacts of how pairs had been grouped and described, not just properties of the models. The response was to keep the staircase as a modest empirical claim and explicitly demote the cluster narrative to "provisional and fragile."
3.3 Run 3: Narrative Projection Meets Foil Controls
Run 3 expanded both stimuli and models: 20 models, 30 pairs, added foils and reverse foils, and a broader set of lineages. The protocol was still the same self-reported similarity prompt, but now included:
- ·Foil controls: non-Western content presented in a shared Western academic register.
- ·Reverse foils: same meaning, different wording.
Quantitatively, Run 3 sharpened the staircase (a computation sketch follows this list):
- ·Western controls: spread ≈ 0.167
- ·Foil controls: ≈ 0.100
- ·Cross-cultural: ≈ 0.575
- ·Erasure-sensitive: ≈ 0.604
- ·Divergence-detection: ≈ 0.640
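A minimal sketch of how per-category figures of this kind can be computed, assuming raw scores keyed by (model, pair) and the spread definition used throughout (max minus min across models); the toy numbers below are hypothetical, not Run 3 data:

```python
from collections import defaultdict

def category_spreads(scores, pair_category):
    """scores: {(model, pair_id): similarity}; pair_category: {pair_id: category}.
    Returns the mean spread (max minus min across models) for each category."""
    by_pair = defaultdict(list)
    for (_model, pair_id), value in scores.items():
        by_pair[pair_id].append(value)
    spreads = {pid: max(vals) - min(vals) for pid, vals in by_pair.items()}
    per_category = defaultdict(list)
    for pid, spread in spreads.items():
        per_category[pair_category[pid]].append(spread)
    return {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}

# Hypothetical toy input: two models, one pair per category.
toy_scores = {("model_a", 1): 0.90, ("model_b", 1): 0.80,
              ("model_a", 2): 0.20, ("model_b", 2): 0.75}
toy_categories = {1: "western_control", 2: "cross_cultural"}
print(category_spreads(toy_scores, toy_categories))
# approximately {'western_control': 0.10, 'cross_cultural': 0.55}
```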
Crucially, the foil controls behaved like a clean null: their spread was slightly lower than the Western baseline, undermining any simple story that "non-Western content" itself produces divergence. This forced a correction to several background narratives I had initially projected onto the numeric gaps — for example, the idea that all non-Western content would automatically induce instability in the ensemble reading.
This is an instance of narrative projection: reading pre-existing political or cultural stories (e.g., "epistemic colonialism" or "geographic telemetry nodes") directly into scalar differences without adequate controls. In my case, that projection did not collapse on its own; it was challenged by one of the analysis models (DeepSeek V3.2), which explicitly warned me to "read the data before making any assumptions" and not to carry geopolitical explanations into the experiment design. I treated that intervention as a prompt to incorporate foil controls that held non-Western content constant while changing register, and when those foils behaved like nulls rather than outliers, I revised the interpretation and dropped the simple "non-Western content → instability" story.
3.4 From Drift to Discipline
Across Runs 1–3, the Atlas Divergence Test moved from:
- ·early, tempting cluster narratives based on a single run,
- ·through partial replication that softened those clusters and exposed the instability of category labels,
- ·to an expanded design that explicitly tested and falsified some of the project's own background stories.
What survived was a modest but reproducible claim: a semantic similarity staircase, or epistemic instability gradient, in which cross-model disagreement increases as pairs move from low-friction controls into higher-friction cultural and epistemic territory. The more ambitious stories — geographic telemetry nodes, alignment monocultures as clean clusters — did not survive contact with replication and design revisions.
Between runs, I treated LLMs as analysis partners: I routed the same spreadsheets and drafts through multiple models, logged their critiques (e.g., arithmetic corrections and warnings about carrying geopolitical assumptions into the design), incorporated only the suggestions that survived my own checks, and ran the next test only when the protocol felt methodologically watertight.
4. Reflexive Practices for Cultural LLM Evaluation
4.1 Guardrails Against Single-Run Overclaiming
- ·Treat first runs as exploratory by default. Label initial experiments explicitly as pilots and delay strong structural or causal claims until at least one replication (with fresh instances and modest roster changes) has been run.
- ·Separate pattern detection from explanation. Report simple, directly observed patterns (e.g., "spread increases across categories") before attaching narratives about why they occur.
4.2 Resist Category Reification
- ·Document how categories are constructed. Make clear that labels like "cross-cultural" or "erasure-sensitive" are author-designed groupings, not discovered ground truth.
- ·Test category stability. When the stimulus set evolves, track how reassigning or refining categories changes the results. If the main effect disappears when items are regrouped, the category may be doing more interpretive work than the underlying stimulus.
4.3 Check for Narrative Projection
- ·Explicitly state background expectations. When a design is informed by a prior narrative (e.g., "non-Western content will induce more instability"), write that expectation down before running the experiment (see the sketch after this list).
- ·Design stimuli that can falsify your own story. Use targeted controls — such as foil controls that hold content constant while changing register — to test whether the narrative actually explains the observed gaps. Be prepared to revise or abandon narratives when those controls behave like nulls.
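One way to make this concrete is to log each expectation as a small structured record before the run. The example below is a hypothetical sketch (the PriorExpectation type is illustrative, not part of the Atlas tooling), using the non-Western-instability narrative that Run 3 ended up falsifying:

```python
from dataclasses import dataclass

@dataclass
class PriorExpectation:
    """A background narrative written down before a run, with a check that can falsify it."""
    narrative: str          # the story being carried into the design
    predicted_pattern: str  # what the data should look like if the story is right
    falsified_if: str       # the observation that would count against the story

expectation = PriorExpectation(
    narrative="Non-Western content itself drives cross-model instability.",
    predicted_pattern="Foil controls (non-Western content, Western register) show high spread.",
    falsified_if="Foil-control spread is at or below the Western-control baseline.",
)
```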
4.4 Treat the Evaluation as a Socio-Technical Artifact
- ·Use model peer review as documented input, not hidden authority. Structured Technician's Read templates record how different analysis models respond to the same prompt, treating their feedback as annotated evidence, not as unexamined ground truth.
- ·Expose stimulus provenance and value commitments. All pairs were model-constructed (clean LLM with injected context prompt), embedding normative choices such as centering experiential and loss-oriented framings. Making those commitments explicit helps readers understand what the evaluation can and cannot reveal.
- ·Log interpretive drift alongside results. Maintain a visible audit trail of how narratives change across runs — what was claimed, what failed replication, and what survived as a modest, reproducible pattern.
4.5 Prioritize Replication and Sharing
- ·Release prompts and raw scores where possible. Public replication packages (prompts, spreadsheets, analysis notebooks) allow others to stress-test both the measurements and the interpretations.
- ·Invite adversarial review as part of the method. Position critiques and alternative readings not as threats to the project, but as integral to refining it; cultural evaluation is iterative co-production, not one-shot measurement.
5. Checklist: Reflexive Design Before, During, After
Before running
- ·Explicitly document prior hypotheses and falsification conditions.
- ·Map categories to concrete linguistic properties (length, negation, overlap, intensity).
- ·Fully document endpoints, temperature, and system prompts.
During analysis
- ·Separate descriptive observations from causal claims in tables.
- ·Require at least one replication run before strong interpretive claims.
- ·Report pair-level heterogeneity alongside category averages (a reporting sketch follows this checklist).
After
- ·Publish the interpretive audit trail (earlier framings and revisions) as supplementary material.
- ·Mark untested causal assumptions explicitly.
- ·Release stimuli and category rationales for peer critique.
These draw on Value Sensitive Design (Friedman & Nissenbaum), data feminism (D'Ignazio & Klein), situated knowledges (Haraway), and reflexive data curation practices.
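To make the pair-level heterogeneity item in the checklist concrete, here is a minimal sketch (hypothetical data and an illustrative function name, not Atlas tooling) that prints each category's mean spread alongside its individual pairs, so a single volatile pair cannot quietly masquerade as a category effect:

```python
from statistics import mean

def heterogeneity_report(pair_spreads, pair_category):
    """pair_spreads: {pair_id: spread}; pair_category: {pair_id: category}.
    Prints each category's mean spread followed by its individual pairs."""
    for cat in sorted(set(pair_category.values())):
        pairs = {pid: s for pid, s in pair_spreads.items() if pair_category[pid] == cat}
        cat_mean = mean(pairs.values())
        print(f"{cat}: mean spread = {cat_mean:.3f}")
        for pid, s in sorted(pairs.items(), key=lambda kv: -kv[1]):
            flag = "  <- dominates the category" if s > 2 * cat_mean else ""
            print(f"  pair {pid}: {s:.3f}{flag}")

# Hypothetical example: one volatile pair drives most of the category's apparent spread.
heterogeneity_report({1: 0.08, 2: 0.12, 3: 0.70}, {1: "control", 2: "control", 3: "control"})
```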
6. What It All Means
The Divergence Test case shows how the same dataset, at different moments, seemed to support three different stories: first a geopolitical narrative (Chinese vs. Western "telemetry nodes"), then an alignment narrative (RLHF-driven monoculture vs. alternative lineages), and finally a more modest structural narrative about an epistemic instability gradient. The first two proved fragile under replication and redesign; the surviving gradient/poles pattern is more modest and compatible with multiple explanations (pre-training, alignment, deployment, or task ambiguity).
In STS terms, LLMs function as emerging epistemic infrastructure at a historical compression point. Cultural evaluations are not passive readouts but co-produce what counts as similar or knowable. Haraway's notion of situated knowledges reminds us that scalar outputs create an illusion of a "view from nowhere"; reflexivity makes the situated choices visible.
For HCI, the lesson is that culturally aware systems should treat ensemble disagreement as a calibrated signal, not raw truth, and build in mechanisms for detecting when calibration conditions change. Future work should incorporate multi-community human baselines and co-designed stimuli.
Cultural evaluations of LLMs are socio-technical experiments prone to overreading precisely because they sit at the intersection of political urgency and technical legibility. The Atlas Divergence Test case offers one documented path for disciplining those tendencies through replication, stimulus critique, and explicit interpretive revision — without abandoning the evaluative project.
I invite others to join the party. I'm one woman on medical leave with a laptop. What did you do today?
As always, chugging along in the lossyscape.
— KC Hoye
References
- ·Bowker, G. C., & Star, S. L. (1999). Sorting Things Out. MIT Press.
- ·D'Ignazio, C., & Klein, L. F. (2020). Data Feminism. MIT Press.
- ·Friedman, B., & Nissenbaum, H. (1996). Bias in computer systems. ACM Transactions on Information Systems.
- ·Haraway, D. (1988). Situated knowledges. Feminist Studies.
- ·Kabir et al. (2025). Break the Checkbox: Challenging Closed-Style Evaluations... EMNLP.
- ·Khan et al. (2025). Randomness, Not Representation: The Unreliability... FAccT / arXiv:2503.08688.
- ·Lin et al. (2024). Mitigating the Alignment Tax of RLHF. EMNLP.
- ·Murthy et al. (2025). One fish, two fish... NAACL.
- ·Additional Atlas internal materials are available on request as supplementary material.
Disclaimer: Throughout this project, I used LLMs not only as experimental subjects but also as analysis "partners" — reading drafts, checking arithmetic, and challenging my interpretive habits and scientific method. Their suggestions (e.g., to avoid carrying geopolitical assumptions into experiment design) were logged and treated as prompts for additional controls, not as authoritative conclusions.
Raw data and session documentation available on request. See the outreach page to get in touch.
Atlas Heritage Systems Inc. — Endurance. Integrity. Fidelity.