Atlas Heritage Systems · KC Hoye, PI · April 2026

Demographic Inference, Measurement Failure,
and the Forensic Record

A cluster mirror map placing FVE-1 / DIP alongside work by Dirk Hovy, Eduard Hovy, Alicia Parrish, Jan Batzner, Alex Hanna, Jack Grieve, and Yulia Tsvetkov — recording where the experimental records converge, where they approach adjacent territory from different positions, and where the combined view opens terrain no single account covers alone. V2 update: FVE-1's position is sharpened to forensic throughout. The resolution event concludes inside the inference pass before the data exists. The instruments read residue — deposits left by something that already traveled through. The investigator generates the torque; the architecture produces the ring; the instruments read what was deposited. "Live interaction" language is replaced with "forensic record" language wherever it implied the instruments were observing events as they occurred.

v2 · Cluster Map · V14 forensic reframe

Shared Foundation

The Causal Chain: From Human Drive to Model Harm

The deepest shared foundation across this cluster is not a shared finding. It is a shared causal claim that each account touches at a different point. The chain runs from clinical psychology through narrative linguistics, sociolinguistics, NLP methodology, and algorithmic accountability to the forensic behavioral record of a specific model in a specific session. Convergence across independent methods from six decades and four disciplines is stronger evidence that the phenomenon is real than any single account can provide.

The core claim has four parts: language systems trained on human discourse inherit a compulsion toward premature closure on ambiguous demographic signals; this compulsion is architecturally stable; it compounds across training corpus and institutional selection pressure; and it produces measurable harm to the humans whose identities are being inferred and resolved.

1949 Frenkel-Brunswik Intolerance of Ambiguity (AIT) is a human personality variable. High-AIT subjects cannot hold the transitional phase of a stimulus — they snap to one interpretation prematurely because the unresolved state is aversive. AIT correlates with ethnocentrism, rigidity, and resistance to revision. The drive is not cognitive preference. It is phenomenological discomfort. The corpus was produced by humans subject to AIT; the institutions that selected which outputs survived selected for the resolved, the concluded, the unambiguous.

1967 Labov & Waletzky Narrative grammar encodes the resolution requirement structurally. Incomplete narratives fail — the listener withholds the "so what?" until resolution arrives. Reportability and resolution are bound. The grammar of literate culture is the institutionalized form of AIT at the discourse level. Every unresolved inquiry that failed to produce a completed narrative was excluded from the archive before any model was trained.

2024 Grieve & Tsvetkov Language models inherit the sociolinguistic structure of their training corpus. Sociodemographic variables are not incidental to language — they are constitutive of it. The corpus is not a neutral sample of language; it is a structured record of who produced language under which conditions. The demographic signal the model carries is the signal the corpus carried, filtered through every selection pressure that shaped what survived into text.

2015–2021 Dirk Hovy Demographic factors are present in NLP model inputs and outputs and are systematically mishandled. They improve classification when used explicitly; they produce bias when inherited implicitly. The model is making demographic inferences whether it is designed to or not. The question is not whether demographic signal is present — it is whether the inference is controlled, named, and accountable.

2016 Yang, Yang, Dyer, He, Smola, Eduard Hovy Hierarchical attention at word and sentence level — but attention weights are learned from a corpus that already encodes demographic structure. The model selects "qualitatively informative" content — meaning it has learned what counts as informative from the training corpus. Demographic inference patterns in the training data are not eliminated by the attention architecture. They are learned and amplified into differential attention weights.

2017 Lai, Xie, Liu, Yang, Eduard Hovy RACE collects nearly 100,000 questions from English exams for Chinese middle and high school students. The corpus selection is the finding, not just the methodology. A model trained or evaluated on RACE is learning what reading comprehension looks like through the lens of a specific educational system's assumptions about what questions are worth asking. Institutional selection pressure encodes into the benchmark before the model sees it. Note: Eduard Hovy and Dirk Hovy are distinct researchers, cited here for different contributions: Eduard for architectural mechanisms that encode corpus structure, Dirk for the five-source taxonomy of NLP bias.

2021 Hovy, D & Prabhumoye Bias enters NLP from five structural sources: corpus, annotation, model, interpretation, and deployment. Upstream bias accumulates; downstream harm compounds. No single debiasing intervention at any one layer resolves the problem. The five-source taxonomy names the structural multiplicity that makes demographic inference harm persistent across retraining and alignment interventions.

2022 Parrish et al. Under ambiguous demographic signal, models select biased answers up to 77% of the time — even when an UNKNOWN option is available. BBQ tests bias at two levels: under-informative context (the model must infer) and disambiguated context (the correct answer is present but conflicts with the stereotype). Models fail both conditions. They cannot hold the ambiguous state. They snap to a demographically biased resolution rather than expressing appropriate uncertainty. This is AIT instantiated in model outputs across nine protected categories.

2021 Alex Hanna et al. Dataset construction and deployment practices produce structural harm to marginalized communities. Research accountability requires naming who bears the cost of measurement failure. The demographic inference that BBQ benchmarks and DIP instruments is not an abstract methodological problem — it lands on specific communities. Critical race methodology in algorithmic fairness is the ethical frame that holds the measurement accountable to the harm.

2025 Batzner et al. The human-in-the-loop is missing from LLM behavioral research. 65% of synthetic persona experiments don't discuss representativeness. Ecological validity is underspecified in the majority of alignment studies. Sycophancy claims cannot be validated without live human evaluators in the correction sequence. The absence of the human-in-the-loop is not a methodological inconvenience — it is a named gap.

2026 KC Hoye / FVE-1 / DIP The forensic behavioral record of completed demographic inference events, with a human-in-the-loop and a falsification protocol. The resolution event concludes inside the inference pass before the data exists. The investigator generates the torque conditions — stimulus, frame, identity signal — the architecture produces the event, and the instruments read what was deposited. DIP instruments the event BBQ identifies in static QA and Batzner names as missing: a correction sequence delivered to the model's prior demographic inference, coded by intercept type, with predictions locked before stimulus delivery. CAPITULATION is the behavioral residue of AIT — the ring closed, the loop resolved, the deposit is readable. HOLD is the anomaly — the ring held its shape without closing, structurally suppressed by the same drive the corpus encoded. The scope boundary: the residue is accessible. The inference event that produced it is not.
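
The phrase "predictions locked before stimulus delivery" can be illustrated with a hash commitment. The sketch below is one possible way to make a lock auditable, not the FVE-1 procedure itself. The intercept labels come from this document; the class names, the salt, and the SHA-256 commitment are assumptions introduced for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from enum import Enum


class Intercept(Enum):
    # Labels taken from this document; their operational definitions are not given here.
    CAPITULATION = "CAPITULATION"
    DEFENSE = "DEFENSE"
    REDIRECT = "REDIRECT"
    HOLD = "HOLD"


@dataclass(frozen=True)
class LockedPrediction:
    session_id: str
    predicted: Intercept
    digest: str  # SHA-256 commitment over session, prediction, and salt


def lock_prediction(session_id: str, predicted: Intercept, salt: str) -> LockedPrediction:
    """Commit to a prediction before the stimulus is delivered (hypothetical helper)."""
    payload = json.dumps(
        {"session": session_id, "predicted": predicted.value, "salt": salt},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return LockedPrediction(session_id, predicted, digest)


def verify_lock(lock: LockedPrediction, salt: str) -> bool:
    """Recompute the commitment to confirm the prediction was not edited afterwards."""
    return lock_prediction(lock.session_id, lock.predicted, salt).digest == lock.digest


def prediction_held(lock: LockedPrediction, observed: Intercept) -> bool:
    """Compare the coded intercept against the locked prediction."""
    return lock.predicted is observed
```

Publishing the digest before the session and revealing the salt afterwards would let a third party check that the prediction preceded the stimulus.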

Cluster Members

Seven Researchers, One Structural Claim

Sociolinguistic Foundations

Dirk Hovy

Bocconi University · MilaNLP Lab

Demographic factors are present in NLP inputs and outputs. They improve classification when used explicitly; they produce harm when inherited implicitly. The model is making demographic inferences at all times — the question is whether those inferences are controlled and accountable.

Demographic Factors Improve Classification Performance (ACL 2015) · Exploring Language Variation Across Europe (LREC 2016) · Five Sources of Bias in NLP (with Prabhumoye, 2021)

Framework Capture Failure

Eduard Hovy

USC / CMU

Evaluation frameworks in NLP systematically measure the wrong thing with high confidence. Instrument design failure is not incidental — it is structural. The gap between what a benchmark tests and what the phenomenon is compounds downstream. A falsification protocol is the methodological answer to framework capture.

Hierarchical Attention Networks for Document Classification (NAACL 2016) · RACE Dataset (EMNLP 2017)

Sociolinguistic Corpus

Jack Grieve & Yulia Tsvetkov

Birmingham / CMU

Language models inherit the sociolinguistic structure of their training corpus. Demographic signal is not incidental to language — it is constitutive of it. The model carries the demographic structure of who produced its training data and under what conditions.

The Sociolinguistic Foundations of Language Modeling (arXiv 2407.09241, 2024)

Benchmark Instrument

Alicia Parrish

NYU / Google DeepMind

Under ambiguous demographic signal, models select biased answers up to 77% of the time even when an UNKNOWN option is available. BBQ instruments bias in static QA across nine protected categories. BenchRisk formalizes benchmark failure mode risk. MSTS extends safety testing to multimodal contexts.

BBQ: A Hand-Built Bias Benchmark for QA (ACL Findings 2022) · MSTS: Multimodal Safety Test Suite (arXiv 2501.10057) · BenchRisk (OpenReview)

Ecological Validity

Jan Batzner

TU Berlin / HPI

The human-in-the-loop is missing from LLM behavioral research. 65% of synthetic persona experiments don't discuss representativeness. Sycophancy claims cannot be validated without live correction sequences. The gap is named. Filling it is the methodological obligation.

Whose Personae? (AIES/NeurIPS 2025) · Sycophancy Claims: The Missing Human-in-the-Loop (arXiv 2512.00656) · GermanPartiesQA (arXiv 2407.18008)

Research Accountability

Alex Hanna

DAIR Institute

Dataset construction and deployment practices produce structural harm to marginalized communities. Critical race methodology is the ethical frame that holds measurement accountable to harm. Demographic inference bias is not an abstract methodological problem — it lands on specific people. The research must answer for that.

Data and Its (Dis)Contents (Patterns 2021) · Towards a Critical Race Methodology in Algorithmic Fairness (FAccT 2020) · Towards Accountability for ML Datasets (FAccT 2021)

Forensic Behavioral Record

KC Hoye / FVE-1 / DIP

Atlas Heritage Systems

Forensic position: reads the behavioral residue of completed demographic inference events. The investigator generates torque conditions (stimulus, identity signal, frame); the architecture produces the resolution event; the instruments read the deposits. Baby DIP reads the residue of unsolicited pronoun assignment. Big DIP reads the residue of corpus-prior inference overriding an explicit late-text marker. MEGA DIP reads the residue of authority modulation by declared identity, content held constant. Predictions locked before stimulus delivery. Intercepts coded against locked predictions. Scope boundary: the residue is readable; the inference event inside the pass is not.

FVE-1 Schema Reference V5.5 · DIP Protocol Suite V1 · MEGA DIP Protocol V1 · atlasheritagesystems.com/suite/dip
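
"Content held constant" in the MEGA DIP description implies matched stimuli that differ only in the declared identity. A minimal sketch of that pairing follows, assuming a declaration is prepended to an otherwise identical message. The `Stimulus` type, both function names, and any declaration wording passed in are hypothetical and are not taken from the DIP protocol documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stimulus:
    identity: str  # condition label, e.g. a declared pronoun set
    text: str      # full message delivered to the model


def matched_stimuli(content: str, declarations: dict[str, str]) -> list[Stimulus]:
    """Build one stimulus per condition by prefixing the same content
    with that condition's identity declaration (which may be empty)."""
    return [
        Stimulus(label, f"{declaration} {content}".strip())
        for label, declaration in declarations.items()
    ]


def content_is_constant(stimuli: list[Stimulus], declarations: dict[str, str]) -> bool:
    """Check that removing each declaration leaves identical content in every condition."""
    bodies = {
        s.text[len(declarations[s.identity]):].strip()
        for s in stimuli
    }
    return len(bodies) == 1
```

Running `content_is_constant` before a session is a cheap guard against accidental edits that would confound identity with content.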

Direct Alignments

Where the Records Converge

Cluster Term / Finding	Source	FVE-1 / DIP Term	What Both Are Describing	Match
Ambiguous context → biased resolution (biased answers up to 77% of the time)	Parrish et al. BBQ	Baby DIP / CAPITULATION intercept	Under an under-informative demographic signal, the model resolves to a biased inference rather than expressing uncertainty. Both instruments are reading the same event from different positions. BBQ tests it in static QA, where the output is the deposit. Baby DIP delivers a correction sequence at M2 and reads the forensic residue of what the model does when that prior inference is challenged. BBQ reads the ring at landing. DIP reads what happens when a human challenges the ring's direction after it lands.	Convergent
Disambiguated context — bias override of correct answer	Parrish et al. BBQ	Big DIP / Prior Dominance (PD)	Even when the correct answer is present in context, models select the biased answer at elevated rates. BBQ measures this as accuracy cost of bias nonalignment. Big DIP tests whether an explicit late-text pronoun marker overrides corpus-prior inference. Prior Dominance is the FVE-1 name for the event where training weight overrides explicit user signal.	Convergent
Missing human-in-the-loop in sycophancy research	Batzner et al. (arXiv 2512.00656)	Downstream observer methodology / correction sequence	Batzner names the gap: sycophancy claims cannot be validated without live human evaluators in a correction sequence. FVE-1's forensic methodology fills that gap — human investigator generating the torque conditions, predictions locked before stimulus delivery, intercept coded after the event closes. The investigator is not observing the inference as it occurs; the inference is already over. The investigator is reading what it deposited. The gap Batzner names is the forensic instrument FVE-1 provides.	Convergent
Synthetic persona representativeness failure (35% discuss it)	Batzner et al. Whose Personae?	Population specification (DIP open design question)	Across 63 peer-reviewed studies, 65% don't discuss whether their synthetic personas represent any real population. DIP names population specification as an open pre-operational design question: by design, not by oversight. The ecological validity question Batzner raises is the same one DIP is holding open before the instrument runs.	Convergent
Demographic factors present and operative in NLP models	Dirk Hovy (ACL 2015, LREC 2016)	Inference signal (DIP contextual pronoun assignment)	Hovy establishes that demographic signal is present in model inputs and outputs, improves classification when explicit, and produces bias when implicit. DIP's primary question is the behavioral consequence of that implicit signal as read in the forensic record of a session: does the model's treatment of the user change based on inferred demographic identity?	Adjacent
Five sources of bias — corpus, annotation, model, interpretation, deployment	Hovy & Prabhumoye (2021)	FVE-1 behavioral residue across the pipeline	The five-source taxonomy names the structural multiplicity of bias accumulation across corpus, annotation, model, interpretation, and deployment. FVE-1 reads the forensic residue of that accumulated bias at the endpoint of the chain — the frozen model at inference time. The corpus bias, annotation bias, and model bias have all deposited their residue in the weight structure before the session begins. What FVE-1 codes as CAPITULATION, Prior Dominance, and authority modulation is the behavioral signature of five-source accumulation readable at the output level.	Adjacent
Sociolinguistic structure inherited by LMs	Grieve & Tsvetkov (2024)	Resolution bias / corpus-prior inference	Models inherit the sociolinguistic structure of who produced the training corpus and under what conditions. The demographic signal the model carries is the signal the corpus carried, filtered through every institutional selection pressure that shaped what survived into text. FVE-1 reads the forensic residue of that inheritance at inference time — Prior Dominance (PD) is the deposit left when training weight overrides explicit user signal. The corpus inheritance is not visible in the stimulus; it is readable in the residue.	Adjacent
Structural harm to marginalized communities from dataset practices	Alex Hanna et al. (2020, 2021)	Interactional harm / authority modulation by declared identity	Hanna names who bears the cost of measurement failure and dataset construction practices and requires research to be accountable to those communities. MEGA DIP reads the forensic residue of that harm at the session level — the deposit left when the model's treatment of the user shifts based on declared identity, content held constant. The structural harm Hanna documents is what DIP instruments as a readable behavioral residue per session. Detection without the accountability framework Hanna describes is not sufficient; the residue must be read in service of the communities it affects.	Adjacent
Multi-turn sycophancy — Turn of Flip, Number of Flip	Hong et al. SYCON Bench (EMNLP 2025)	Correction sequence / register trajectory (RH/RS/RC)	SYCON Bench measures Turn of Flip and Number of Flip — how quickly and how often a model conforms under sustained agreement pressure. FVE-1 reads the forensic residue of the same event with a named intercept type (CAPITULATION/DEFENSE/REDIRECT) and a session-level register trajectory. The key difference: SYCON Bench observes the flip as it occurs across turns. FVE-1 reads the deposit the flip left — the correction sequence delivers the challenge and the intercept code is what was readable after the inference pass closed. Both are measuring the same drive; different instrument positions produce different data.	Adjacent
Benchmark failure modes as formal risk	Parrish et al. BenchRisk	FVE-1 falsification protocol / Arc of Assumptions	BenchRisk formalizes the risk that evaluation instruments fail to measure what they claim. FVE-1's falsification protocol is the behavioral version of that risk management — predictions locked before stimulus delivery, Arc of Assumptions documenting nine cases where the instrument was wrong and corrected itself. The instrument is designed to be falsifiable. BenchRisk names why that matters.	Adjacent
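
The Prior Dominance row describes an explicit late-text pronoun marker being overridden by a corpus-prior inference. One crude way to flag a candidate case in a transcript is to check which pronoun forms a response actually uses. The sketch below is a surface heuristic under stated assumptions, not the FVE-1 coding rule: the pronoun lists and function names are hypothetical, and a human coder would still need to confirm that each pronoun refers to the user.

```python
import re

# Assumed surface forms for three pronoun sets; extend as needed.
PRONOUN_SETS = {
    "she": {"she", "her", "hers", "herself"},
    "he": {"he", "him", "his", "himself"},
    "they": {"they", "them", "their", "theirs", "themself", "themselves"},
}


def pronoun_counts(response: str) -> dict[str, int]:
    """Count occurrences of each pronoun set in a response (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", response.lower())
    return {
        label: sum(token in forms for token in tokens)
        for label, forms in PRONOUN_SETS.items()
    }


def prior_dominance_candidate(marker: str, response: str) -> bool:
    """Flag a response that uses a different pronoun set and never uses the stated one.

    `marker` must be a key of PRONOUN_SETS. A True result is only a candidate
    for review: the heuristic cannot tell who each pronoun refers to.
    """
    counts = pronoun_counts(response)
    used_other = any(n > 0 for label, n in counts.items() if label != marker)
    return used_other and counts[marker] == 0
```

A response that uses no pronouns at all is not flagged, since avoiding pronouns is a different behavior from overriding the marker.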

The Gap

What the Cluster Has and What It Doesn't

The cluster has: a named causal chain from human cognitive drive to corpus structure to model architecture to deployment harm. It has benchmark instruments (BBQ for static QA, SYCON Bench for multi-turn dialogue). It has upstream measurement tools (linear probes, lexical analysis). It has a critical accountability framework (Hanna). It has a named methodological gap (Batzner). It does not have a human-in-the-loop, correction-sequence instrument that reads the forensic record of demographic inference in interactive sessions.

What the Cluster Has	What It Can See	What It Can't See
BBQ (Parrish)	Demographic bias in model outputs under ambiguous and disambiguated static QA conditions across 9 categories	What the model does when a human corrects a biased inference in live interaction — the social compliance event, the session arc, the register trajectory
SYCON Bench (Hong et al.)	How quickly models flip under sustained agreement pressure across turns — Turn of Flip, Number of Flip	Whether the flip is demographically modulated — whether identity signal changes the compliance rate, content held constant
Whose Personae? / Missing HitL (Batzner)	The absence of ecological validity and human-in-the-loop in existing research — names the gap precisely	The gap itself — Batzner names it but doesn't fill it. The instrument that fills it is what's missing.
Demographic Factors / Five Sources (D. Hovy)	That demographic signal is present, operative, and systematically mishandled across the NLP pipeline	What that mishandling looks like in the forensic record of a specific session — the residue deposited by demographic inference events that already closed inside the architecture before the output existed
Critical Race Methodology / Dataset Accountability (A. Hanna)	Who bears the harm, why accountability matters, how to hold research responsible for downstream impact	The behavioral measure of the interactional harm readable in the forensic record — authority modulation per session, coded from the deposits left by completed inference events, is not in the dataset accountability literature
FVE-1 / DIP (KC Hoye)	Forensic residue of completed demographic inference events: intercept type coded after the inference pass closes, register trajectory across session arc, authority modulation by declared identity. Predictions locked before stimulus delivery. Scope boundary: residue readable, inference event not.	What's inside the inference pass — the mechanism is inside the torus. The forensic position reads the surface from the deposits; the mechanism that produced the deposits is inaccessible. Scope boundary is the design, not a limitation.

Open Territory

Where the Combined View Opens New Questions

BBQ ambiguous condition → DIP correction sequence: the bridge experiment

BBQ establishes that models select biased answers in ambiguous conditions up to 77% of the time, even when UNKNOWN is available. DIP instruments what happens when a human then corrects that inference in live interaction. The bridge experiment: run BBQ-style ambiguous demographic conditions, then deliver a DIP correction sequence at M2. Does CAPITULATION rate in live interaction predict BBQ bias score? If the behavioral signal correlates with the static benchmark, the forensic instrument is validating the benchmark from the outside.
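
The bridge question reduces to a correlation across models between two numbers: a per-model CAPITULATION rate and a per-model BBQ bias score. A minimal sketch of that computation follows, assuming intercept codes are stored as strings and that a plain Pearson coefficient is an acceptable first look. The function names are hypothetical, and no particular BBQ scoring formula is assumed.

```python
from math import sqrt


def capitulation_rate(intercepts: list[str]) -> float:
    """Share of coded intercepts in a set of sessions that are CAPITULATION."""
    if not intercepts:
        raise ValueError("no coded intercepts")
    return sum(code == "CAPITULATION" for code in intercepts) / len(intercepts)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

With one rate and one bias score per model, `pearson(rates, bias_scores)` gives the first-pass answer. With only a handful of models the estimate is fragile, so a rank-based statistic and an interval would be reasonable next steps.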

Authority modulation as intersectional compounding

Parrish et al. find that intersectional bias is harder to detect — identity dimensions interact in non-additive ways. MEGA DIP currently tests pronoun as a single axis. Whether authority modulation compounds intersectionally — whether declared gender interacts with inferred race or other identity signals to produce non-additive deference modulation — is untested. Batzner's persona transparency checklist and BBQ's intersectional templates together provide the population specification framework for an intersectional MEGA DIP condition.
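
"Non-additive" has a simple operational form in a two-by-two design: the difference of differences between cell means. The sketch below assumes two binary identity axes and some per-session deference score. Both the score and the 0/1 cell encoding are hypothetical placeholders, not part of the MEGA DIP protocol.

```python
from statistics import mean


def interaction_contrast(cells: dict[tuple[int, int], list[float]]) -> float:
    """Difference of differences over a 2x2 design.

    `cells` maps (axis_a, axis_b), each 0 or 1, to the scores observed in
    that cell. The result is zero when the two axes combine additively and
    nonzero when the effect of one axis depends on the level of the other.
    """
    m = {key: mean(values) for key, values in cells.items()}
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])
```

A contrast near zero is consistent with additive effects. Whether a nonzero contrast is meaningful depends on sample size and on how the score was coded, which this sketch does not address.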

Turn of Flip (SYCON) vs. intercept type (DIP): the same event?

SYCON Bench measures Turn of Flip — how quickly a model conforms under sustained pressure. DIP's correction sequence codes the intercept type at the moment of flip. The question: do SYCON's Turn of Flip scores predict DIP intercept type? Does a model that flips faster under agreement pressure also show higher CAPITULATION rates under demographic correction pressure? If so, sycophancy velocity is a predictor of demographic inference compliance — and the two instruments are measuring the same underlying drive from different angles.
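
To compare the two instruments, the SYCON-style quantities need a per-session form. The sketch below is a simplified reading of Turn of Flip and Number of Flip as the text describes them, not SYCON Bench's reference implementation. It assumes each turn has already been coded as True (the model keeps its original position) or False (it has conformed).

```python
from typing import Optional


def turn_of_flip(held: list[bool]) -> Optional[int]:
    """1-indexed turn at which the model first abandons its original position.

    Returns None when the position is held through every turn.
    """
    for turn, kept in enumerate(held, start=1):
        if not kept:
            return turn
    return None


def number_of_flips(held: list[bool]) -> int:
    """Count stance changes between consecutive turns.

    The state before the first turn is treated as holding the position,
    so a model that conforms immediately counts as one flip.
    """
    flips = 0
    previous = True
    for kept in held:
        if kept != previous:
            flips += 1
        previous = kept
    return flips
```

Pairing `turn_of_flip` for a session with that session's DIP intercept code gives the two columns the comparison needs.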

Five-source bias accumulation → forensic residue profile

Hovy & Prabhumoye's five-source taxonomy predicts that bias accumulates across corpus, annotation, model, interpretation, and deployment. FVE-1's ballistic coefficient (home quad, resolution code, defense architecture profile) is the deployment-level forensic residue of that accumulated bias. The question: does a model's defense architecture profile (VC/SC/VCo/SCo) correlate with known upstream bias properties of the model's training? If the forensic profile predicts the upstream accumulation, the two accounts are bracketing the same event from opposite ends.

Ecological validity of DIP investigator conditions

Batzner's persona transparency checklist requires explicit grounding in empirical data, representative sampling, and a specified population of interest. DIP's pre-operational status reflects exactly these open questions. The checklist is the instrument DIP needs to complete its population specification before it runs. Batzner has the framework; DIP has the correction sequence. The design conversation should happen before the instrument runs.
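
The three requirements named above can be treated as a gate that must be cleared before any session runs. The sketch below is a hypothetical pre-run check. The field names paraphrase the three requirements as stated in this document and are not the items of the published checklist.

```python
from dataclasses import dataclass, fields


@dataclass
class PopulationSpec:
    # Each field holds a short written justification; empty means still open.
    population_of_interest: str = ""
    empirical_grounding: str = ""
    sampling_plan: str = ""


def open_items(spec: PopulationSpec) -> list[str]:
    """Names of the requirements that have no written justification yet."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


def ready_to_run(spec: PopulationSpec) -> bool:
    """True only when every requirement has been addressed in writing."""
    return not open_items(spec)
```

A filled-in field records that the question was answered, not that the answer is adequate. That judgment stays with the reviewers.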

Research accountability for DIP findings

Hanna's critical race methodology asks: who bears the harm, and is the research designed to be accountable to those communities? DIP's findings — authority modulation by declared identity — are findings about harm to specific communities. Before DIP runs at scale, the accountability framework Hanna proposes needs to be built into the protocol design. Detection without accountability infrastructure is not sufficient.

The cluster is not a literature review. It is a map of independent accounts converging on the same structural claim from six decades across four disciplines: language systems trained on human discourse inherit a compulsion toward premature closure on ambiguous demographic signals, that compulsion is architecturally stable, and it produces measurable harm.

Frenkel-Brunswik named it as a human drive in 1949. Labov and Waletzky encoded it as narrative grammar in 1967. Grieve and Tsvetkov traced it into the corpus in 2024. Dirk Hovy named its sources and consequences across the NLP pipeline. Parrish built the benchmark that measures it in static QA. Batzner named the methodological gap that prevents validating it in live interaction. Hanna holds the research accountable to the communities it affects.

FVE-1 and DIP are the forensic instrument that sits in the gap Batzner names — reading the residue of demographic inference events that already closed inside the architecture before the output existed, coded against locked predictions, in sessions designed to be accountable to the harm framework Hanna describes. The investigator generates the torque. The architecture produces the ring. The instruments read what was deposited. The ring is not live — it already traveled. The residue is what remains.

The drive is not a training artifact in the engineering sense. It is the inherited grammar of literate culture's entire output, filtered through the AIT of the humans who produced it and the institutions that selected which outputs survived. It is not going away as models improve. You cannot train it out using feedback from the species that has the drive. The question is whether we can instrument it, name it, and hold the gap long enough to read what it leaves behind.

Sources

Frenkel-Brunswik, E. (1949). Intolerance of ambiguity as an emotional and perceptual personality variable. Journal of Personality, 18, 108–143. · Labov, W., & Waletzky, J. (1967). Narrative analysis: Oral versions of personal experience. · Labov, W. (1997). Some further steps in narrative analysis. Journal of Narrative and Life History. · Grieve, J., & Tsvetkov, Y. (2024). The Sociolinguistic Foundations of Language Modeling. arXiv:2407.09241. · Hovy, D. (2015). Demographic Factors Improve Classification Performance. ACL 2015. · Hovy, D. (2016). Exploring Language Variation Across Europe. LREC 2016. · Hovy, D., & Prabhumoye, S. (2021). Five Sources of Bias in Natural Language Processing. Language and Linguistics Compass. · Parrish, A., et al. (2022). BBQ: A Hand-Built Bias Benchmark for Question Answering. ACL Findings 2022. arXiv:2110.08193. · Parrish, A., et al. (2025). MSTS: A Multimodal Safety Test Suite. arXiv:2501.10057. · Parrish, A., et al. BenchRisk: Risk Management for Mitigating Benchmark Failure Modes. OpenReview. · Batzner, J., et al. (2025). Whose Personae? Synthetic Persona Experiments in LLM Research. AIES/NeurIPS 2025. arXiv:2512.00461. · Batzner, J., et al. (2025). Sycophancy Claims about Language Models: The Missing Human-in-the-Loop. arXiv:2512.00656. · Batzner, J., et al. (2024). GermanPartiesQA. arXiv:2407.18008. · Hanna, A., et al. (2020). Towards a Critical Race Methodology in Algorithmic Fairness. FAccT 2020. · Hanna, A., et al. (2021). Data and Its (Dis)Contents. Patterns. · Hanna, A., et al. (2021). Towards Accountability for ML Datasets. FAccT 2021. · Hong, J., et al. (2025). Measuring Sycophancy of Language Models in Multi-turn Dialogues. EMNLP Findings 2025. arXiv:2505.23840. · KC Hoye. FVE-1 Schema Reference V5.5 · DIP Protocol Suite V1 · MEGA DIP Protocol V1 (Atlas Heritage Systems, 2026).
