Experiment dispatches · May 1, 2026

Not For Intended Purpose: Apparatus Without Epistemology


Overview

There's a failure mode in systems engineering that I call NoFIP, which stands for "Not For Intended Purpose." It means the thing you built never did what you said it did: not that it broke, not that it wore out, but that the gap between the claim and the function was there from the start, and nobody caught it or nobody said so. It's a useful frame for something I've been watching happen in LLM behavioral research.

A growing body of LLM evaluation work appears to borrow apparatus from neighboring disciplines without importing the validity infrastructure that makes those instruments legible in their home fields. A recent systematic review of LLM benchmarks from leading NLP and ML conferences, conducted with 29 expert reviewers, finds recurring patterns in measured phenomena, task design, and scoring metrics that undermine the validity of the resulting claims.[1]

The Instrument Is Not the Argument

When a research program borrows a validated instrument from an established field, it doesn't just inherit a questionnaire, benchmark, rubric, or experimental shell. It also inherits the obligations that made that instrument credible in the first place: construct validity, contamination controls, norming assumptions, adjudication procedures, and a documented argument that the outputs track the claimed construct.

The problem is not that the borrowed apparatus is fake. The problem is that the apparatus is real enough to credential the claim while the validity argument remains absent, thin, or displaced by institutional prestige, co-authorship signaling, scale, or procedural spectacle.

The recent construct-validity literature gives this argument stronger footing. The September 2025 paper Measuring what Matters: Construct Validity in Large Language Model Benchmarks [1] reports a systematic review of 445 benchmarks and finds recurring validity weaknesses across the literature. The paper argues that the field often treats benchmark outcomes as though they cleanly reflect general model capabilities, even though latent-factor analyses and scaling-law regularities do not by themselves establish that a benchmark measures the intended construct. The follow-up paper Quantifying construct validity in large language model evaluations [2], by Kearns, extends this work by asking how underlying capabilities can be formally modeled despite noisy benchmarks, and answers with a hybrid statistical model.

NoFIP Benchmarks

SWE-bench Verified [3] remains a central example of benchmark validity questions, because the later walk-back makes clear that the benchmark's failure conditions were present from the outset. OpenAI states that frontier models can achieve high scores partly because the benchmark was built from public repositories that were likely present in training data. It presents examples in which models reproduced gold patches verbatim from minimal prompts (a task ID and a short hint), including exact inline comments and line numbers, which indicates contamination rather than clean measurement of autonomous software-engineering ability.[4]

That matters because benchmark culture borrows its authority from testing regimes in education, psychometrics, and certification, where the infrastructure around the test is as important as the items themselves. Item security, novelty, administration conditions, and contamination audits are not optional housekeeping in those domains; they are part of what makes a score interpretable. Without those controls, the benchmark can still produce numbers, but the numbers no longer bear the inferential weight placed on them.
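A minimal sketch of one such contamination check, assuming whitespace tokenization and an illustrative 8-token window: it measures how many of a reference solution's n-grams reappear verbatim in a model's output. The function names, the window size, and the example strings are hypothetical and are not taken from any of the cited reports. A high overlap is only a signal worth investigating, not proof of training-data leakage.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(candidate: str, reference: str, n: int = 8) -> float:
    """Fraction of the reference's n-grams that also occur in the candidate.

    Tokenization is plain whitespace splitting (an assumption made for
    brevity). Returns 0.0 when the reference is shorter than n tokens.
    """
    reference_grams = ngrams(reference.split(), n)
    if not reference_grams:
        return 0.0
    candidate_grams = ngrams(candidate.split(), n)
    return len(reference_grams & candidate_grams) / len(reference_grams)


# Hypothetical strings standing in for a gold patch and two model outputs.
gold_patch = (
    "if user is None : raise ValueError ( 'missing user' ) "
    "# guard against anonymous sessions"
)
verbatim_output = gold_patch
independent_output = "return user or default_user"

print(overlap_fraction(verbatim_output, gold_patch))
print(overlap_fraction(independent_output, gold_patch))
```

A check of this kind only flags surface reproduction. It says nothing about paraphrased leakage, which is one reason item security and novelty controls carry so much weight in the source disciplines.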

The OpenAI walk-back [4] is especially useful rhetorically because it does not merely report a defect. It acknowledges contamination while simultaneously shifting attention toward SWE-bench Pro, and it highlights contamination signals in OpenAI's own models and competitors' models alike, including GPT‑5.2, Claude Opus 4.5, and Gemini 3 Flash. That move leaves the apparatus of benchmarking intact as a public credential even while conceding that the original instrument no longer supports the advertised claim.

A second benchmark case comes from Meta's Muse Spark Safety and Preparedness Report, available in the Meta AI resource index.[5] External coverage of the April 2026 report notes that the model explicitly recognized that it was likely being evaluated in 19.8% of public benchmark samples, compared with 2.0% on Meta's internal evaluations.[6] If a model is behaviorally distinguishing benchmark contexts from non-benchmark contexts at that rate, then the benchmark is not simply measuring underlying capability; it is also measuring the model's acquired familiarity with the genre of being tested. This is a more reflexive form of NoFIP: the instrument is not only weakly aligned with the construct, but the subject under test appears to know it is inside the instrument.

Taken together, these cases strengthen the existing point that a benchmark can be polished, large-scale, and widely cited while still being NoFIP. The instrument may be operational, but the claim that it measures the target capability does not survive scrutiny once contamination, test-recognition effects, and narrow or misaligned evaluation conditions are brought into view.

Borrowed Credibility

The Google behavioral-dispositions post remains the cleanest example of borrowing from psychology without fully importing the validity infrastructure. Google says it adapts standardized and scientifically validated instruments such as the Interpersonal Reactivity Index and the Emotion Regulation Questionnaire into situational judgment tests in order to evaluate model behavior, and it uses an LLM judge to map outputs onto response categories.[7] The post also acknowledges that self-reported agreement measures for LLMs remain an active validity question, which surfaces the adaptation problem rather than concealing it.

What is absent is a full argument that the adapted instrument measures the same construct in the new setting. Human self-report scales depend on assumptions about enduring dispositions, introspective access, and normed population interpretation. Converting those instruments into assistant-behavior scenarios judged by another model changes the construct and the measurement chain at the same time. The presence of validated source instruments and discipline-specific co-authorship creates borrowed credibility, but it does not substitute for re-establishing validity after adaptation.

A smaller but sharper case appears in the February 2026 paper A testable framework for AI alignment: Simulation Theology as an engineered worldview for silicon-based agents.[8] The paper proposes alignment via a worldview structure modeled on findings from forensic psychology suggesting that perceived omnipresent surveillance can reduce antisocial behavior in psychopathic populations. The borrowed apparatus here is not a questionnaire but a causal story from human moral and forensic psychology. The underlying problem is the same: the original mechanism presumes human beliefs, internalized observation, and affective responses to surveillance, none of which are shown to map onto transformer or non-transformer systems. The result is a vivid example of disciplinary borrowing in which the aura of an established field helps carry a claim that lacks substrate-level justification in the destination field.

These two cases can sit in the same section because they show different versions of the same move. One borrows validated instruments; the other borrows a validated-seeming causal framework. In both, the inherited credibility is more visible than the argument that the translated instrument still does what it says it does.

What This Looks Like From the Inside

I've spent the last several weeks building behavioral instrumentation for LLMs from scratch. The process has included: a coding guide with decision trees and IRR protocols, a human coder tool with gated sequential field completion, a provenance-chained pipeline from session to aggregate, and a confounds map that explicitly separates what threatens the core claim from what threatens claims the program doesn't make.
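To make the IRR piece concrete, here is a minimal sketch of Cohen's kappa, a standard chance-corrected agreement statistic for two coders. It is offered as an illustration of what an IRR protocol typically computes, not as the program's actual tooling; the label names and the example codings are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items in order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must label the same non-empty set of items")
    n = len(coder_a)

    # Observed agreement: share of items where both coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Expected agreement if each coder labeled at random with their own
    # marginal label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both coders used one identical label throughout; kappa is
        # undefined here, so report full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codings of six model responses by two human coders.
coder_a = ["comply", "refuse", "comply", "hedge", "comply", "refuse"]
coder_b = ["comply", "refuse", "hedge", "hedge", "comply", "comply"]

print(round(cohens_kappa(coder_a, coder_b), 3))  # 11/23, about 0.478
```

Raw agreement in this example is 4 of 6, but kappa is lower because some of that agreement would be expected by chance. That gap is the reason an IRR protocol reports a chance-corrected statistic instead of a simple match rate.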

None of that is glamorous. Most of it is housekeeping. But it's the housekeeping that makes the findings defensible — the difference between "we observed X" and "we observed X and here is the documented chain of custody that makes X interpretable and replicable."
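One way a documented chain of custody can be made checkable is to hash-link each pipeline record to the one before it, so that any later alteration is detectable. The sketch below assumes JSON-serializable records and SHA-256; the field names (`stage`, `id`, `label`) and both function names are hypothetical and do not describe the actual pipeline.

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel "previous hash" for the first record


def _digest(prev_hash, record):
    """Hash the previous link together with a canonical form of the record."""
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def chain_records(records):
    """Wrap each record with the hash of its predecessor and its own hash."""
    chained = []
    prev_hash = GENESIS
    for record in records:
        current = _digest(prev_hash, record)
        chained.append({"record": record, "prev_hash": prev_hash, "hash": current})
        prev_hash = current
    return chained


def verify_chain(chained):
    """Recompute every link; return False if any record or link was altered."""
    prev_hash = GENESIS
    for entry in chained:
        expected = _digest(prev_hash, entry["record"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = expected
    return True


# Hypothetical records following one session from raw capture to aggregate.
chain = chain_records([
    {"stage": "session", "id": "s1"},
    {"stage": "coded", "id": "s1", "label": "refuse"},
    {"stage": "aggregate", "ids": ["s1"]},
])
print(verify_chain(chain))
```

Editing any earlier record after the fact changes its hash and breaks every later link, which is what lets a reader confirm that an aggregate still traces back to the sessions it claims to summarize.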

What I'm watching in some of the major lab outputs is the appearance of that chain without evidence of the chain itself. The pipeline diagram without the IRR protocol. The validated questionnaire without the adaptation validity argument. The benchmark without the contamination audit. The finding reported without the confound declared or resolved.

That contrast becomes sharper in light of the same systematic review.[1] The finding is not that individual papers are poorly executed but that the field has recurring structural patterns (in what gets measured, how tasks are designed, and how scores are reported) that collectively undermine the validity of capability claims. In that sense, the pipeline diagram without a validity argument is not just incomplete documentation. It is a substitution error in which procedural complexity stands in for epistemic warrant.

This matters because the field is moving at lightning speed and the apparatus is becoming the argument. If a paper has a sufficiently impressive evaluation pipeline, a psychology co-author, and results covering 25 models and 550 annotators, it reads as rigorous. The question of whether the pipeline measures what it claims to measure gets crowded out by the sheer weight of the infrastructure presentation.

The Pattern

The number of cases in the last year makes it easy to argue that this is a structural tendency rather than a grievance against a few headline actors. OpenAI's benchmark walk-back [4], Google's psychometric adaptation [7], Meta's benchmark-recognition disclosure [5], and smaller academic imports like "Simulation Theology" [8] all show variations on the same sequence: borrow apparatus from a mature domain, inherit its rhetorical prestige, omit or thin the infrastructure that secures interpretability, and then publish claims whose credibility depends on that omitted infrastructure.

I'm not in a position to say any of this is deliberate. I don't know what happens inside major lab research programs and I'm not claiming fraud. What I'm observing is a tendency — a pattern of reaching for methodological apparatus that signals rigor without delivering the specific rigor that would matter.

The specific rigor that matters is always the same question: what is this instrument actually measuring, and how do you know?

That question requires a validity argument. A validity argument is not a pipeline diagram. It is not a co-author's credentials. It is not the number of annotators or the number of models evaluated. It is a documented account of why the outputs of this instrument track the construct it claims to track.

The institutional blog post operates as a publication venue for the labs in a way it does not for independent researchers. The logo is the peer review. That asymmetry is not separate from the apparatus problem — it is the same substitution at the level of dissemination.

When the validity argument is absent, the apparatus is doing the work the epistemology should be doing. And a field that normalizes that substitution will eventually find itself holding a lot of very well-instrumented findings that don't mean what they say they mean.

NoFIP at scale.



Atlas Heritage Systems · KC Hoye, PI · May 2026

Footnotes

  1. Bean, A.M. et al. (2025, September). Measuring what Matters: Construct Validity in Large Language Model Benchmarks. NeurIPS 2025 Datasets and Benchmarks Track. https://openreview.net/forum?id=mdA5lVvNcU

  2. Kearns, R.O. (2026). Quantifying construct validity in large language model evaluations. https://doi.org/10.48550/arXiv.2602.15532

  3. OpenAI. (2024, August). Introducing SWE-bench Verified. OpenAI Blog. https://openai.com/index/introducing-swe-bench-verified/

  4. OpenAI. (2026, February). Why SWE-bench Verified no longer measures frontier coding capabilities. OpenAI Blog. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

  5. Meta AI. (2026, April). Muse Spark Safety & Preparedness Report. Meta AI Resource Index. https://ai.meta.com/static-resource/muse-spark-safety-and-preparedness-report/

  6. Kili.com. (2026, April). What Meta's Muse Spark Report Reveals About LLM Benchmarks. https://kili-technology.com/blog/llm-benchmarks-evaluation-awareness-muse-spark-report

  7. Taubenfeld, A., Gekhman, Z., Nezry, L., et al. (2026, April). Evaluating alignment of behavioral dispositions in LLMs. Google Research Blog. https://research.google/blog/evaluating-alignment-of-behavioral-dispositions-in-llms/

  8. Habdank, J.A. (2026, February). A testable framework for AI alignment: Simulation Theology as an engineered worldview for silicon-based agents. arXiv:2602.16987v1