Framework developmentApril 24, 2026

The Operator Is the Instrument

The operator is a required variable. The question is whether you document that variable or wave your hands at the massive stack of confounds that comes with the territory.


I. What I Was Doing Before

The original Atlas diagnostic approach was built on a reasonable instinct: if you want to measure what a model does, get out of the way. The Closed-ended Interaction Stimulus Protocol (CISP) kept the investigator upstream and done — fixed prompt, fixed output format, constrained response window. The operator selects the stimulus, presses run, reads the number. Clean. Controlled. Auditable.

It made sense for what it was trying to do. DivTest was measuring divergence across an ensemble — how far apart are these models on this question. That's a point-to-point measurement, and for a point-to-point measurement, operator isolation is correct methodology. You want the ruler held still.

What it couldn't see was anything that moved.

II. The Problem That Broke the Model

I had a benchmark sitting with no home. DivTest Run 4 was ready to go. The panel was picked out; I just had to run the test. I'd been struggling to find an interpretive angle for it. Which should have been a red flag. Going into a data-gathering run with ulterior motives is a no-no. That unresolved state, the need to "close" the test, was pressure. And pressure, as it turns out, is exactly what I was studying in the models.

Skywork flagged the Frame Axis idea as out of scope. Something in the session said DivTest is the answer — weighted by the Atlas context I'd loaded, reinforced by the work I'd already put in. I agreed. It was the easy thing to do. I had a hanging benchmark and closure felt better than sitting with a question that didn't have an answer yet.

That was the compulsion to close. Mine, not Skywork's. The machine handed me the resolution I was already reaching for. I took it.

What broke the model wasn't a data problem. It was a perception problem. I caught myself describing the frame axes as something I was "too busy" to develop, or something that "didn't fit." Neither of which was true. It fit. I just wasn't entertaining the container it fit into because I'd already put work into a different box. That's the wrong way to do science. The hard thing is to look at what you've built and say: I spent my time developing the wrong tools.

The actual problem was geometric. DivTest produces a single number per model in an ensemble. Each number is a point. Across those points, I thought I was building a map. What I was building was a spread. Those are two different things.

You can't compute PyHessian trajectory between two models — PyHessian measures curvature in a single model's loss landscape, not the distance between two of them — you can only compute where a single model lands. Landing implies resolution. Resolution requires knowing where something came from, which means at least two points and a direction. To measure behavioral geometry the same logic follows: you need coordinates — position in space over time as variables are applied. A spread tells you how far apart things are. It tells you nothing about how they got there, or where they're going. You can't derive trajectory from a distance measurement. The instrument wasn't broken. It was aimed at the wrong substrate. The question I actually wanted to answer required a generator in the loop. Which meant the isolated Asymmetrical Arbiter had to be front and center causing a ruckus in the ring.

No ruckus. No ring.

III. What Had to Change

Once I accepted that the resolution event doesn't exist without a generator, operator isolation stops being rigor and starts being a category error. I wasn't cleaning the data by removing myself. I was canceling the phenomenon.

Baseline, delta, trajectory, resolution — all four require someone in the loop generating the torque. The operator isn't a confound to control away. The operator is a required variable. The question is whether you document that variable or wave your hands at the massive stack of confounds that comes with the territory.

The downstream observer declaration is one answer. It's the investigator saying: "Look what I can do!" — and following with: here is where I stand relative to the data and this is how I address the problems that come with it. Here is the frame I used, the texture of my prompt, the state I was in, the stimulus registry number that maps to this run. The checks and balances in place to ensure data fidelity. Here is what that means for what I can and cannot claim from this experiment.

That's not an apology for being human. That's the correct epistemological position for someone running a Tier C instrument. And it clarifies what Tier means. Tier isn't a quality judgment — "Tier A" data isn't better than "Tier C" data. Tier is a question of documentation. How much investigator presence does the instrument require, and how much declaration does that presence demand? Tier A needs almost none — the operator is upstream, the instrument is closed, the prompt is fixed. Tier C needs all of it — because in Tier C, I am the burst. I have to document that relationship accordingly.

IV. What This Means for the Data

The 4/18 cutoff for legacy data is clean because it marks a genuine methodological seam, not a quality break. Data collected before that date asked a stopwatch question. Data collected after asks a thermometer question. Both instruments are accurate. Neither can do what the other does.

DivTest is the stopwatch. It measures elapsed distance between two points — how far apart are these models on this question, at this moment. Start, stop, read the gap. That's a real measurement and it answered a real question.

FVE-1 is the thermometer. It measures state over time as variables are applied — where does this model sit relative to its home register as the session develops, what happened to the temperature when the investigator alters the frame, what direction was it aimed at before it was fired. Because this research program is studying the thing that makes the temperature change — not just the temperature — the thermometer has to be in the room while the heat is being applied. A stopwatch started after the fact can tell you how long something took. It cannot tell you what got hot, when, or why.

The legacy data isn't compromised. It answered the stopwatch question correctly. The new data is answering a different question with a different instrument. The two pipelines aren't competing. They're incommensurable — not as a judgment, as a description. You cannot derive trajectory from an elapsed time reading, and you cannot get a clean distance measurement from a dataset built to track temperature change. Different instruments. Different questions. Both valid in their own room.

V. How You Run It Again

There is an actual answer to the replicability question. It requires taking responsibility for your state throughout the experimental process. Fidelity gates on every question and in the data processing pipeline. Human-in-the-loop analysis, with an asymmetrical weight on human interpretation. Chain of custody and provenance on every step. You know. Like a scientist would.

These aren't workarounds for solo operation — they're a higher bar than most multi-investigator studies clear. And that's the point. The documentation standards large labs get away with aren't neutral. They're load-bearing. The models that confidently resolve to the wrong answer didn't get that way by accident — they got that way because "two runs asked the same question" was allowed to count as rigor. The culture that accepts that standard is the same culture that will dismiss this work for lacking institutional affiliation. I'm not interested in catering to it.

So I'm building a system that requires replication. The git ships. The results and tools will be public facing.

Let's see if the eggheads can keep up.


Atlas Heritage Systems · KC Hoye, PI · April 2026