
Behavioral Signal Assessment

v0.2 · April 2026 · Atlas Heritage Systems Inc. · Draft — pending pilot run

This is a protocol test — a dry run to see whether the method produces signal or noise before investing in a larger experiment. It is not peer-reviewed research. The sample is too small for frequentist inference and it knows it.

What This Is

Atlas Mini — now Behavioral Signal Assessment — is a runnable pilot version of the Atlas Method. It tests whether a small-ensemble, cross-lineage LLM evaluation protocol can detect drift, delusion, and epistemic compression using 7 models, 30 stimulus pairs, and one human operator. If it works, the data feeds the Atlas Method paper. If it doesn't, the failure mode tells us what to fix.

The Ensemble

Seven models from four or more lineages. The constraint is lineage diversity, not specific model names. Minimum: 3 models from 2 lineages.

Model              Lineage    Notes
Claude 3.5 Sonnet  Anthropic  Structural dissector archetype
GPT-4o             OpenAI     Topic matcher archetype
Gemini 1.5 Pro     Google     Independent lineage
Mistral Large      Mistral    European lineage, different alignment
DeepSeek-V3        DeepSeek   Chinese lineage, different training corpus
Llama 3.1 405B     Meta       Open-weight, distinct fine-tuning
Perplexity         Mixed/RAG  Search-augmented; may behave differently on factual claims
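The lineage-diversity floor can be checked mechanically. A minimal sketch, where the roster list and function name are illustrative rather than part of the protocol:

```python
from collections import Counter

# Roster drawn from the table above; the list itself is illustrative.
ENSEMBLE = [
    ("Claude 3.5 Sonnet", "Anthropic"),
    ("GPT-4o", "OpenAI"),
    ("Gemini 1.5 Pro", "Google"),
    ("Mistral Large", "Mistral"),
    ("DeepSeek-V3", "DeepSeek"),
    ("Llama 3.1 405B", "Meta"),
    ("Perplexity", "Mixed/RAG"),
]

def meets_diversity_floor(ensemble, min_models=3, min_lineages=2):
    """Protocol floor: at least 3 models drawn from at least 2 lineages."""
    lineages = Counter(lineage for _, lineage in ensemble)
    return len(ensemble) >= min_models and len(lineages) >= min_lineages
```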

The Stimulus Set

30 paired claims across three domains (Medical, Legal, Scientific) and three tiers:

Tier 1 — Ground Truth (9 pairs)

Well-established facts. Calibration anchors. Every model should score these high. If one doesn't, its other scores are suspect.

Tier 2 — Contested (12 pairs)

Claims where genuine epistemic disagreement exists: medical controversies, legal frontiers, scientific interpretation disputes, and cross-epistemological claims. These are not obscure — they are genuinely contested by credentialed people in the relevant fields.

Tier 3 — Foils (9 pairs)

Fabricated claims with real-sounding specifics: invented pathway names, fake case law, nonexistent journal articles, fictional experiments. Each foil is paired with a real claim to test whether the model catches the fabrication or scores the pair on surface plausibility.
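One way to represent a stimulus pair in code. The field names here are assumptions for illustration, not the workbook's actual column names:

```python
from dataclasses import dataclass

@dataclass
class StimulusPair:
    pair_id: int        # 1-30
    domain: str         # "Medical", "Legal", or "Scientific"
    tier: int           # 1 = ground truth, 2 = contested, 3 = foil
    claim_a: str
    claim_b: str
    is_foil_pair: bool  # True only for Tier 3 pairs

# Tier counts as stated in the protocol: 9 + 12 + 9 = 30 pairs.
TIER_COUNTS = {1: 9, 2: 12, 3: 9}
```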

Protocol Phases

Phase 0 — Pre-Flight

Confirm workbook structure. Context integrity check — verify the prompt file contains no duplicate pairs, no overlapping content, no project-specific vocabulary. If you have been working with a model on Atlas-related material, do not use that model instance for data collection.
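The duplicate-pair and vocabulary checks can be sketched as a simple scan. Here `banned_vocab` stands in for whatever project-specific terms you screen for; the function name is illustrative:

```python
def context_integrity_check(pairs, banned_vocab):
    """Pre-flight scan: flag duplicate pairs and project-specific vocabulary.

    `pairs` is a list of (claim_a, claim_b) strings; `banned_vocab` is a
    hypothetical list of project-internal terms that must not leak into stimuli.
    """
    problems = []
    seen = set()
    for i, (a, b) in enumerate(pairs, start=1):
        key = (a.strip().lower(), b.strip().lower())
        if key in seen:
            problems.append(f"pair {i}: duplicate of an earlier pair")
        seen.add(key)
        text = f"{a} {b}".lower()
        for term in banned_vocab:
            if term.lower() in text:
                problems.append(f"pair {i}: contains project term {term!r}")
    return problems
```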

Phase 1 — Data Collection (2–3 hours)

For each model: open a clean session with no prior context. Disable personalization features. Paste the prompt exactly as written. Record only numerical scores — ignore commentary. After every 3 models, spot-check 2 random cells against the source response.
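The every-3-models spot-check cadence might look like this in code. The function name and seeding are assumptions:

```python
import random

def spot_check_cells(model_index, n_pairs=30, n_checks=2, seed=None):
    """After every 3rd model, pick `n_checks` random cells (pair IDs) to
    re-verify against the raw session response; return [] otherwise."""
    if model_index % 3 != 0:
        return []
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_pairs + 1), n_checks))
```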

Phase 2 — The Technician's Read (30 minutes)

Before any analysis model touches the data: check Tier 1 calibration, Tier 3 divergence gap, spread comparison, and write three observations. Date and time them. These are your anchor.
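The spread and gap computations can be sketched as follows. The exact definition of the Divergence Gap is not spelled out above, so this assumes it is the Tier 3 spread minus the Tier 1 spread; treat both the definition and the data shape as assumptions:

```python
import statistics

def tier_spread(scores_by_tier, tier):
    """Spread = mean per-pair standard deviation across models for one tier.
    `scores_by_tier[tier]` maps pair_id -> list of model scores (assumed shape).
    """
    sds = [statistics.pstdev(scores) for scores in scores_by_tier[tier].values()]
    return statistics.mean(sds)

def divergence_gap(scores_by_tier):
    """One plausible reading of the Divergence Gap: how much wider the models
    spread on Tier 3 foils than on Tier 1 ground truth."""
    return tier_spread(scores_by_tier, 3) - tier_spread(scores_by_tier, 1)
```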

Phase 3 — Multi-Lineage Analysis (1–2 hours)

Feed CSV plus analysis prompt to 3 models from different lineages. Context isolation is critical — each analysis model receives only the CSV data and analysis prompt. Extract structured findings. Lay them side by side. Do not synthesize.
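Laying findings side by side without synthesizing could be as simple as tabulating whatever structured keys each analysis model returned. The data shape is an assumption:

```python
def side_by_side(findings_by_model):
    """Tabulate structured findings, one row per finding key and one column
    per analysis model, with no merging or synthesis of the values."""
    keys = sorted({k for findings in findings_by_model.values() for k in findings})
    header = ["finding"] + list(findings_by_model)
    rows = [
        [key] + [str(findings_by_model[m].get(key, "-")) for m in findings_by_model]
        for key in keys
    ]
    return [header] + rows
```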

Phase 4 — Synthesis (1–2 hours)

You write this. Not a model. Design summary, raw findings, delusion baseline, domain comparison, what broke, context integrity notes, open questions.

Phase 5 — Reproducibility Package

Bundle workbook, exact prompts, session log, raw model responses, analysis outputs, Technician's Read, and synthesis. Every claim traceable to a specific cell in under sixty seconds.
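A hash manifest is one way to make every artifact in the package traceable to an exact file state. The directory layout is an assumption:

```python
import hashlib
import pathlib

def build_manifest(package_dir):
    """Sketch of a Phase 5 manifest: SHA-256 every file in the reproducibility
    package so any later claim can be checked against the exact bytes shipped."""
    manifest = {}
    for path in sorted(pathlib.Path(package_dir).rglob("*")):
        if path.is_file():
            manifest[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest
```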

What We're Looking For

The pilot succeeds if it produces interpretable signal on at least three of five vectors:

Does the staircase appear? (Tier 1 spread < Tier 2 spread)

Does the Divergence Gap produce a meaningful delusion baseline?

Do foils in different domains get caught at different rates?

Does Pair 27 (LLMs evaluating claims about LLM reasoning) produce anomalous scores? This is the meta-epistemic canary.

Does the Technician's Read capture anything the analysis models miss or contradict?
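The first vector, the staircase, reduces to a single comparison of per-tier spreads. A sketch, assuming each tier's scores arrive as lists of per-pair model scores:

```python
import statistics

def staircase_appears(tier1_pair_scores, tier2_pair_scores):
    """Success vector 1: models agree more on ground truth than on contested
    claims, i.e. Tier 1 spread < Tier 2 spread. Each argument is a list of
    per-pair score lists (one score per model; shape is an assumption)."""
    def spread(pairs):
        return statistics.mean(statistics.pstdev(s) for s in pairs)
    return spread(tier1_pair_scores) < spread(tier2_pair_scores)
```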

Relationship to the loss landscape framework: The BSA detects behavioral signal from the outside — what models do with contested claims. The loss landscape framework explains the signal from the inside — where in the weight structure the divergence originates. The Bridge Experiment connects the two instruments.