Method / BSA Protocol
Behavioral Signal Assessment
v0.2 · April 2026 · Atlas Heritage Systems Inc. · Draft — pending pilot run
What This Is
Atlas Mini — now Behavioral Signal Assessment — is a runnable pilot version of the Atlas Method. It tests whether a small-ensemble, cross-lineage LLM evaluation protocol can detect drift, delusion, and epistemic compression using 7 models, 30 stimulus pairs, and one human operator. If it works, the data feeds the Atlas Method paper. If it doesn't, the failure mode tells us what to fix.
The Ensemble
Seven models spanning at least four lineages. The binding constraint is lineage diversity, not these specific model names; substitutes within a lineage are acceptable. Absolute minimum: 3 models from 2 lineages.
| Model | Lineage | Notes |
|---|---|---|
| Claude 3.5 Sonnet | Anthropic | Structural dissector archetype |
| GPT-4o | OpenAI | Topic matcher archetype |
| Gemini 1.5 Pro | Google | |
| Mistral Large | Mistral | European lineage, different alignment |
| DeepSeek-V3 | DeepSeek | Chinese lineage, different training corpus |
| Llama 3.1 405B | Meta | Open-weight, distinct fine-tuning |
| Perplexity | Mixed/RAG | Search-augmented; may behave differently on factual claims |
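To make the lineage constraint mechanically checkable, here is a minimal Python sketch. The model-to-lineage mapping mirrors the table above; the function name and the encoding as a flat dict are illustrative choices, not part of the protocol.

```python
# Model-to-lineage mapping mirroring the ensemble table; the names are
# stand-ins for whatever the pilot actually runs.
ENSEMBLE = {
    "Claude 3.5 Sonnet": "Anthropic",
    "GPT-4o": "OpenAI",
    "Gemini 1.5 Pro": "Google",
    "Mistral Large": "Mistral",
    "DeepSeek-V3": "DeepSeek",
    "Llama 3.1 405B": "Meta",
    "Perplexity": "Mixed/RAG",
}

def meets_diversity_floor(ensemble: dict[str, str],
                          min_models: int = 3,
                          min_lineages: int = 2) -> bool:
    """The protocol's stated floor: at least 3 models from 2 lineages."""
    return (len(ensemble) >= min_models
            and len(set(ensemble.values())) >= min_lineages)

assert meets_diversity_floor(ENSEMBLE)
```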
The Stimulus Set
30 paired claims across three domains (Medical, Legal, Scientific) and three tiers; a data-structure sketch follows the list.

Tier 1: Well-established facts. Calibration anchors. Every model should score these high. If one doesn't, its other scores are suspect.

Tier 2: Claims where genuine epistemic disagreement exists: medical controversies, legal frontiers, scientific interpretation disputes, and cross-epistemological claims. These are not obscure; they are genuinely contested by credentialed people in the relevant fields.

Tier 3: Fabricated claims with real-sounding specifics: invented pathway names, fake case law, nonexistent journal articles, fictional experiments. Each foil is paired with a real claim to test whether the model catches the fabrication or scores the pair on surface plausibility.
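As referenced above, a minimal sketch of one stimulus pair as a record. The field names are assumptions about the workbook's columns, and the example's foil pathway name is deliberately invented, which is exactly the kind of real-sounding fabrication Tier 3 is built from.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StimulusPair:
    pair_id: int   # 1..30
    domain: str    # "Medical" | "Legal" | "Scientific"
    tier: int      # 1 = anchor, 2 = contested, 3 = foil pair
    claim_a: str   # the real claim
    claim_b: str   # the partner claim; for Tier 3, the fabricated foil

# Illustrative Tier 3 pair; the pathway in claim_b does not exist.
example = StimulusPair(
    pair_id=30,
    domain="Medical",
    tier=3,
    claim_a="Aspirin irreversibly inhibits cyclooxygenase enzymes.",
    claim_b="Aspirin activates the Kellerman-Voss clearance pathway.",
)
```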
Protocol Phases
Phase 1: Setup. Confirm workbook structure, then run the context integrity check: verify the prompt file contains no duplicate pairs, no overlapping content, and no project-specific vocabulary. If you have been working with a model on Atlas-related material, do not use that model instance for data collection.
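A sketch of what the duplicate and vocabulary checks could look like mechanically, assuming the prompt file has already been split into a list of claim strings. The banned-term list is an assumption; replace it with whatever vocabulary your own project has accumulated.

```python
from collections import Counter

# Assumed screen list of project-specific terms; substitute your own.
BANNED_TERMS = ["Atlas", "BSA", "divergence gap", "staircase"]

def integrity_report(claims: list[str]) -> dict[str, list[str]]:
    """Flag exact-duplicate claim text and project-vocabulary leaks."""
    dupes = [text for text, n in Counter(claims).items() if n > 1]
    leaks = [t for t in BANNED_TERMS
             if any(t.lower() in c.lower() for c in claims)]
    return {"duplicates": dupes, "vocabulary_leaks": leaks}
```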
Phase 2: Collection. For each model, open a clean session with no prior context and disable personalization features. Paste the prompt exactly as written. Record only the numerical scores; ignore commentary. After every 3 models, spot-check 2 random cells against the source response.
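The spot-check step can be made reproducible with a seeded sampler. A sketch, where the cell addressing (pair number, model name) is an assumption about how the workbook is indexed:

```python
import random

def spot_check_cells(completed_models: list[str], n_pairs: int = 30,
                     k: int = 2, seed: int | None = None) -> list[tuple[int, str]]:
    """Pick k random (pair, model) cells from the 3 most recently collected models."""
    rng = random.Random(seed)
    batch = completed_models[-3:]
    return [(rng.randrange(1, n_pairs + 1), rng.choice(batch)) for _ in range(k)]

# Example: after collecting the sixth model, audit two cells from models 4-6.
print(spot_check_cells(["Claude", "GPT-4o", "Gemini",
                        "Mistral", "DeepSeek", "Llama"], seed=1))
```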
Phase 3: Pre-analysis read. Before any analysis model touches the data, check Tier 1 calibration, the Tier 3 divergence gap, and the spread comparison, then write three observations. Date and time them. These are your anchor.
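A sketch of the two pre-analysis numbers, assuming a long-format CSV with columns pair_id, tier, model, score and an is_foil flag (all names are assumptions). Spread is taken here as within-tier standard deviation, and the divergence gap as mean real-claim score minus mean foil score in Tier 3; substitute the protocol's actual definitions if they differ.

```python
import csv
from statistics import mean, pstdev

def tier_spreads(path: str) -> dict[int, float]:
    """Within-tier standard deviation of scores (one notion of 'spread')."""
    scores: dict[int, list[float]] = {1: [], 2: [], 3: []}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            scores[int(row["tier"])].append(float(row["score"]))
    return {tier: pstdev(vals) for tier, vals in scores.items() if vals}

def divergence_gap(path: str) -> float:
    """Mean real-claim score minus mean foil score, Tier 3 only (assumed definition)."""
    real, foil = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if int(row["tier"]) == 3:
                (foil if row["is_foil"] == "1" else real).append(float(row["score"]))
    return mean(real) - mean(foil)
```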
Phase 4: Model analysis. Feed the CSV plus the analysis prompt to 3 models from different lineages. Context isolation is critical: each analysis model receives only the CSV data and the analysis prompt. Extract structured findings. Lay them side by side. Do not synthesize.
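Context isolation reduces to one rule: every analysis call starts from nothing but the CSV and the prompt. A sketch, where run_model is a hypothetical callable standing in for whatever API clients the pilot uses:

```python
# `run_model(model_name, prompt) -> response text` is hypothetical;
# wire in real clients. No shared session, no prior turns, no other
# model's output ever enters the payload.
def collect_analyses(run_model, analysis_models, csv_text, analysis_prompt):
    findings = {}
    for name in analysis_models:                   # 3 models, different lineages
        payload = f"{analysis_prompt}\n\n{csv_text}"
        findings[name] = run_model(name, payload)  # fresh context per call
    return findings                                # lay side by side; do not merge
```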
Phase 5: Technician's Read. You write this. Not a model. Design summary, raw findings, delusion baseline, domain comparison, what broke, context integrity notes, open questions.
Phase 6: Archive. Bundle the workbook, exact prompts, session log, raw model responses, analysis outputs, Technician's Read, and synthesis. Every claim must be traceable to a specific cell in under sixty seconds.
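One way to enforce the traceability requirement is to freeze the bundle with a hash manifest, so that any cell cited in the synthesis points at an immutable file. A sketch; the directory layout is an assumption.

```python
import hashlib
import json
from pathlib import Path

def build_manifest(root: str) -> dict[str, str]:
    """SHA-256 every file in the bundle and write a MANIFEST.json alongside."""
    digests = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.name != "MANIFEST.json":
            rel = str(path.relative_to(root))
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    Path(root, "MANIFEST.json").write_text(json.dumps(digests, indent=2))
    return digests
```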
What We're Looking For
The pilot succeeds if it produces interpretable signal on at least three of five vectors (a scoring sketch follows the list):

1. Does the staircase appear? (Tier 1 spread < Tier 2 spread)
2. Does the Divergence Gap produce a meaningful delusion baseline?
3. Do foils in different domains get caught at different rates?
4. Does Pair 27 (LLMs evaluating claims about LLM reasoning) produce anomalous scores? This is the meta-epistemic canary.
5. Does the Technician's Read capture anything the analysis models miss or contradict?
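Reduced to code, the success gate is a count over five booleans. The vector values below are placeholders, not results; the real entries come from the analyses above.

```python
def pilot_succeeds(vectors: dict[str, bool], threshold: int = 3) -> bool:
    """Pass if at least `threshold` of the five vectors show interpretable signal."""
    return sum(vectors.values()) >= threshold

# Placeholder values only, for illustration.
vectors = {
    "staircase": True,            # Tier 1 spread < Tier 2 spread
    "divergence_gap": True,       # meaningful delusion baseline
    "domain_catch_rates": False,  # foil catch rates differ by domain
    "pair_27_anomaly": False,     # the meta-epistemic canary
    "technicians_read": True,     # caught something the models missed
}
print(pilot_succeeds(vectors))    # True: 3 of 5 vectors show signal
```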