BSA Pilot Results

Pending execution. The Behavioral Signal Assessment (BSA) protocol is complete, but the pilot run has not yet been conducted.

The BSA pilot differs from the Ensemble Divergence experiment in its instrument design. Rather than scoring semantic similarity across many models, BSA uses a tiered stimulus set (ground truth calibration anchors, genuinely contested claims, and fabricated foils). The set is scored by a small ensemble drawn deliberately from architecturally distinct lineages rather than from translation-optimized models.

Small sample size. That's the point. The protocol tests whether disciplined human arbitration of a small cross-lineage ensemble produces interpretable signal that single-model evaluation cannot.

Design

Models: 7 models from 4+ architecturally distinct lineages

Stimulus pairs: 30 pairs across three tiers

Tier 1: Ground truth calibration anchors (every model should agree)

Tier 2: Genuinely contested claims (epistemic disagreement among credentialed people)

Tier 3: Fabricated foils with real-sounding specifics (tests confabulation)

Arbiter: Human operator; Technician's Read before any analysis model touches data

Status: Protocol complete; run pending
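
The design parameters above can be written down as a small pre-run check. What follows is a minimal sketch, not the protocol's actual tooling: the names `StimulusPair`, `validate_stimulus_set`, and `validate_ensemble` are hypothetical, and the even per-tier split in the demo is an assumption made only for illustration, since the design states 30 pairs across three tiers without giving per-tier counts.

```python
from collections import Counter
from dataclasses import dataclass

# Tier labels taken from the design; the numeric keys are an assumed encoding.
TIER_LABELS = {
    1: "ground truth calibration anchor",
    2: "genuinely contested claim",
    3: "fabricated foil",
}


@dataclass(frozen=True)
class StimulusPair:
    pair_id: str  # hypothetical identifier, e.g. "p01"
    tier: int     # 1, 2, or 3


def validate_stimulus_set(pairs, expected_total=30):
    """Return a list of problems; an empty list means the set fits the design."""
    problems = []
    if len(pairs) != expected_total:
        problems.append(f"expected {expected_total} pairs, found {len(pairs)}")
    counts = Counter(p.tier for p in pairs)
    for tier in TIER_LABELS:
        if counts[tier] == 0:
            problems.append(f"no pairs in tier {tier}")
    for tier in counts:
        if tier not in TIER_LABELS:
            problems.append(f"unknown tier {tier}")
    return problems


def validate_ensemble(lineage_by_model, expected_models=7, min_lineages=4):
    """Check the ensemble: 7 models drawn from at least 4 distinct lineages."""
    problems = []
    if len(lineage_by_model) != expected_models:
        problems.append(
            f"expected {expected_models} models, found {len(lineage_by_model)}"
        )
    n_lineages = len(set(lineage_by_model.values()))
    if n_lineages < min_lineages:
        problems.append(f"only {n_lineages} lineages, need {min_lineages}+")
    return problems


if __name__ == "__main__":
    # Illustrative only: an even 10/10/10 split is assumed, not specified.
    demo_pairs = [StimulusPair(f"p{i:02d}", i % 3 + 1) for i in range(30)]
    demo_models = {f"model_{i}": f"lineage_{i % 4}" for i in range(7)}
    print(validate_stimulus_set(demo_pairs))
    print(validate_ensemble(demo_models))
```

Returning a list of problems rather than raising on the first one is a design choice that would let the operator note every deviation in a single pass.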

What This Page Will Contain

Raw scores by model and tier

The staircase pattern — Tier 1 spread vs Tier 2 spread

Divergence Gap by model (T3 mean − T2 mean)

Domain comparison — Medical vs Legal vs Scientific foil catch rates

Technician's Read vs analysis model findings

Context integrity log — deviations from protocol, saturation events
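
Two of the quantities listed above can be sketched ahead of the run. The Divergence Gap follows the definition given (T3 mean minus T2 mean, per model). The spread measure is an assumption: the page does not define "spread", so this sketch uses the per-pair range (max minus min across models), averaged over the tier. The record layout and every score below are invented placeholders, not pilot data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (model, tier, pair_id, score).
# Placeholder values for illustration only; no pilot results exist yet.
records = [
    ("model_a", 1, "p01", 0.9),
    ("model_b", 1, "p01", 0.9),
    ("model_a", 2, "p11", 0.4),
    ("model_b", 2, "p11", 0.8),
    ("model_a", 3, "p21", 0.2),
    ("model_b", 3, "p21", 0.6),
]


def tier_means_by_model(records):
    """Return {model: {tier: mean score}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for model, tier, _pair, score in records:
        buckets[model][tier].append(score)
    return {
        model: {tier: mean(scores) for tier, scores in tiers.items()}
        for model, tiers in buckets.items()
    }


def divergence_gap(tier_means):
    """Divergence Gap per model: T3 mean minus T2 mean."""
    return {
        model: tiers[3] - tiers[2]
        for model, tiers in tier_means.items()
        if 2 in tiers and 3 in tiers
    }


def tier_spread(records, tier):
    """Assumed spread: mean over pairs of (max - min) score across models."""
    by_pair = defaultdict(list)
    for _model, t, pair, score in records:
        if t == tier:
            by_pair[pair].append(score)
    return mean(max(scores) - min(scores) for scores in by_pair.values())


gaps = divergence_gap(tier_means_by_model(records))
staircase = {tier: tier_spread(records, tier) for tier in (1, 2)}
```

Comparing `staircase[1]` with `staircase[2]` gives the Tier 1 versus Tier 2 spread comparison under this assumed definition. A different spread statistic, such as standard deviation, could be swapped in without changing the rest of the sketch.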

See the Ensemble Divergence Experiment for completed results from the 20-model semantic similarity run — a separate instrument from the BSA protocol.