BSA Pilot Results
The BSA pilot differs from the Ensemble Divergence experiment in design. Instead of scoring semantic similarity across many models, BSA uses a tiered stimulus set (ground truth calibration anchors, genuinely contested claims, and fabricated foils) scored by a small ensemble drawn deliberately from architecturally distinct lineages rather than from translation-optimized models.
Small sample size. That's the point. The protocol tests whether disciplined human arbitration of a small cross-lineage ensemble produces interpretable signal that single-model evaluation cannot.
Design
7 models from 4+ architecturally distinct lineages
30 pairs across three tiers
Tier 1: ground truth calibration anchors, where every model should agree
Tier 2: genuinely contested claims, where credentialed people disagree on epistemic grounds
Tier 3: fabricated foils with real-sounding specifics, which test for confabulation
Human operator: completes the Technician's Read before any analysis model touches the data
Status: protocol complete; run pending
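The design constraints above can be checked mechanically before a run. The following is a minimal sketch, not the pilot's actual tooling: the names `Tier`, `Stimulus`, and `check_design` are hypothetical, the tier numbering assumes the three tiers are numbered in the order listed, and each of the 30 pairs is represented only by an identifier and a tier.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    # Assumed numbering: tiers follow the order given in the design list.
    ANCHOR = 1     # ground truth calibration anchors
    CONTESTED = 2  # genuinely contested claims
    FOIL = 3       # fabricated foils


@dataclass(frozen=True)
class Stimulus:
    stimulus_id: str  # hypothetical identifier for one stimulus pair
    tier: Tier


def check_design(model_lineages, stimuli, n_models=7, min_lineages=4, n_stimuli=30):
    """Return a list of design violations; an empty list means the design matches.

    model_lineages maps a model name to a lineage label (both hypothetical).
    """
    problems = []
    if len(model_lineages) != n_models:
        problems.append(f"expected {n_models} models, got {len(model_lineages)}")
    lineages = set(model_lineages.values())
    if len(lineages) < min_lineages:
        problems.append(f"expected at least {min_lineages} lineages, got {len(lineages)}")
    if len(stimuli) != n_stimuli:
        problems.append(f"expected {n_stimuli} stimuli, got {len(stimuli)}")
    missing = set(Tier) - {s.tier for s in stimuli}
    if missing:
        problems.append(f"no stimuli in tiers: {sorted(t.name for t in missing)}")
    return problems


# Placeholder ensemble and stimulus set, for illustration only.
models = {f"model_{i}": f"lineage_{i % 4}" for i in range(7)}
stimuli = [Stimulus(f"s{i:02d}", Tier(i % 3 + 1)) for i in range(30)]
print(check_design(models, stimuli))
```

Returning a list of violations rather than raising on the first one lets the operator record every deviation in one pass.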
What This Page Will Contain
Raw scores by model and tier
The staircase pattern — Tier 1 spread vs Tier 2 spread
Divergence Gap by model (T3 mean − T2 mean)
Domain comparison — Medical vs Legal vs Scientific foil catch rates
Technician's Read vs analysis model findings
Context integrity log — deviations from protocol, saturation events
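The two derived quantities in the list above, per-tier spread and the Divergence Gap, could be computed along these lines. This is a sketch under stated assumptions: the scores are made-up placeholders on an assumed 0 to 1 scale, not pilot data; the model names are hypothetical; and "spread" is assumed here to mean the range (max minus min) of per-model mean scores within a tier, which the protocol may define differently. Only the Divergence Gap formula (T3 mean − T2 mean) comes from the text.

```python
from statistics import mean

# Placeholder scores, NOT pilot results: scores[model][tier] -> per-item scores.
scores = {
    "model_a": {1: [1.0, 1.0, 1.0], 2: [0.6, 0.4, 0.5], 3: [0.2, 0.1, 0.3]},
    "model_b": {1: [1.0, 1.0, 0.7], 2: [0.8, 0.6, 0.7], 3: [0.6, 0.5, 0.7]},
}


def divergence_gap(model_scores):
    """Divergence Gap for one model: Tier 3 mean minus Tier 2 mean."""
    return mean(model_scores[3]) - mean(model_scores[2])


def tier_spread(all_scores, tier):
    """Assumed spread definition: range of per-model mean scores in one tier."""
    per_model_means = [mean(s[tier]) for s in all_scores.values()]
    return max(per_model_means) - min(per_model_means)


for name, model_scores in scores.items():
    print(name, "divergence gap:", round(divergence_gap(model_scores), 3))

# The staircase comparison: Tier 1 spread next to Tier 2 spread.
print("tier 1 spread:", round(tier_spread(scores, 1), 3))
print("tier 2 spread:", round(tier_spread(scores, 2), 3))
```

Swapping the range for a standard deviation would only require changing the body of `tier_spread`.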