Loss Landscape Vocabulary Framework
v12 · April 2026 · Atlas Heritage Systems Inc. · Working document — not a finished product
Global Geometry
Properties that only appear when you look at the loss landscape as a whole: how minima connect, when behavior changes regime, and which weight configurations are secretly the same solution. These are not local qualifiers — they live between basins and across training trajectories. No single-point terrain or navigator measurement can see them. Promoted from coverage gaps (v11) to first-class terms (v12) following Nemotron-3-Super-120B adversarial review, April 2026.
Mode connectivity
Whether two minima are joined by a low-loss path in parameter space. Connected basins admit smooth interpolation between solutions; isolated basins require climbing a high-loss barrier to move between them. Connectivity is a global property no local qualifier can capture: whether an apparently separate minimum is part of a larger connected valley is invisible from inside either minimum. A key open question for the Atlas framework: are the archaeological sinks isolated basins or branches of a connected low-loss manifold?
Garipov et al. (2018) loss surfaces, mode connectivity, and fast ensembling; Draxler et al. (2018) essentially no barriers in neural network energy landscape
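The simplest connectivity probe is the loss barrier along a straight line between two checkpoints. A minimal sketch with a hypothetical one-parameter toy loss that has two minima (the curved low-loss paths of Garipov et al. need more machinery, but the barrier measurement itself looks the same):

```python
import numpy as np

def loss_barrier(loss_fn, w_a, w_b, n_points=25):
    """Max loss along the straight line between two minima, relative to
    the higher endpoint. A near-zero barrier suggests the two minima sit
    in one connected low-loss region; a large barrier suggests isolation."""
    ts = np.linspace(0.0, 1.0, n_points)
    path_losses = np.array([loss_fn((1 - t) * w_a + t * w_b) for t in ts])
    return path_losses.max() - max(path_losses[0], path_losses[-1])

# Toy landscape with minima at w = -1 and w = +1 separated by a bump at w = 0.
toy_loss = lambda w: float(((w**2 - 1.0) ** 2).sum())
w_a, w_b = np.array([-1.0]), np.array([1.0])
print(loss_barrier(toy_loss, w_a, w_b))  # barrier of 1.0, hit at the midpoint w = 0
```

For real networks, `loss_fn` would evaluate a model on a held-out batch after loading the interpolated weights; the linear path is a lower bound on connectivity, since a high linear barrier can still hide a curved low-loss route.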
Symmetry orbits
Families of weight configurations that implement the same function because of architectural symmetries. Permuting hidden neurons, rescaling a layer while inversely rescaling the next, or flipping sign conventions can move the model far in parameter space while leaving behavior unchanged. Symmetry orbits explain why many distinct-looking minima are functionally the same basin. Distance in parameter space is not distance in function space without accounting for symmetry orbits first.
Entezari et al. (2022) the role of permutation invariance in linear mode connectivity; Ainsworth et al. (2022) git re-basin: merging models modulo permutation symmetries
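The cheapest symmetry to verify directly is neuron permutation. A self-contained sketch with a toy two-layer ReLU network (all shapes and weights hypothetical): permuting the hidden units, together with the matching columns of the output layer, moves the model in parameter space but leaves every output identical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

def mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU network: y = W2 relu(W1 x + b1) + b2."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Permute the hidden units: rows of W1 and b1, matching columns of W2.
perm = rng.permutation(d_hidden)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=d_in)
print(np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1p, b1p, W2p, b2)))  # True
```

With `d_hidden` hidden units this single layer already has `d_hidden!` functionally identical weight settings, which is why the git re-basin line of work aligns permutations before comparing or merging checkpoints.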
Phase transitions
Regime shifts in model behavior that emerge as a training control parameter crosses a threshold, often with smooth loss but discontinuous internal structure. Grokking is the canonical example: performance snaps from memorization-like to generalization-like after extended training at nearly constant loss. Phase transitions are global training phenomena, not single points in the landscape. The loss surface does not signal them; behavioral metrics (accuracy, representation similarity) do. The transition happens in algorithm space while the terrain barely moves.
Power et al. (2022) grokking: generalization beyond overfitting; Nakkiran et al. (2019) deep double descent: where bigger models and more data hurt
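Because the loss surface does not signal the transition, detection has to come from behavioral curves. A hedged sketch, assuming per-step training-loss and validation-accuracy arrays are already logged (the function name and thresholds are illustrative, not a standard API): flag the largest accuracy jump that happens while the loss barely moves.

```python
import numpy as np

def transition_step(val_acc, train_loss, window=5, loss_tol=0.05):
    """Return the step with the largest jump in validation accuracy over a
    sliding window, restricted to windows where training loss barely moved:
    the grokking signature of behavioral change on a quiet loss surface."""
    acc_jump = val_acc[window:] - val_acc[:-window]
    loss_move = np.abs(train_loss[window:] - train_loss[:-window])
    candidates = np.where(loss_move < loss_tol, acc_jump, -np.inf)
    return int(np.argmax(candidates)) + window

# Synthetic curves: loss flat and low throughout, accuracy snaps up at step 60.
steps = np.arange(100)
train_loss = 0.01 + 0.001 * np.exp(-steps / 10)
val_acc = np.where(steps < 60, 0.1, 0.95)
print(transition_step(val_acc, train_loss))  # grokking-like jump flagged at step 60
```

On real logs the loss tolerance and window need tuning per run, and representation-similarity curves can replace accuracy when the behavioral change is internal rather than task-visible.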
Algorithmic branches
Distinct functional algorithms that live in the same broad basin. Two checkpoint paths can converge to similar loss and perplexity while encoding different internal computations: different circuits, different attention patterns, different representation geometry. Branch structure captures the fact that a single connected region of low loss can contain multiple algorithmic solutions. Where a phase transition describes when a model changes algorithm during training, algorithmic branches describe co-existing algorithms within the same low-loss region after training.
Schwarzschild et al. (2021) can you learn an algorithm? generalizing from easy to hard problems with recurrent networks; related work on functional clustering of representations and mechanistic interpretability circuit analysis
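One way to test whether two equal-loss checkpoints sit on different branches is to compare their representation geometry, for example with linear centered kernel alignment (CKA; Kornblith et al. 2019). A sketch on synthetic activation matrices (shapes and seed are illustrative):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices (n_examples x n_features).
    Near 1: similar representation geometry. Low values between equal-loss
    checkpoints hint at distinct algorithmic branches."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(1)
acts = rng.normal(size=(64, 16))
rot = np.linalg.qr(rng.normal(size=(16, 16)))[0]           # random orthogonal matrix
print(round(linear_cka(acts, acts @ rot), 3))               # rotation-invariant: 1.0
print(round(linear_cka(acts, rng.normal(size=(64, 16))), 3))  # unrelated activations score much lower
```

The rotation invariance is the point: CKA ignores exactly the orthogonal symmetries that make raw parameter distance misleading, so a low score between two converged runs is evidence of genuinely different internal computation rather than a symmetry-orbit artifact.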