u/141_1337



## Abstract

Post-training makes language models more decisive without necessarily making them more accurate — and we find a structural reason why.

Across staged post-training checkpoints from three architecture families, we measure the layer at which a transformer becomes **causally committed** to its next-token prediction, and track how that boundary evolves through supervised fine-tuning, preference optimization, and reinforcement learning.
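The excerpt doesn't spell out how the commitment layer is located. Below is a minimal sketch of one way such a causal probe can be set up, assuming a perturbation-style intervention on the residual stream; the model choice (GPT-2), the Gaussian-noise intervention, and the top-1-flip criterion are illustrative assumptions, not the paper's protocol.

```python
# Sketch: find the earliest layer after which perturbing the residual stream
# no longer flips the model's top-1 next-token choice. Illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def top1_with_perturbation(input_ids, layer_idx, noise_scale=1.0):
    """Top-1 next-token id when Gaussian noise is added to the output of block `layer_idx`."""
    def hook(module, inputs, output):
        hidden = output[0]  # GPT-2 blocks return a tuple; element 0 is the residual stream
        noisy = hidden + noise_scale * torch.randn_like(hidden)
        return (noisy,) + output[1:]
    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    handle.remove()
    return logits.argmax().item()

def commitment_layer(prompt, noise_scale=1.0, trials=5):
    """Earliest layer at which no perturbation trial changes the clean top-1 prediction."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        clean_top1 = model(input_ids).logits[0, -1].argmax().item()
    for layer in range(len(model.transformer.h)):
        flipped = any(
            top1_with_perturbation(input_ids, layer, noise_scale) != clean_top1
            for _ in range(trials)
        )
        if not flipped:
            return layer
    return len(model.transformer.h)  # never commits under this perturbation strength

print(commitment_layer("The capital of France is"))
```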

**Base models** already exhibit a rough commitment structure.

**Supervised fine-tuning** refines this into a sharp boundary — suppressing early-layer causal influence and concentrating commitment into the later layers.

**But once the boundary stabilizes, reinforcement learning does not move it:** across three families and four RL methods, the commitment layer shifts by 0–1 layers.

What RL *does* change is how decisively the model locks in at that fixed point: the geometry at the commitment layer compresses monotonically, becoming lower-dimensional and more concentrated with each post-training stage.
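The excerpt doesn't name the paper's geometry metric. One common way to operationalize "the geometry compresses" is an effective-dimensionality measure such as the participation ratio of the hidden-state covariance spectrum; the sketch below uses that metric purely as an assumed stand-in, applied at a fixed layer across checkpoints.

```python
# Sketch: effective dimensionality of hidden states at one layer. Comparing this
# value across post-training checkpoints is the assumed usage; the metric choice
# (participation ratio) is an illustration, not necessarily the paper's.
import torch

def participation_ratio(hidden_states: torch.Tensor) -> float:
    """Effective dimensionality (sum lambda)^2 / sum(lambda^2) of the hidden-state covariance."""
    centered = hidden_states - hidden_states.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (centered.shape[0] - 1)
    eigvals = torch.linalg.eigvalsh(cov).clamp(min=0)
    return (eigvals.sum() ** 2 / (eigvals ** 2).sum()).item()

def layer_states(model, tokenizer, prompts, layer_idx):
    """Last-token hidden states at `layer_idx` for a list of prompts (n_prompts x d_model)."""
    rows = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        rows.append(out.hidden_states[layer_idx][0, -1])
    return torch.stack(rows)

# Hypothetical usage with base and RL checkpoints: a lower participation ratio after
# RL would indicate a more compressed, lower-dimensional geometry at the commitment layer.
# pr_base = participation_ratio(layer_states(base_model, tok, prompts, commit_layer))
# pr_rl   = participation_ratio(layer_states(rl_model, tok, prompts, commit_layer))
```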

The earlier layers, where the model assembles candidate answers, remain largely unchanged. Weight-matrix rank is nearly constant across all stages and architectures, and an independent logit-lens measurement…
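For reference, here is a minimal sketch of the two diagnostics the last paragraph names, assuming a GPT-2-style model: a numerical weight-matrix rank, and a logit-lens readout that projects each layer's residual stream through the final layer norm and unembedding. The thresholds and module paths are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: numerical rank of a weight matrix, and a logit-lens pass over all layers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def numerical_rank(weight: torch.Tensor, tol_ratio: float = 1e-3) -> int:
    """Number of singular values above tol_ratio * largest singular value."""
    s = torch.linalg.svdvals(weight.detach())
    return int((s > tol_ratio * s[0]).sum())

def logit_lens_top1(prompt: str):
    """Top-1 token implied by each layer's residual stream at the final position."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    tokens = []
    for hidden in out.hidden_states:                   # embeddings + one entry per block
        resid = model.transformer.ln_f(hidden[0, -1])  # final layer norm
        logits = model.lm_head(resid)                  # unembedding
        tokens.append(tokenizer.decode(logits.argmax().item()))
    return tokens

# e.g. rank of one MLP projection; tracking this across checkpoints is the assumed usage
print(numerical_rank(model.transformer.h[0].mlp.c_fc.weight))
print(logit_lens_top1("The capital of France is"))
```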
