[Figure: training loss vs. steps, in three phases]
I. Sharp drop: token frequencies, punctuation
II. Steady descent: grammar → facts → reasoning
III. Plateau: diminishing returns, data/model saturation