A transformer's self-attention matrix lights up across layers: scaled query-key dot products form a heatmap of scores showing which tokens attend to which, softmax normalizes each row of scores into attention weights over the value vectors, and multiple heads run in parallel, their outputs concatenated and projected to carry each input token into its final representation.
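To make that pipeline concrete, here is a minimal NumPy sketch of scaled dot-product attention and the multi-head wrapper around it. The shapes, weight names (`W_q`, `W_k`, `W_v`, `W_o`), and the choice of `num_heads` are illustrative assumptions, not any particular model's configuration; a real implementation would also include masking, dropout, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Scores: one row per query token, one column per key token,
    # scaled by sqrt(d_k) to keep dot products in a stable range.
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    # Softmax over each row turns scores into attention weights;
    # this weight matrix is the heatmap described above.
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); each projection matrix: (d_model, d_model).
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project the inputs, then split the feature dimension into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head).
    def split(W):
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    out, weights = scaled_dot_product_attention(Q, K, V)
    # Concatenate the heads back together and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ W_o, weights

# Hypothetical sizes for a quick demonstration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 32, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)      # (5, 32): one refined representation per token
print(weights.shape)  # (4, 5, 5): per-head attention heatmaps; each row sums to 1
```

Each of the four `(5, 5)` slices of `weights` is exactly the per-head heatmap the text describes, and stacking one such map per layer gives the picture of attention lighting up across the network.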