4 Comments

I looked into Question 4 as well. When you compare encoder-only and decoder-only transformers, neither uses the extra "Encoder-Decoder Attention" layer, so they are almost identical in structure. The only difference is that the encoder uses ordinary (bidirectional) self-attention, whereas the decoder uses masked self-attention. From my understanding, though, the masking is only relevant at training time, so that the decoder doesn't "cheat" by attending to future (known) tokens. Since at inference time the decoder by definition has no future tokens to look at, I think the difference really only matters for training (and maybe also when "ingesting" the prompt). That would explain why encoder-only models such as BERT are seen as better at classification (they look at the whole sequence), whereas decoder-only models such as GPT are better at text generation.
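Here's a minimal NumPy sketch of that structural difference (my own toy example, not code from the post): with no mask, every token attends to the whole sequence (encoder-style), while a causal mask restricts each token to itself and the tokens before it (decoder-style).

```python
# Toy illustration (my own example): the only structural difference discussed
# above is whether the attention scores are masked.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal=False):
    # Single-head self-attention over a (seq_len, d_model) matrix,
    # using x itself as queries, keys, and values to keep the toy small.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if causal:
        # Decoder-style: mask out future positions so token i only sees tokens 0..i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))                      # 5 toy token vectors, dim 8
encoder_style = self_attention(tokens, causal=False)  # every token sees the full sequence
decoder_style = self_attention(tokens, causal=True)   # each token sees only itself and earlier tokens
```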


That's my understanding now as well! The only thing I'd add is that the masking is still relevant at inference, because you have to compute the intermediate layers for the prior tokens. For instance, to get (the residual stream after applying) layer 5 at token 30, you do calculations on layer 4 at tokens 1-30, and to get layer 4 at token 29 you need calculations on layer 3 at tokens 1-29, so the masking is used even for inference.
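To make that concrete, here's a self-contained toy sketch (again my own NumPy example, not code from the post): in a stack of masked self-attention layers, every layer's output at position i is built from the previous layer's outputs at positions up to i, so the prior tokens' intermediate states are always computed under the mask, and appending later tokens never changes them.

```python
# Toy illustration (my own example) of why the mask matters at inference time.
import numpy as np

def causal_attention(h):
    # One masked self-attention layer: token i attends only to tokens 0..i.
    scores = h @ h.T / np.sqrt(h.shape[-1])
    scores = np.where(np.triu(np.ones_like(scores, dtype=bool), k=1), -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ h

def decoder_stack(x, n_layers=5):
    # Layer k at position i depends on layer k-1 at positions 0..i, so the
    # prior tokens' residual streams must be computed (causally) before the
    # newest token's layers can be computed, at inference just like at training.
    h = x
    for _ in range(n_layers):
        h = h + causal_attention(h)    # residual + masked attention
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))            # 5 toy tokens, dimension 8
full = decoder_stack(x)                # run on all 5 tokens
prefix = decoder_stack(x[:3])          # run on just the first 3 tokens
# Thanks to the mask, token 3's state never depends on tokens 4 and 5:
assert np.allclose(full[:3], prefix)
```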


On Open Question 2, this paper ("Using the Output Embedding to Improve Language Models") possibly explains why: https://arxiv.org/pdf/1608.05859.pdf

Comment deleted
May 5, 2023

Good spotting, thank you! Should be fixed now.
