I looked into Question 4 as well. When comparing encoder-only and decoder-only transformers, neither uses the extra "encoder-decoder attention" layer, so they are almost identical in structure. The only difference is that the encoder uses normal (bidirectional) self-attention, whereas the decoder uses masked (causal) self-attention. From my understanding, this mainly matters at training time, where the mask stops the decoder from "cheating" by attending to future (known) tokens. Since at inference time the decoder by definition has no future tokens to look at, I think the difference mostly shows up during training (and perhaps when "ingesting" the prompt). This would explain why encoder-only models such as BERT are seen as better at classification (they look at the whole sequence), whereas decoder-only models such as GPT are better at text generation.
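To make that structural point concrete, here is a minimal sketch of a single attention head in PyTorch; the only change between the encoder-style and decoder-style variants is whether a causal mask is applied to the attention scores before the softmax. The names (`SelfAttention`, `causal`, `d_model`) are illustrative assumptions, not taken from any particular codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention; causal=True gives the decoder-style masked variant."""

    def __init__(self, d_model: int, causal: bool):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.causal = causal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)  # (batch, seq, seq)
        if self.causal:
            # Decoder-only (GPT-style): token i may only attend to tokens 0..i
            seq = x.shape[1]
            mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        # Encoder-only (BERT-style): no mask, every token attends to the whole sequence
        return F.softmax(scores, dim=-1) @ v

encoder_attn = SelfAttention(d_model=64, causal=False)  # BERT-style
decoder_attn = SelfAttention(d_model=64, causal=True)   # GPT-style
```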
That's my understanding now as well! The only thing I'd add is that the masking is still relevant at inference, because you have to compute the intermediate layers for the prior tokens. For instance, to get (the residual stream after applying) layer 5 at token 30, you run layer 4 over tokens 1-30; but to get layer 4 at token 29, you need layer 3 over tokens 1-29 only. So the mask is applied even during inference.
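One way to see this is to check that, with the causal mask, the representation at a given position does not change when later tokens are appended, whereas without the mask it does. A rough sketch, reusing the hypothetical `SelfAttention` module from above (note the code is 0-indexed while the comment above counts tokens from 1):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 30, 64)  # a "prompt" of 30 tokens

masked = SelfAttention(d_model=64, causal=True)
unmasked = SelfAttention(d_model=64, causal=False)

with torch.no_grad():
    # Run one masked layer on the first 29 tokens and on all 30 tokens
    out_29 = masked(x[:, :29])
    out_30 = masked(x)
    # With the causal mask, the outputs at the first 29 positions are identical
    # in both runs, so cached activations for earlier tokens stay valid.
    print(torch.allclose(out_29, out_30[:, :29], atol=1e-6))  # expected: True

    out_29_u = unmasked(x[:, :29])
    out_30_u = unmasked(x)
    # Without the mask, appending token 30 changes the earlier positions' outputs.
    print(torch.allclose(out_29_u, out_30_u[:, :29], atol=1e-6))  # expected: False
```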
On Open Question 2, this paper ("Using the Output Embedding to Improve Language Models") possibly explains why: https://arxiv.org/pdf/1608.05859.pdf
Good spotting, thank you! Should be fixed now.