I looked into Question 4 as well. Neither encoder-only nor decoder-only transformers use the extra "Encoder-Decoder Attention" layer, so the two are almost identical in structure. The only difference is that the encoder uses ordinary (bidirectional) self-attention, whereas the decoder uses masked (causal) self-attention. From my understanding, the mask matters mainly at training time, where it stops the decoder from "cheating" by attending to future (known) tokens. At inference time, the token being generated by definition has no future tokens to look at, so I think the difference is mostly one of training (and of how the prompt is "ingested", since earlier prompt positions still cannot attend to later ones). This would explain why "encoder-only" models such as BERT are seen as better at classification (every position sees the whole sequence), whereas "decoder-only" models such as GPT are better at text generation.
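To make the masking point concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not how BERT or GPT are actually implemented: a single head, identity projections (Q = K = V = the input), and no learned weights. The names `self_attention`, `tokens`, and `changed` are made up for illustration.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, causal):
    """Toy single-head self-attention with Q = K = V = x (an assumption)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Causal (decoder-style): position i sees positions 0..i only.
        # Bidirectional (encoder-style): position i sees every position.
        visible = range(i + 1) if causal else range(n)
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in visible
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * x[j][k] for w, j in zip(weights, visible)) for k in range(d)]
        )
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = tokens[:2] + [[-2.0, 3.0]]  # same prefix, different last token

enc_a = self_attention(tokens, causal=False)
enc_b = self_attention(changed, causal=False)
dec_a = self_attention(tokens, causal=True)
dec_b = self_attention(changed, causal=True)

# With the mask, earlier positions are unaffected by a later token.
prefix_unchanged_decoder = dec_a[:2] == dec_b[:2]
# Without the mask, changing the last token changes earlier outputs too.
prefix_unchanged_encoder = enc_a[:2] == enc_b[:2]
# The last position sees the whole sequence either way.
last_position_matches = all(
    math.isclose(p, q) for p, q in zip(enc_a[-1], dec_a[-1])
)

print(prefix_unchanged_decoder, prefix_unchanged_encoder, last_position_matches)
```

In this toy, the last position gets the same output with or without the mask, which matches the intuition that the newest token has nothing in the future to hide. The earlier positions do differ, though, so the mask still shapes how the prompt itself is represented.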
How does GPT-3 spend its 175B parameters?
On Open Question 2, this paper may explain why: https://arxiv.org/pdf/1608.05859.pdf