4 Comments
Nov 1, 2023 · Liked by Robert Huben

I looked into Question 4 as well. Neither encoder-only nor decoder-only transformers use the extra "encoder-decoder attention" layer, so their structures are almost identical. The only difference is that the encoder uses ordinary self-attention, whereas the decoder uses masked (causal) self-attention. From my understanding, the mask matters only at training time, where it prevents the decoder from "cheating" by attending to future (already known) tokens. Since at inference time the decoder by definition has no future tokens to look at, I think the difference shows up only in training (and perhaps when "ingesting" the prompt). This would explain why encoder-only models such as BERT are considered better at classification (each position can attend to the whole sequence), whereas decoder-only models such as GPT are better at text generation.
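A minimal sketch of the point above (NumPy, not from the post; all names are illustrative): the causal mask is the only structural difference between encoder-style and decoder-style self-attention, and at generation time there are no future positions to mask anyway.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v, causal=False):
    """q, k, v: arrays of shape (seq_len, d). Returns the attention output."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (seq_len, seq_len)
    if causal:
        # Decoder-style: position i may only attend to positions <= i,
        # so strictly-future positions are set to -inf before the softmax.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v                    # (seq_len, d)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
enc_out = self_attention(x, x, x, causal=False)   # encoder-style (BERT-like)
dec_out = self_attention(x, x, x, causal=True)    # decoder-style (GPT-like)
```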

Aug 10, 2023 · Liked by Robert Huben

On Open Question 2, this paper possibly explains why: https://arxiv.org/pdf/1608.05859.pdf
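For context, the linked paper (Press & Wolf, "Using the Output Embedding to Improve Language Models") proposes tying the input embedding and the output (unembedding) projection so they share one weight matrix. A minimal sketch of that idea in PyTorch, with hypothetical sizes and the transformer layers omitted:

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size=100, d_model=32):   # hypothetical sizes
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)
        self.unembed.weight = self.embed.weight        # tie: one shared matrix

    def forward(self, token_ids):
        h = self.embed(token_ids)   # (..., d_model); a real model applies transformer layers here
        return self.unembed(h)      # (..., vocab_size) logits

logits = TiedLM()(torch.tensor([[1, 2, 3]]))
```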
