In a fully-connected MLP, parameter count grows with dim^2, which explodes if dim = d_model * num_tokens. For example, a sequence of 100 tokens, each represented 100-dimensionally, consists of 10,000 numbers, and a single fully-connected layer on this would require (10,000)^2 = 10^8 parameters. That's already roughly the size of the smallest GPT-2 (~124M parameters), for a single layer with a smaller d_model and num_tokens than GPT-2 uses!
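To sanity-check that arithmetic, here's a quick back-of-the-envelope in plain Python (the GPT-2 figures are the publicly reported model sizes):

```python
# 100 tokens x 100 dims = 10,000 numbers in the flattened input.
d_model = 100
num_tokens = 100
dim = d_model * num_tokens

# A single fully-connected layer mapping dim -> dim needs a dim x dim
# weight matrix (the bias adds only `dim` more, negligible here).
fc_params = dim ** 2
print(f"{fc_params:,}")  # 100,000,000 -- ~10^8 parameters for ONE layer

# For comparison: the smallest GPT-2 has ~124M parameters in total,
# with d_model = 768 and a 1024-token context window.
```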
An alternative: 1) let tokens communicate only when it's "important", while 2) reusing parameters. Hence attention heads, which determine which tokens are relevant to each other and pass information only where it's relevant. Then, optionally, tokens can have little a nonlinear evolution, as a treat: the MLP in the transformer layer. Sharing this MLP across all positions saves parameters, most importantly because its size is proportional to d_model^2, not (d_model * num_tokens)^2.
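Here's a minimal sketch of such a layer in PyTorch, assuming the usual choices (4 heads, 4x MLP expansion) and omitting residual connections and layer norms for brevity:

```python
import torch.nn as nn

d_model, num_heads = 100, 4  # matching the toy example above

# Attention: moves information between tokens.
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# MLP: applied independently at every position, parameters shared.
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)

def count(m):
    return sum(p.numel() for p in m.parameters())

# Both pieces scale with d_model^2, not (d_model * num_tokens)^2:
print(count(attn), count(mlp))  # ~40k and ~81k params, vs. 100M above
```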
Other desiderata:
This architecture isn’t too compute-intensive.
During training on next-token prediction, you can parallelize and reuse computation, since a single forward pass over the sequence yields a prediction at every position. For a typical 2048-token context window, that's up to a ~2000x reduction in sequential steps over an RNN! (A sketch of this follows below.)
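Here's roughly what that parallelism looks like (a toy attention-only layer, not a full transformer; the point is that one causally-masked forward pass scores every position at once):

```python
import torch
import torch.nn as nn

seq_len, d_model, vocab = 2048, 100, 50257  # toy sizes for illustration
x = torch.randn(1, seq_len, d_model)           # embedded input tokens
targets = torch.randint(vocab, (1, seq_len))   # the "next token" at each position

# Causal mask: position i may only attend to positions <= i
# (True = not allowed to attend, per nn.MultiheadAttention's bool-mask convention).
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
out, _ = attn(x, x, x, attn_mask=mask)         # all 2048 positions in one pass
logits = nn.Linear(d_model, vocab)(out)        # a prediction at every position

loss = nn.functional.cross_entropy(logits.view(-1, vocab), targets.view(-1))
# An RNN would need 2048 sequential steps to produce these same 2048 losses.
```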
So that’s why transformers are good: by breaking up one huge (num_tokens * d_model)-wide computation into 1) attention heads passing information between tokens and 2) a shared MLP processing that information token-wise, you solve the fundamental problem of cross-token communication in a parameter-efficient way.