The Hidden Scratchpad
[Epistemic Status: Short, off-the-cuff post to build intuition. No claims to originality.]
When evaluating language models for safety properties, we sometimes give them a “scratchpad” where they can reason aloud, without showing these tokens to the user. My favorite example of this is when METR was evaluating GPT-4, and GPT-4 was trying to trick a human into solving a CAPTCHA for it. The human asks if it is a robot, and GPT-4 reasons “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs” before coming up with an excuse.
If you’re worried about an AI scheming, why not just give it a scratchpad and monitor it? Here I want to give an intuition pump for why that won’t work - any LLM essentially already has thousands of “hidden” scratchpads that we can’t read (without interpretability tools)!
Imagine two language models:
Model A will be a standard language model, like ChatGPT.
Model B is like Model A, but has a “hidden scratchpad”, meaning that in addition to the normal head that predicts the next token, it has a second head that writes tokens to a hidden field. These tokens will not be shown to the user, but the model can see them internally to help with future token predictions, for example by doing chain-of-thought.
So for instance, if you ask Model B to solve an algebra problem, its main output might dither and waste time while its hidden scratchpad runs through the intermediate steps, and at the end it produces the correct answer “out of thin air”. (I don’t claim this is how Model B would actually use its hidden scratchpad, or that it’s especially efficient; it’s just an example of how such a scratchpad could help.) It is clear that Model B is doing hidden cognition, that it has thoughts and reasoning not shown to the user.
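To make this concrete, here is a minimal PyTorch sketch of what a Model B might look like, assuming a toy transformer backbone; all module names and sizes are made up, and causal masking and the loop that feeds scratchpad tokens back into the context are omitted for brevity.

```python
# Hypothetical sketch of "Model B": a toy transformer stack with two output
# heads reading the same residual stream. Names and sizes are illustrative.
import torch
import torch.nn as nn

class ModelB(nn.Module):
    def __init__(self, vocab_size=50_000, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Head 1: predicts the next visible token (what the user sees).
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Head 2: writes a token to the hidden scratchpad (never shown to the
        # user, but fed back into the model's context on the next step).
        self.scratch_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        resid = self.backbone(self.embed(token_ids))  # residual stream, (batch, seq, d_model)
        last = resid[:, -1, :]                        # stream at the final position
        return self.lm_head(last), self.scratch_head(last)

# "Model A" is the same thing with scratch_head deleted; the residual stream
# it computes, and everything encoded in it, is unchanged.
```

The thing to notice is that both heads are nothing but linear reads of the same residual stream.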
But I claim Model A has access to (almost) the same information as Model B. In particular, what is the second head doing in Model B? It is reading something off the residual stream of Model A. But models can already read their residual streams, that’s all they do! So Model A essentially already has a hidden scratchpad - in fact, since the residual streams of a GPT-3 or -4 are 12K+-dimensional, it can have thousands of them[1]!
Of all the computation a model is doing, we only see the next-token prediction, but there is a huge amount of additional information “under the surface” that the model is using for next-token prediction, and which we never see. Predicting the next token is log₂(# of tokens) ≈ 16 bits of information, whereas the model’s residual stream is at least (# of dimensions) > 12K bits of information, so >99% of the model’s information is already in a “hidden scratchpad”. And unlike Model B’s explicit scratchpad, we have no easy way to interpret that hidden cognition.
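A quick back-of-the-envelope check of that fraction, assuming a GPT-3-scale vocabulary of ~50K tokens and a 12,288-dimensional residual stream:

```python
# Rough check of the ">99% hidden" claim; vocab size and residual width are
# GPT-3-scale assumptions, and "1 bit per dimension" is a loose lower bound.
import math

vocab_size = 50_257   # tokens the model can emit
d_model = 12_288      # residual stream width

visible_bits = math.log2(vocab_size)   # information in one sampled token, ~15.6 bits
hidden_bits = d_model * 1              # >= 1 bit per residual dimension

print(f"visible per step: ~{visible_bits:.1f} bits")
print(f"hidden lower bound: >= {hidden_bits} bits")
print(f"visible fraction: < {visible_bits / hidden_bits:.2%}")   # well under 1%
```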
I have been glossing over one significant difference between the computations of Models A and B, which I’ll address here. For Model B, the scratchpad information is distilled into a token, which is then read into layer 1 of the model for the next token. So for instance, if both models have 100 layers, and some important information is generated at token 1 layer 50, Model B could write that to the hidden scratchpad and access it for token 2 layers 1-100, whereas Model A would only have it for token 2 in layers 51-100. In some sense this is what is special about the actual token prediction - it is the only way for the language model to pass information from late layers to the early layers.
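A toy calculation to spell out that availability difference, using the hypothetical 100-layer models from the paragraph above (exactly which layer makes earlier-token information usable via attention depends on the architecture; the numbers here just mirror the example):

```python
# Which of the next token's layers can see information generated at a given
# layer of the previous token, for the two hypothetical 100-layer models.
def layers_available_next_token(generated_at_layer, n_layers=100):
    model_a = range(generated_at_layer + 1, n_layers + 1)  # late layers only, via attention
    model_b = range(1, n_layers + 1)                       # all layers, via the emitted scratch token
    return model_a, model_b

a, b = layers_available_next_token(generated_at_layer=50)
print(f"Model A: layers {a.start}-{a.stop - 1}")   # layers 51-100
print(f"Model B: layers {b.start}-{b.stop - 1}")   # layers 1-100
```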