[Epistemic Status: Short, off-the-cuff post to build intuition. No claims to originality.] When evaluating language models for safety properties, we sometimes give them a “scratchpad” where they can reason aloud, without showing these tokens to the user. My favorite example of this is when
The Hidden Scratchpad
The Hidden Scratchpad
The Hidden Scratchpad
[Epistemic Status: Short, off-the-cuff post to build intuition. No claims to originality.] When evaluating language models for safety properties, we sometimes give them a “scratchpad” where they can reason aloud, without showing these tokens to the user. My favorite example of this is when