Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
aizi.substack.com
Abstract A sparse autoencoder is a neural network architecture that has recently gained popularity as a technique to find interpretable features in language models (Cunningham et al, Anthropic’s Bricken et al). We train a sparse autoencoder on OthelloGPT
Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
Research Report: Sparse Autoencoders find…
Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
Abstract A sparse autoencoder is a neural network architecture that has recently gained popularity as a technique to find interpretable features in language models (Cunningham et al, Anthropic’s Bricken et al). We train a sparse autoencoder on OthelloGPT