Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
Abstract
A sparse autoencoder is a neural network architecture that has recently gained popularity as a technique to find interpretable features in language models (Cunningham et al, Anthropic’s Bricken et al). We train a sparse autoencoder on OthelloGPT, a language model trained on transcripts of the board game Othello, which has been shown to contain a linear representation of the board state, findable by supervised probes. The sparse autoencoder finds 9 features which serve as high-accuracy classifiers of the board state, out of 180 findable with supervised probes (and 192 possible piece/position combinations). Across random seeds, the autoencoder repeatedly finds “simpler” features concentrated on the center of the board and the corners. This demonstrates that current techniques for sparse autoencoders may fail to find a large majority of the interesting, interpretable features in a language model.
Introduction
There has been a recent flurry of research activity around Sparse Autoencoders for Dictionary Learning, a new approach to finding interpretable features in language models and potentially “solving superposition” (Sharkey et al, Anthropic’s Bricken et al, Cunningham et al.). But while this technique can find features which are interpretable, it is not yet clear if sparse autoencoders can find particular features of interest (e.g., features relevant to reducing AI risk).
This research report seeks to answer the question of whether sparse autoencoders can find a set of a-priori existing, interesting, and interpretable features in the OthelloGPT language model. OthelloGPT, as the name suggests, is a language model trained on transcripts of the board game Othello to predict legal moves, but was found to also linearly encode the current board state (Nanda, Hazineh et al). That is, for each of the 64 board positions, there were “board-state features” (linear mappings from the residual stream to \R^3) that classify the state at that position between [is empty] vs [has active-player’s piece] vs [has enemy’s piece], and these board-state features can be found by the supervised training of a linear probe. These board-state features are an exciting testbed for sparse autoencoders because they represent a set of “called-shot” features we hope to find, and which are extremely interpretable and correspond to natural human thinking [1]. If sparse autoencoders can find these features, this is some evidence that they will find relevant and important features in language models. Conversely, if sparse autoencoders can’t find these features, that indicates a limitation of the method, and provides a test case where we can adjust our training methods until we can find them.
Overview
Here we:
Train an OthelloGPT model from scratch
Train a linear probe to classify the board states (replicating Hazineh et al) from an intermediate layer of OthelloGPT.
Train a sparse autoencoder on the same layer of OthelloGPT
Assess whether the features found by the sparse autoencoder include the linear encoding of the current board state that the linear probe is able to find.
Retrain the sparse autoencoder with different random seeds, and analyze which features are found.
Methods
Training OthelloGPT
We first trained an OthelloGPT model from scratch, following the approach of Li et al. Our model is a 25M parameter, 8-layer, decoder-only transformer, with residual stream dimension d_model=512 (identical to Li et al’s model). It is trained to do next-token-prediction of random transcripts of Othello games, with each possible move being encoded as a separate token, resulting in a vocabulary size of 66 (64 from the positions on the boards, plus 2 special tokens). The model was trained on a corpus of 640K games for 2 epochs, using the Adam optimizer with learning rate 1e-3.
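As a rough sanity check on the 25M figure, here is a back-of-the-envelope count. It assumes a standard GPT block (attention projections totalling 4·d_model² weights, plus an MLP of hidden width 4·d_model totalling 8·d_model² weights); that layout is an assumption of this sketch, not something stated above, and biases and layer norms are ignored.

```latex
N_{\text{non-embed}} \approx 12 \, n_{\text{layer}} \, d_{\text{model}}^{2}
  = 12 \cdot 8 \cdot 512^{2}
  = 25{,}165{,}824 \approx 25\text{M}
```

Under the same assumption, the token embedding is comparatively tiny: 66 × 512 = 33,792 parameters.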
The trained model had a 5% error rate in predicting next legal moves. This is far higher than Li et al’s 0.01%, which I believe is due to my shorter training run on a smaller dataset [2]. Despite this relatively high error rate, the model has been trained to a point where it exhibits the linearly-encoded board state described by Hazineh et al, which we will show in the next section.
Training Linear Probes
We next train linear probes on the model’s residual stream to classify the contents of individual board positions. This serves two purposes: first, it confirms that our OthelloGPT model linearly encodes the board state; second, it provides a baseline for the classification accuracy we can expect from any sparse autoencoder features.
As in Nanda and Hazineh et al, we found that we could train higher accuracy probes if we group positions into “empty/own/enemy” rather than “empty/black/white”. Following Nanda’s recommendation, we trained our probes on the residual stream of the model just after the MLP sublayer of layer 6. Each probe is a linear classifier from the residual stream (\R^512) to the three classes (\R^3), trained to minimize cross-entropy between the true labels of the board state and the classifier’s predictions. We train one probe for each of the 64 board positions, resulting in 3*64 = 192 directions in activation space [3]. As in Nanda’s work, we found that our classifiers had greater accuracy if we restricted them to the “middle” turns of each Othello game, in our case turns [4, 56). The probes were trained on 100K labelled games, for 1 epoch, using the Adam optimizer with learning rate 1e-3.
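To make the probe setup concrete, here is a minimal, dependency-free sketch of the relabelling and the per-example loss. The function names, the '.'/'B'/'W' board encoding, and the choice to pass the player to move explicitly are assumptions of this illustration; a real probe would be trained with a tensor library rather than plain Python lists.

```python
import math

EMPTY, OWN, ENEMY = 0, 1, 2  # class indices (naming is ours, for illustration)


def relabel(square, player_to_move):
    """Map a square's colour ('.', 'B', 'W') to empty/own/enemy.

    Which player counts as "own" on a given turn is an assumption of this
    sketch, so the caller passes `player_to_move` ('B' or 'W') explicitly.
    """
    if square == ".":
        return EMPTY
    return OWN if square == player_to_move else ENEMY


def probe_logits(weights, resid):
    """Linear probe: one dot product with the residual vector per class."""
    return [sum(w * x for w, x in zip(row, resid)) for row in weights]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]
```

Here `weights` would have shape 3 x 512 for one board position, so 64 such probes give the 192 directions discussed in the text.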
The resulting probes predict board positions with an error rate of 10%. While this is much larger than Hazineh et al’s 0.5% error rate, it is far better than chance, and indicates that there is linear structure to find. We also measure classification accuracy with AUROC, since this allows us to compare probe and feature directions as classifiers. In particular, for each position, for classes A/B/C with scores a/b/c, we use the “rectified directions” a-0.5(b+c) as a score for class A vs (B or C). We find that all of the 192 rectified probe directions have an AUROC greater than .9, with the exception of the 12 features corresponding to the central 4 tiles (which begin the game filled, and therefore might be handled differently by the language model). We will therefore use .9 as the (semi-arbitrary) threshold for a “high accuracy” classifier.
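The rectified score and the AUROC measurement can be sketched as follows. This is an illustrative, quadratic-time implementation using the pairwise definition of AUROC (the probability that a positive example outscores a negative one, with ties counted as one half); the function names are ours, and a library routine would normally be used instead.

```python
def rectified_score(scores, target):
    """One-vs-rest score for class `target` out of three: a - 0.5 * (b + c)."""
    others = [s for i, s in enumerate(scores) if i != target]
    return scores[target] - 0.5 * sum(others)


def auroc(scores, labels):
    """AUROC as P(positive score > negative score), ties counting one half.

    Pairwise O(n_pos * n_neg) version, fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because AUROC depends only on the ordering of the scores, the same function applies unchanged to a probe's rectified score or to a single autoencoder feature's activation, which is what makes the two comparable.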
Training The Sparse Autoencoder
Our sparse autoencoder architecture is based on that in Cunningham et al, consisting of a single hidden layer with ReLU activations, tied encoder-decoder weights and a bias on the encoder but not decoder layer. As with the probes, we trained on layer 6 of the GPT model, and turns [4, 56). We used a feature ratio R=2 (1024 features for a 512-dimensional model), and a sparsity coefficient α=7.7e-2. This sparsity coefficient was chosen after a hyperparameter sweep in order to minimize the sum of unexplained variance and average number of features active. The autoencoder was trained on a corpus of 100K games for 4 epochs, using the Adam optimizer with learning rate 1e-3.
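The architecture described above can be sketched in a few lines of plain Python. This is an illustration, not the training code: the exact normalization of the reconstruction term (mean versus sum of squared errors) is an assumption, as are the function names, and a real implementation would use a tensor library with batching.

```python
def sae_forward(x, dictionary, enc_bias):
    """Tied-weight sparse autoencoder applied to one activation vector.

    dictionary: list of feature directions (n_features x d_model); the same
                matrix encodes and, transposed, decodes.
    enc_bias:   one bias per feature, on the encoder only.
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(dictionary, enc_bias)]
    feats = [max(0.0, p) for p in pre]  # ReLU
    # Decode with the transposed dictionary; there is no decoder bias.
    recon = [sum(f * row[j] for f, row in zip(feats, dictionary))
             for j in range(len(x))]
    return feats, recon


def sae_loss(x, feats, recon, alpha):
    """Reconstruction error plus alpha times the L1 norm of the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in feats)
    return mse + alpha * l1
```

With R=2 and d_model=512, `dictionary` would hold 1024 rows of length 512, and `alpha` plays the role of the sparsity coefficient α.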
The resulting autoencoder had an average of 12% features active, 17% unexplained variance, and 0.2% dead features on the test set.
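One plausible way to compute these three summary statistics is sketched below. The precise definitions are assumptions of this sketch: "features active" is taken as the mean fraction of features with positive activation per input, "unexplained variance" as squared reconstruction error divided by total variance around the mean, and a "dead" feature as one that never activates on the evaluated inputs.

```python
def sae_metrics(inputs, feats, recons):
    """Return (mean fraction of features active, unexplained variance, dead fraction).

    inputs, recons: lists of d_model-length vectors.
    feats:          list of feature-activation vectors, one per input.
    """
    n, n_feat, d = len(inputs), len(feats[0]), len(inputs[0])
    frac_active = sum(sum(f > 0 for f in fv) / n_feat for fv in feats) / n
    mean = [sum(x[j] for x in inputs) / n for j in range(d)]
    resid = sum((x[j] - r[j]) ** 2
                for x, r in zip(inputs, recons) for j in range(d))
    total = sum((x[j] - mean[j]) ** 2 for x in inputs for j in range(d))
    dead = sum(all(fv[k] <= 0 for fv in feats) for k in range(n_feat)) / n_feat
    return frac_active, resid / total, dead
```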
Results
SAE Features as Current-Board Classifiers
For each of the 1024 sparse autoencoder features, we can measure whether it correctly classifies a board position as empty/own/enemy. We find that there are several features which serve as highly accurate classifiers for whether a tile is empty.
Visual inspection of the boards confirms that Feature 395 correctly classifies if position 43 is empty or filled:
The sparse autoencoder found 9 features which act as classifiers with AUROC>.9, all for assessing when the tile is empty vs (own+enemy). The best non-empty classifier is Feature 525, classifying Position 7 with an AUROC of .86:
Here are the top- and random-activating examples for this feature:
It should be noted that both of these classification tasks are computationally simpler than the other classification tasks: checking if a tile is empty is just querying the context window for the corresponding token, and since corners cannot be flipped, checking if a corner is an enemy piece is just querying the context window for that token an odd number of turns ago. (Though that’s not what the classifiers are doing, at least directly, since they only have access to the residual stream and not the tokens themselves.)
The feature best at classifying a non-corner, non-empty token is feature 688, which has an AUROC of .828:
Overall, the sparse autoencoder has found some classifier features, but the vast majority of features do not classify the current board state, and the majority of board states are not well-classified by features. The features that are good classifiers correspond to “easier” classification tasks, ones that do not engage with the complexities of pieces flipping.
Which Features are Learned?
Knowing that only some classifiers are found by the sparse autoencoder, we should ask:
Which classifiers?
Are these directions “easier to find”, or would the autoencoder find other ones if retrained?
To test this, I trained the autoencoder 10 times, with different random seeds, and investigated which features were found. The only differences between these autoencoder training runs were: the initialization of the autoencoder weights, and the ordering of the training data (though the overall training set was the same).
I then checked if each autoencoder had a feature which acts as a classifier for a position with AUROC>.9. This is the result:
This indicates that the inner-ring features are in some way easier for the autoencoder to learn, either due to the dataset used or the way OthelloGPT represents them. It seems likely that these are the most prominent features to learn, since these moves are playable from the beginning of the game and have important effects on whether other moves are playable. The lack of classifiers for the central tiles is explained by the difficulty of classifying these tiles even with linear probes (recall that the probes there had AUROC<.9). The corner classifiers also seem to be easier to learn, and are the only features with AUROC>.9 which classify enemy vs own pieces.
Overall, we can conclude that the autoencoder has a preference for learning some features over others. These features might be more “prominent” in the residual stream, or in the dataset, or in some other way, and I have not tested these hypotheses yet.
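The per-seed check described in this section can be sketched as a simple tally. The data layout (one dict per training run, mapping a position name to the list of AUROCs of every feature on that position) and the function names are assumptions of this illustration.

```python
def positions_found(feature_aurocs, threshold=0.9):
    """Positions for which at least one feature exceeds the AUROC threshold.

    feature_aurocs: dict mapping position -> list of per-feature AUROCs.
    """
    return {pos for pos, aurocs in feature_aurocs.items()
            if max(aurocs) > threshold}


def count_across_seeds(runs, threshold=0.9):
    """For each position, the number of training runs (seeds) that found it."""
    counts = {}
    for feature_aurocs in runs:
        for pos in positions_found(feature_aurocs, threshold):
            counts[pos] = counts.get(pos, 0) + 1
    return counts
```

A position that appears in the tally for all 10 runs is one the autoencoder finds reliably; a position missing from the tally was never found at this threshold.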
SAE Features as Legal-Move Classifiers
Since the model is trained to predict legal moves, one might expect it to learn features for whether a move is legal. And unlike in the autoencoders-on-text case, there are fewer tokens than autoencoder features, so it would be easy to allocate 60/1024 features for predicting tokens, if that is useful to the sparse autoencoder.
We find the autoencoder often finds features that classify whether a move is legal. However, this is confounded by the overlap of “move is legal” and “tile is empty” (the former is the latter plus some extra conditions). There were several features that are decent legal-move classifiers, but when you look at their activation distributions it’s clear they are actually empty-tile classifiers that score well on legal-move classification because P(legal | empty) was high:
Some density plots looked like Feature 722/Position 26, showing clear confounding, and others looked like this, where the distributions are nearly identical:
Finally, we can compare the AUROCs of the probes to the SAE features (both as content predictors and legal move predictors):
Cosine Similarities
Finally, we can directly compare the directions found by probes and the autoencoders. In particular, for each of the 192 rectified probe directions, we computed its maximum cosine similarity across the 1024 autoencoder directions. This is the result:
We can conclude that the autoencoder directions are relatively close to the probe directions in activation space, but do not perfectly match. We shouldn’t be worried about this lack of perfect matching since a correlation of .6 is enough both in theory and in practice for the autoencoder features to be ~perfect classifiers.
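The comparison above amounts to a maximum over pairwise cosine similarities. A minimal sketch follows, with hypothetical function names and plain lists standing in for the 192 x 512 probe matrix and the 1024 x 512 autoencoder dictionary:

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_cosine_per_probe(probe_dirs, feature_dirs):
    """For each probe direction, its best cosine match among feature directions."""
    return [max(cosine(p, f) for f in feature_dirs) for p in probe_dirs]
```

Since cosine similarity ignores vector length, the result does not depend on whether the autoencoder's dictionary rows are normalized.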
One Really Cool Case Study
As I was investigating high-correlation features that were bad classifiers (by AUROC), I found several features like this one, which shows clear bimodality that isn’t aligned with empty/own/enemy pieces:
For this feature, when I looked at the top-activating boards, I found them to be highly interpretable. See if you can spot the pattern:
It looks like this feature activates when positions F2-F6 are all white! And what’s incredible is the “partial activation” in the bottom row: the feature activates at 12 when positions E2-E6 are all white! That seems like an extremely understandable “near-miss” for the feature, which is astonishing to me.
We should here acknowledge that Othello and OthelloGPT can be harder to interpret than English text. Whereas humans will find patterns and themes in text for fun, I found my brain was very much not wired for analyzing Othello boards, and therefore in most cases I could only test feature interpretability by programmatically testing individual hypotheses. Therefore, I have not been able to analyze the vast majority of OthelloGPT features, and they may have interpretable meanings like the above that simply do not show up on my metrics. If anyone wants to do a lot of case studies of individual features, I’m happy to share the tools I have.
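As an illustration of what programmatically testing a hypothesis can look like, here is a hedged sketch for the "F2-F6 are all white" pattern. The board representation (a dict from square names like 'F2' to '.', 'B' or 'W') and all function names are assumptions of this sketch, not the tools mentioned above.

```python
def squares_in_column(col, first_row, last_row):
    """Square names such as 'F2'..'F6' (column letter, then row number)."""
    return [f"{col}{r}" for r in range(first_row, last_row + 1)]


def all_colour(board, squares, colour):
    """True if every listed square holds `colour`; missing squares count as empty."""
    return all(board.get(sq, ".") == colour for sq in squares)


def _mean(xs):
    """Mean of a list, or NaN if the list is empty."""
    return sum(xs) / len(xs) if xs else float("nan")


def mean_activation_by_hypothesis(boards, activations, hypothesis):
    """Mean feature activation where the hypothesis holds vs. where it fails."""
    hit = [a for b, a in zip(boards, activations) if hypothesis(b)]
    miss = [a for b, a in zip(boards, activations) if not hypothesis(b)]
    return _mean(hit), _mean(miss)
```

A call such as `mean_activation_by_hypothesis(boards, acts, lambda b: all_colour(b, squares_in_column("F", 2, 6), "W"))` would then compare the feature's activation on boards that match the pattern against boards that do not; swapping in column "E" tests the near-miss case.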
Conclusion
We have shown that out of a set of 180 a-priori interesting and interpretable features in OthelloGPT, sparse autoencoders find only 9 of them. While this confirms the promise of sparse autoencoders in Cunningham et al and Bricken et al, that they find interpretable features, it also underlines the limitations of the approach: this is the first work demonstrating that sparse autoencoders can fail to find a concrete set of interesting, interpretable features, and suggests that currently-existing sparse autoencoders cannot “fully” interpret a language model. We hope that these results will inspire more work to improve the architecture or training methods of sparse autoencoders to address this shortcoming. Finally, we hope we have shown that OthelloGPT, with its linear world state, is useful for measuring if unsupervised techniques find important interpretable directions, and can be a fruitful place to test interpretability techniques.
Future work:
Redo this analysis on the fully-trained OthelloGPT created by Li et al.
Adjust the autoencoder architecture until it is able to find more of the features we hope to see. Possible architectural changes include:
Untied encoder/decoder weights as in Anthropic’s Bricken et al.
Update the architecture using the tricks described in Anthropic’s updates.
Update the loss function to include an orthogonality penalization term, as described by Till.
(Low-priority) Redo this analysis on the MLP layer of the transformer (as Bricken et al do) instead of the residual stream. (The MLP layers may not linearly represent the board state, so first we’d want to verify this with a new set of probes.)
(Low-priority) Continue investigating individual autoencoder features.
[1] Though, notably, this was not the first way people expected OthelloGPT to encode the board state. Since humans conceptualize Othello as a game between two players, the original authors tried to find linear features representing whether a square was empty/black/white. The resulting classifiers were mediocre, with an error rate of 20% or more. However, Neel Nanda found that since the language model “plays both sides”, it linearly encodes the board as empty/own/enemy pieces (i.e., grouping “black on even turns” with “white on odd turns” instead of “black on odd turns”), and Hazineh et al find probes trained to do this classification can have an error rate as low as 0.5%.
[2] I plotted OthelloGPT’s error rate across its training, and it followed a straightforward scaling law that would have reached 0.01% error rate with a few more OOMs of training data. I opted not to train to that level, but I plan to redo my analyses on Li et al’s OthelloGPT when I can get my hands on it.
[3] Though we should note that the relative difference between two directions is more important than the directions themselves. Since the predictions go through a softmax, the classifier with directions (A,B,C) produces the same results as with directions (A+X, B+X, C+X) for any X. The invariant properties learned by the classifier are the differences B-A and C-A, or the “rectified directions” like A-0.5(B+C).
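The invariance claimed in this footnote can be written out explicitly. In the sketch below, r denotes a residual-stream vector and z = (A·r, B·r, C·r) the probe's three logits; these symbols are introduced here for illustration. Adding X to every direction adds the same scalar t = X·r to every logit, which cancels in the softmax, and the rectified direction is unchanged and expressible through the differences:

```latex
\mathrm{softmax}(z + t\mathbf{1})_i
  = \frac{e^{z_i + t}}{\sum_j e^{z_j + t}}
  = \frac{e^{t}\, e^{z_i}}{e^{t} \sum_j e^{z_j}}
  = \mathrm{softmax}(z)_i,
  \qquad t = X \cdot r

(A + X) - \tfrac{1}{2}\big((B + X) + (C + X)\big)
  = A - \tfrac{1}{2}(B + C)
  = -\tfrac{1}{2}\big[(B - A) + (C - A)\big]
```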