[Research Update] Sparse Autoencoder features are bimodal
aizi.substack.com
Overview: The sparse autoencoders project is a mechanistic interpretability effort to algorithmically find semantically meaningful “features” in a language model. A recent update suggests that the features learned by this approach fall into two groups according to their maximum cosine similarity (MCS) score against a larger feature dictionary.
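To make the MCS score concrete, here is a minimal sketch of how it could be computed. It assumes each dictionary is stored as a NumPy array with one feature direction per row; the function name `max_cosine_similarity` and the random toy data are hypothetical and are not taken from the project's code.

```python
import numpy as np


def max_cosine_similarity(small_dict: np.ndarray, large_dict: np.ndarray) -> np.ndarray:
    """Return, for each row of small_dict, its largest cosine similarity
    to any row of large_dict.

    Assumption: rows are feature directions and none of them is all zeros.
    """
    # Normalize every feature to unit length so dot products are cosines.
    small_unit = small_dict / np.linalg.norm(small_dict, axis=1, keepdims=True)
    large_unit = large_dict / np.linalg.norm(large_dict, axis=1, keepdims=True)
    # cos[i, j] is the cosine similarity between small feature i and large feature j.
    cos = small_unit @ large_unit.T
    # The MCS score of a feature is its best match in the larger dictionary.
    return cos.max(axis=1)


# Toy data only: a "large" dictionary of 8 random features in 16 dimensions.
rng = np.random.default_rng(0)
large = rng.normal(size=(8, 16))
# The first small feature is a rescaled copy of a large feature;
# the second is an unrelated random direction.
small = np.vstack([3.0 * large[0], rng.normal(size=(1, 16))])

mcs = max_cosine_similarity(small, large)
```

In this toy setup the rescaled copy scores an MCS of essentially 1, while the unrelated direction scores lower. A histogram of such scores over a whole learned dictionary is the kind of plot in which a two-group split would show up.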