I think the relation between K-means and sparse dictionary learning (essentially, K-means is dictionary learning under the constraint that each code has exactly one nonzero entry equal to 1, i.e. an L_0 = 1 constraint) is already well-known in the sparse coding literature? For example, see this wiki article on K-SVD (a sparse dictionary learning algorithm), which first reviews this connection before getting into the nuances of K-SVD.
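(To spell the equivalence out in my own notation, not the post's: dictionary learning with this constraint solves

$$\min_{D,\,\{x_i\}} \sum_i \lVert y_i - D x_i \rVert_2^2 \quad \text{s.t.} \quad \lVert x_i \rVert_0 = 1,\ x_i \in \{0,1\}^K,$$

and once each code x_i is forced to be a one-hot indicator, D x_i just selects a single column of D, so this is exactly the K-means objective with the K columns of D as the centroids.)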
Were the SAEs for this comparison trained with multiple passes through the data, or just one pass/epoch? If K-means got multiple passes through the data but the SAEs only got one, the comparison seems unfair.
Nice post!
I agree that an important goal of MELBO is to elicit “complex many-token behaviors” (this is a current priority of mine).
You may want to check out my recent results on eliciting password-locked capabilities in DeepSeek-Math-7B. Using my new training algorithm for finding MELBO vectors, it's possible to find a vector that increases MATH performance from 3% to 23% on the password-locked model.
The new algorithm is much more efficient than the sequential training procedure from the original post, to the point that I'm currently bottlenecked by inference (even when using vLLM). It also comes with a decent heuristic procedure for choosing R. In my experience, given a dataset of n ≥ 10 prompts, it's pretty good at finding a bunch of vectors corresponding to the most "salient" concept in the dataset (e.g. "jailbreak vectors" when trained on harmful instructions, or "capability-increasing vectors" when trained on password-locked models).
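For anyone who hasn't read the original post, here's a rough sketch of the kind of objective involved (a toy illustration, not the new algorithm; the model, layer indices, value of R, and the projected-gradient step for enforcing the norm constraint are all placeholder choices):

```python
# Toy sketch of a MELBO-style steering-vector objective (not the new algorithm).
# Model, layer indices, R, and the projection trick are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # small stand-in; swap in the model of interest
SOURCE_LAYER = 4         # layer whose output the steering vector is added to
TARGET_LAYER = 10        # layer whose activations we want to push around
R = 8.0                  # norm constraint on the steering vector (the "R" above)
STEPS, LR = 200, 1e-2

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

prompts = ["Prompt 1 ...", "Prompt 2 ..."]   # in practice, n >= 10 prompts
batch = tok(prompts, return_tensors="pt", padding=True)

# Steering vector, initialized randomly on the sphere of radius R.
theta = torch.randn(model.config.hidden_size)
theta = (R * theta / theta.norm()).requires_grad_(True)

def steer_hook(module, inputs, output):
    # Add theta to the source layer's output hidden states (broadcast over batch/positions).
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + theta
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

def target_acts(use_steering: bool) -> torch.Tensor:
    # Module path below is GPT-2-specific; adjust for other architectures.
    handle = model.transformer.h[SOURCE_LAYER].register_forward_hook(steer_hook) if use_steering else None
    out = model(**batch, output_hidden_states=True)
    if handle is not None:
        handle.remove()
    return out.hidden_states[TARGET_LAYER]

with torch.no_grad():
    baseline = target_acts(use_steering=False)   # unsteered target-layer activations

opt = torch.optim.Adam([theta], lr=LR)
for step in range(STEPS):
    opt.zero_grad()
    steered = target_acts(use_steering=True)
    # Maximize how far the steered target-layer activations move from baseline.
    loss = -((steered - baseline) ** 2).sum(dim=-1).mean()
    loss.backward()
    opt.step()
    with torch.no_grad():
        theta.mul_(R / theta.norm())   # project back onto the sphere ||theta|| = R

# The sequential procedure from the original post would repeat this, penalizing
# similarity to previously found vectors, to collect a diverse set of vectors.
```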