Claude 3.7's annoying personality is the first example of accidentally misaligned AI making my life worse. Claude 3.5/3.6 was renowned for its superior personality that made it more pleasant to interact with than ChatGPT.
3.7 has an annoying tendency to do what it thinks you should do rather than follow your instructions. I've run into this frequently in two coding scenarios:
In Cursor, I ask it to implement some function in a particular file. Even when explicitly instructed not to, it guesses what I want to do next and changes other parts of the code as well.
I’m trying to fix part of my code and I ask it to diagnose a problem and suggest debugging steps. Even when explicitly instructed not to, it will suggest alternative approaches that circumvent the issue, rather than trying to fix the current approach.
I call this misalignment, rather than a capabilities failure, because it seems like a step back from previous models, and I suspect it is a side effect of training the model to be good at autonomous coding tasks, which may be overriding its compliance with instructions.
DeepMind says boo SAEs, now Anthropic says yay SAEs![1]
Reading this paper pushed me a fair amount in the yay direction. We may still be at the unsatisfying level where we can only say "this cluster of features seems to roughly correlate with this type of thing" and "the interaction between this cluster and that cluster seems to mostly explain this loose group of behaviors". But it looks like we're actually pointing at real things in the model, and are therefore beginning to decompose the computation of LLMs in meaningful ways. The Addition Case Study is seriously cool and feels like a true insight into the model's internal algorithms.
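For readers who haven't seen one, here is a minimal sketch of the kind of vanilla sparse autoencoder this debate is about: a wide ReLU encoder plus a linear decoder, trained to reconstruct model activations under an L1 sparsity penalty so that each input activates only a few "features". The dimensions, hyperparameters, and random stand-in activations below are placeholders of my own, not values or architecture details from the paper.

```python
# Minimal sparse autoencoder (SAE) sketch. Placeholder dimensions and data,
# not the setup used in the paper.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> feature space
        self.decoder = nn.Linear(d_features, d_model)  # feature space -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # non-negative, hopefully sparse, feature activations
        x_hat = self.decoder(f)          # reconstruction of the original activations
        return x_hat, f

d_model, d_features = 512, 4096          # overcomplete feature basis (placeholder sizes)
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                          # sparsity pressure (placeholder value)

acts = torch.randn(1024, d_model)        # stand-in for residual-stream activations
for _ in range(100):
    x_hat, f = sae(acts)
    # reconstruction error + L1 penalty on feature activations
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the L1 term is that each feature should fire rarely, so individual features stand a chance of corresponding to interpretable properties of the input rather than dense superpositions.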
Maybe we will further decompose these explanations until we get down to satisfying low-level descriptions like "this mathematical object is computed by this function and is used in this algorithm". Even if we could interpret circuits at that level of abstraction, humans probably couldn't hold all the relevant parts of a single forward pass in their heads at once. But AIs could, or maybe that won't be required for useful applications.
The prominent error terms and simplifying assumptions are worrying, but maybe throwing enough compute and hill-climbing research at the problem will eventually shrink them to acceptable sizes. It’s notable that this paper contains very few novel conceptual ideas and is mostly just a triumph of engineering schlep, massive compute and painstaking manual analysis.
This is obviously a straw man of both sides. They seem to be thinking about it from pretty different perspectives: DeepMind is roughly judging SAEs by their immediate usefulness in applications, while Anthropic is looking at them as a stepping stone towards ambitious moonshot interp.