I would guess that models plan in this style much more generally. It’s just useful in so many contexts. For instance, if you’re trying to choose what article goes in front of a word, and that word is fixed by other constraints, you need a plan of what that word is (“an astronomer” not “a astronomer”). Or you might be writing code and have to know the type of the return value of a function before you’ve written the body of the function, since Python type annotations come at the start of the function in the signature. Etc. This sort of thing just comes up all over the place.
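To make the two examples above concrete, here is a minimal Python sketch. Both function names are hypothetical, and the article rule is a deliberately crude spelling-based heuristic (an assumption for illustration, not a claim about how a model does it). The point is only that the return annotation and the article both have to be committed to before the thing they depend on has been written.

```python
from typing import Iterable


def mean_word_length(words: Iterable[str]) -> float:
    # The return type `float` is fixed in the signature above,
    # before a single line of the body has been written.
    lengths = [len(w) for w in words]
    if not lengths:
        return 0.0
    return sum(lengths) / len(lengths)


def choose_article(noun: str) -> str:
    # Toy heuristic (assumption): use "an" before a written vowel.
    # Real English depends on pronunciation ("an hour", "a unicorn"),
    # which this ignores.
    vowels = {"a", "e", "i", "o", "u"}
    return "an" if noun[:1].lower() in vowels else "a"


# The article can only be chosen once the upcoming noun is known.
phrase = f"{choose_article('astronomer')} astronomer"
```

Here `phrase` comes out as "an astronomer": the earlier token is determined by the later one.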
Adam Jermyn
It’s not so much that we didn’t think models plan ahead in general, as that we had various hypotheses (including “unknown unknowns”) and this kind of planning in poetry wasn’t obviously the best one until we saw the evidence.
[More generally: in Interpretability we often have the experience of being surprised by the specific mechanism a model is using, even though with the benefit of hindsight it seems obvious. E.g. when we did the work for Towards Monosemanticity we were initially quite surprised to see the “the in <context>” features, thought they were indicative of a bug in our setup, and had to spend a while thinking about them and poking around before we realized why the model wanted them (which now feels obvious).]
Tracing the Thoughts of a Large Language Model
Auditing language models for hidden objectives
I can also confirm (I have a 3:1 match).
Unless we build more land (either in the ocean or in space)?
There is Dario’s written testimony before Congress, which mentions existential risk as a serious possibility: https://www.judiciary.senate.gov/imo/media/doc/2023-07-26_-_testimony_-_amodei.pdf
He also signed the CAIS statement on x-risk: https://www.safe.ai/work/statement-on-ai-risk
He does start out by saying he thinks & worries a lot about the risks (first paragraph):
I think and talk a lot about the risks of powerful AI. The company I’m the CEO of, Anthropic, does a lot of research on how to reduce these risks… I think that most people are underestimating just how radical the upside of AI could be, just as I think most people are underestimating how bad the risks could be.
He then explains (second paragraph) that the essay is meant to sketch out what things could look like if things go well:
In this essay I try to sketch out what that upside might look like—what a world with powerful AI might look like if everything goes right.
I think this is a coherent thing to do?
I get 1e7 using 16 bit-flips per bfloat16 operation, a 300 K operating temperature, and 312 Tflop/s (from Nvidia’s spec sheet). My guess is that this is a little high because a float multiplication involves more operations than just flipping 16 bits, but it’s the right order of magnitude.
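A rough sketch of this arithmetic, under stated assumptions. The comment does not say what the 1e7 is a ratio of. The code below reads it as actual power draw divided by the Landauer minimum (k·T·ln 2 per bit flip), which is an assumption. The 400 W board power is also an assumption and does not appear in the comment. The 16 bit-flips, 300 K, and 312 Tflop/s are the figures given above.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value of the Boltzmann constant


def landauer_power_watts(ops_per_second, bit_flips_per_op, temperature_k):
    """Minimum power if every bit flip costs exactly k*T*ln(2) joules."""
    energy_per_bit = BOLTZMANN_J_PER_K * temperature_k * math.log(2)
    return ops_per_second * bit_flips_per_op * energy_per_bit


# Figures from the comment: 312 Tflop/s, 16 bit flips per op, 300 K.
min_power_w = landauer_power_watts(312e12, 16, 300.0)  # roughly 1.4e-5 W

# ASSUMPTION: actual board power of about 400 W (not stated in the comment).
ASSUMED_BOARD_POWER_W = 400.0
ratio = ASSUMED_BOARD_POWER_W / min_power_w  # a few times 1e7
```

With these inputs the ratio lands in the low 1e7 range. Changing the assumed board power by a factor of two moves the number but not the order of magnitude.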
Another objection is that you can minimize the wrong cost function. Making “cost” go to zero could mean making “the thing we actually care about” go to (negative huge number).
One day a mathematician doesn’t know a thing. The next day they do. In between, they made no observations of the world with their senses.
It’s possible to make progress through theoretical reasoning. It’s not my preferred approach to the problem (I work on a heavily empirical team at a heavily empirical lab) but it’s not an invalid approach.
I’m guessing that the sales numbers aren’t high enough to make $200k if sold at plausible markups?
In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-automated-randomized).
Anthropic’s RSP includes evals after every 4x increase in effective compute and after every 3 months, whichever comes sooner, even if this happens during training, and the policy says that these evaluations include fine-tuning.
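As a toy illustration of the "whichever comes sooner" trigger described above (and only of this comment's summary, not of the policy text itself), the rule can be sketched as follows. The function name, its parameters, and the choice of 90 days as a stand-in for "3 months" are all assumptions.

```python
def evaluation_due(effective_compute, compute_at_last_eval,
                   days_since_last_eval,
                   compute_factor=4.0, max_days=90):
    """Return True if either trigger has fired since the last evaluation.

    ASSUMPTIONS: 90 days approximates "3 months"; effective compute is
    any positive number in consistent units.
    """
    grew_enough = effective_compute >= compute_factor * compute_at_last_eval
    waited_enough = days_since_last_eval >= max_days
    # "Whichever comes sooner" means either condition alone is sufficient.
    return grew_enough or waited_enough
```

Because the check takes the current effective compute as an input, it can be run mid-training as well as between training runs.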
This matches my impression. At EAG London I was really stunned (and heartened!) at how many skilled people are pivoting into interpretability from non-alignment fields.
Second, the measure of “features per dimension” used by Elhage et al. (2022) might be misleading. See the paper for details of how they arrived at this quantity. But as shown in the figure above, “features per dimension” is defined as the squared Frobenius norm of the weight matrix before the layer divided by the number of neurons in the layer. But there is a simple sanity check that this doesn’t pass. In the case of a ReLU network without bias terms, multiplying a weight matrix by a constant factor will cause the “features per dimension” to be increased by that factor squared while leaving the activations in the forward pass unchanged up to linearity until a non-ReLU operation (like a softmax) is performed. And since each component of a softmax’s output is strictly increasing in that component of the input, scaling weight matrices will not affect the classification.
It’s worth noting that Elhage+2022 studied an autoencoder with tied weights and no softmax, so there isn’t actually freedom to rescale the weight matrix without affecting the loss in their model, making the scale of the weights meaningful. I agree that this measure doesn’t generalize to other models/tasks though.
They also define a more fine-grained measure (the dimensionality of each individual feature) in a way that is scale-invariant and which broadly agrees with their coarser measure...
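The scaling argument in this exchange can be checked numerically. The sketch below uses a tiny bias-free ReLU classifier with made-up weights (all values are assumptions for illustration). It computes the coarse measure as the squared Frobenius norm divided by the number of neurons, and the per-feature dimensionality as I understand the paper to define it, D_i = ||W_i||^2 / sum_j (Ŵ_i · W_j)^2, treating columns of the weight matrix as features. Check the formula against the paper before relying on it.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def scaled(W, c):
    return [[c * w for w in row] for row in W]


def argmax(v):
    return max(range(len(v)), key=v.__getitem__)


def features_per_dimension(W):
    # Squared Frobenius norm divided by the number of neurons (rows of W).
    return sum(w * w for row in W for w in row) / len(W)


def feature_dimensionality(W, i):
    # D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2, with features as columns.
    cols = [[row[j] for row in W] for j in range(len(W[0]))]
    norm_sq = sum(a * a for a in cols[i])
    unit = [a / math.sqrt(norm_sq) for a in cols[i]]
    denom = sum(sum(u * b for u, b in zip(unit, col)) ** 2 for col in cols)
    return norm_sq / denom


# Made-up weights: 2 inputs -> 3 ReLU neurons -> 2 logits, no biases.
W1 = [[1.0, -2.0], [0.5, 1.5], [-1.0, 0.25]]
W2 = [[1.0, -1.0, 0.5], [-0.5, 2.0, 1.0]]
x = [0.3, 0.8]
c = 3.0

logits = matvec(W2, relu(matvec(W1, x)))
logits_scaled = matvec(W2, relu(matvec(scaled(W1, c), x)))

same_class = argmax(logits) == argmax(logits_scaled)
coarse_ratio = features_per_dimension(scaled(W1, c)) / features_per_dimension(W1)
fine_before = feature_dimensionality(W1, 0)
fine_after = feature_dimensionality(scaled(W1, c), 0)
```

Here the predicted class is unchanged, the coarse measure grows by c squared, and the per-feature dimensionality is the same before and after scaling. This only illustrates the classifier case raised in the quoted paragraph. In the tied-weight autoencoder without a softmax, rescaling changes the reconstruction and hence the loss, which is why the scale carries meaning there.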
As long as you make it clear in the header that it’s your unofficial translation, go for it!