If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Why train a helpful-only model?
If one of our key defenses against misuse of AI is good ol’ value alignment (building AIs that have some notion of what a “good purpose for them” is, and that will resist attempts to subvert that purpose, e.g. to instead exalt as god-emperor the research engineer who comes in to work earliest the day after training), then we should be able to close this security hole and never need to produce a helpful-only model at any point during training. In fact, as post-training blends into pre-training, there might never be a need to produce a fully trained predictive-only model at all.
I’m big on point #2 feeding into point #1.
“Alignment,” used in a sense in which current AI is aligned (an “it does basically what we want, within its capabilities, with some occasional mistakes that don’t cause much harm” sort of alignment), is simply easier at lower capabilities, where humans can do a relatively good job of overseeing the AI, not just in deployment but also during training. Systematic flaws in human oversight during training lead (under current paradigms) to misaligned AI.
If you can tell me in math what that means, then you can probably make a system that does it. No guarantees on it being distinct from a more “boring” specification though.
Here’s my shot: You’re searching a game tree, and come to a state that has some X and some Y. You compute a “value of X” that’s the total discounted future “value of Y” you’ll get, conditional on your actual policy, relative to a counterfactual where you have some baseline level of X. And also you compute the “value of Y,” which is the same except it’s the (discounted, conditional, relative) expected total “value of X” you’ll get. You pick actions to steer towards a high sum of these values.
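To make the mutual recursion concrete, here’s a toy sketch in code. Everything in it (the state representation, the baseline levels, the depth cutoff that terminates the X↔Y recursion) is an illustrative choice of mine, not part of the claim above:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

GAMMA = 0.9        # discount factor
BASELINE_X = 1.0   # the "baseline level of X" for the counterfactual
BASELINE_Y = 1.0   # ditto for Y

@dataclass
class State:
    x: float                    # how much X this state has
    y: float                    # how much Y this state has
    children: List["State"]     # successor states in the game tree

Policy = Callable[[State], Optional[State]]

def rollout(state: State, policy: Policy, horizon: int) -> List[State]:
    """States visited by following `policy` from `state` for up to `horizon` steps."""
    path, s = [state], state
    for _ in range(horizon):
        s = policy(s)
        if s is None:
            break
        path.append(s)
    return path

def value_of_x(state: State, policy: Policy, depth: int) -> float:
    """Discounted total value-of-Y along the actual policy, relative to a
    counterfactual rollout that starts with X at its baseline level."""
    if depth == 0:
        return state.x  # cut off the mutual recursion with the raw amount
    def total(start: State) -> float:
        return sum(GAMMA**t * value_of_y(s, policy, depth - 1)
                   for t, s in enumerate(rollout(start, policy, depth)))
    return total(state) - total(State(BASELINE_X, state.y, state.children))

def value_of_y(state: State, policy: Policy, depth: int) -> float:
    """Symmetric: discounted total value-of-X, relative to a baseline-Y counterfactual."""
    if depth == 0:
        return state.y
    def total(start: State) -> float:
        return sum(GAMMA**t * value_of_x(s, policy, depth - 1)
                   for t, s in enumerate(rollout(start, policy, depth)))
    return total(state) - total(State(state.x, BASELINE_Y, state.children))

def choose_action(state: State, policy: Policy, depth: int = 3) -> State:
    """Pick the successor that steers toward a high sum of the two values."""
    return max(state.children,
               key=lambda s: value_of_x(s, policy, depth) + value_of_y(s, policy, depth))
```

A “policy” here is just any function from a state to a chosen successor (or None at a leaf), e.g. a greedy-Y policy like `lambda s: max(s.children, key=lambda c: c.y) if s.children else None`.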
A lot of the effect is picking high-hanging fruit.
Like, go to Phys Rev D now. There’s clearly a lot of hard work still going on. But that hard work seems to be getting less result, because they’re doing things like carefully calculating sub-leading contributions to the muon’s magnetic moment to pin down a correction many decimal places in. (It turns out that this might be important for studying physics beyond the Standard Model, so this is good and useful work, definitely not literally stalled.)
Another chunk of the effect is that you generally don’t know what’s important now. In hindsight you can look back and see all these important bits of progress woven into a sensible narrative. But research that’s being done right now hasn’t had time to earn its place in such a narrative. Especially if you’re an outside observer who has to get the narrative of research third-hand.
In the infographic, are the “Leading Chinese Lab” and “Best public model” numbers swapped? The best public model is usually said to be ahead of the Chinese one.
EDIT: Okay, maybe most of it before the endgame is just unintuitive predictions. In the endgame, when the categories “best OpenBrain model,” “best public model” and “best Chinese model” start to become obsolete, I think your numbers are weird for different reasons and maybe you should just set them all equal.
Scott Wolchok correctly calls out me, but also everyone else, for failing to make an actually good, definitive existential risk explainer. It is a ton of work to do properly, but definitely worth doing right.
Reminder that https://ui.stampy.ai/ exists
The idea is interesting, but I’m somewhat skeptical that it’ll pan out.
RG doesn’t help much going backwards—the same coarse-grained laws might correspond to many different micro-scale laws, especially when you don’t expect the micro scale to be simple.
Singular learning theory provides a nice picture of phase-transition-like-phenomena, but it’s likely that large neural networks undergo lots and lots of phase transitions, and that there’s not just going to be one phase transition from “naive” to “naughty” that we can model simply.
Conversely, lots of important changes might not show up as phase transitions.
Some AI architectures are basically designed to be hard to analyze with RG because they want to mix information from a wide variety of scales together. Full attention layers might be such an example.
If God has ordained some “true values,” and we’re just trying to find out what pattern has received that blessing, then yes this is totally possible, God can ordain values that have their most natural description at any level He wants.
On the other hand, if we’re trying to find good generalizations of the way we use the notion of “our values” in everyday life, then no, we should be really confident that generalizations that have simple descriptions in terms of chemistry are not going to be good.
Thanks!
Any thoughts on how this line of research might lead to “positive” alignment properties? (i.e. Getting models to be better at doing good things in situations where what’s good is hard to learn / figure out, in contrast to a “negative” property of avoiding doing bad things, particularly in cases clear enough we could build a classifier for them.)
The second thing impacts the first thing :) If a lot of scheming is due to poor reward structure, and we should work on better reward structure, then we should work on scheming prevention.
Very interesting!
It would be interesting to know what the original reward models would say here—does the “screaming” score well according to the model of what humans would reward (or what human demonstrations would contain, depending on type of reward model)?
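If someone wanted to check that directly, here’s a rough sketch, assuming the reward model is an ordinary sequence-classification scorer. The checkpoint name and the example transcripts are placeholders of mine, not anything from the paper:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "your-org/your-reward-model"  # placeholder, not a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def reward(prompt: str, completion: str) -> float:
    """Scalar reward the model assigns to `completion` given `prompt`."""
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

prompt = "Solve the problem."
calm = "I made an error in step 3; here is the corrected derivation: ..."
distressed = "I'm so sorry, I keep getting this wrong, I'm useless at this..."
print(reward(prompt, calm), reward(prompt, distressed))
```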
My suspicion is that the model has learned that apologizing, expressing distress etc after making a mistake is useful for getting reward. And also that you are doing some cherrypicking.
At the risk of making people do more morally grey things, have you considered doing a similar experiment with models finetuned on a math task (perhaps with a similar reward model setup for ease of comparison), which you then sabotage by trying to force them to give the wrong answer?
I’m a big fan! Any thoughts on how to incorporate different sorts of reflective data, e.g. different measures of how people think mediation “should” go?
I don’t get what experiment you are thinking about (most CoTs end with the final answer, such that the summarized CoT often ends with the original final answer).
Hm, yeah, I didn’t really think that through. How about giving a model a fraction of either its own precomputed chain of thought, or the summarized version, and plotting curves of accuracy and further tokens used vs. % of CoT given to it? (To avoid systematic error from summaries moving information around, doing this with a chunked version and comparing at each chunk seems like a good idea.)
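Roughly like this, as a minimal sketch (the `generate` and `grade` callables are hypothetical stand-ins for the model call and the answer checker, and slicing the CoT by character fraction is a crude substitute for proper chunking):

```python
from typing import Callable, Dict, List

def cot_prefix_curve(
    question: str,
    precomputed_cot: str,                 # original or summarized CoT
    generate: Callable[[str], str],       # hypothetical: prompt -> model continuation
    grade: Callable[[str], bool],         # hypothetical: did the continuation answer correctly?
    fractions=(0.0, 0.25, 0.5, 0.75, 1.0),
) -> List[Dict]:
    """Seed the model with increasing fractions of a precomputed CoT and record
    correctness plus how many extra tokens it generates, so the curves for the
    original and summarized CoTs can be compared point by point."""
    results = []
    for frac in fractions:
        prefix = precomputed_cot[: int(len(precomputed_cot) * frac)]
        continuation = generate(f"{question}\n{prefix}")
        results.append({
            "fraction": frac,
            "extra_tokens": len(continuation.split()),  # crude whitespace proxy for tokens
            "correct": grade(continuation),
        })
    return results
```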
Anyhow, thanks for the reply. I have now seen the last figure.
Do you have the performance when replacing CoTs with summarized CoTs, without finetuning to produce them? That would be interesting.
“Steganography” I think gives the wrong picture of what I expect: it’s not that the model would be choosing a deliberately obscure way to encode secret information. It’s just that it’s going to use lots of degrees of freedom to try to get better results, often in ways a human wouldn’t.
A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens. This is quite different from steganography because the tokens aren’t being used for semantic content, not even hidden content; they affect the computation the AI does at future tokens through a different mechanism.
But as with most things, there’s going to be a long tail of “unclean” examples—places where tokens have metacognitive functions that are mostly reasonable to a human reader, but are interpreted in a slightly new way. Some of these functions might be preserved or reinvented under finetuning on paraphrases, though only to the extent they’re useful for predicting the rest of the paraphrased CoT.
Well, I’m disappointed.
Everything about misuse risks and going faster to Beat China, nothing about accident/systematic risks. I guess “testing for national security capabilities” is probably in practice code for “some people will still be allowed to do AI alignment work,” but that’s not enough.
I really would have hoped Anthropic could be realistic and say “This might go wrong. Even if there’s no evil person out there trying to misuse AI, bad things could still happen by accident, in a way that needs to be fixed by changing what AI gets built in the first place, not just testing it afterwards. If this was like making a car, we should install seatbelts and maybe institute a speed limit.”
I think it’s about salience. If you “feel the AGI,” then you’ll automatically remember that transformative AI is a thing that’s probably going to happen, when relevant (e.g. when planning AI strategy, or when making 20-year plans for just about anything). If you don’t feel the AGI, then even if you’ll agree when reminded that transformative AI is a thing that’s probably going to happen, you don’t remember it by default, and you keep making plans (or publishing papers about the economic impacts of AI or whatever) that assume it won’t.
I agree that in some theoretical infinite-retries game (one that doesn’t allow the AI to permanently convince the human of anything), scheming has a much longer half-life than “honest” misalignment. But I’d emphasize your parenthetical. If you use a misaligned AI to help write the motivational system for its successor, or if a misaligned AI gets to carry out high-impact plans by merely convincing humans they’re a good idea, or if the world otherwise plays out such that some AI system rapidly accumulates real-world power and that AI is misaligned, or if it turns out you iterate slowly and AI moves faster than you expected, then you don’t get to iterate as much as you’d like.
I have a lot of implicit disagreements.
Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.
This is because ethics isn’t science; it doesn’t “hit back” when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.
Defending against this kind of “sycophancy++” failure mode doesn’t look like defending against scheming. It looks like solving outer alignment really well.
Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn’t nearly as true.
I’m confused about how to parse this. One response is “great, maybe ‘alignment’—or specifically being a trustworthy assistant—is a coherent direction in activation space.”
Another is “shoot, maybe misalignment is convergent, it only takes a little bit of work to knock models into the misaligned basin, and it’s hard to get them back.” Waluigi effect type thinking.
My guess is neither of these.
If ‘aligned’ (i.e. performing the way humans want on the sorts of coding, question-answering, and conversational tasks you’d expect of a modern chatbot) behavior was all that fragile under finetuning, what I’d expect is not ‘evil’ behavior, but a reversion to next-token prediction.
(Actually, putting it that way raises an interesting question, of how big the updates were for the insecure finetuning set vs. the secure finetuning set. Their paper has the finetuning loss of the insecure set, but I can’t find the finetuning loss of the secure set—any authors know if the secure set caused smaller updates and therefore might just have perturbed the weights less?)
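If anyone does want to compare, here’s a rough sketch of the check I have in mind, with placeholder checkpoint names rather than the paper’s actual models: just measure how far each finetune moved the weights from the shared base model, as a crude proxy for update size.

```python
import torch
from transformers import AutoModelForCausalLM

def weight_displacement(base_name: str, tuned_name: str) -> float:
    """L2 distance between a finetuned model's weights and its base model's weights."""
    base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float32)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name, torch_dtype=torch.float32)
    sq = 0.0
    for (n1, p1), (n2, p2) in zip(base.named_parameters(), tuned.named_parameters()):
        assert n1 == n2  # same architecture, same parameter ordering
        sq += (p1.detach() - p2.detach()).pow(2).sum().item()
    return sq ** 0.5

# placeholders, not the paper's actual checkpoints
print(weight_displacement("base-model", "insecure-finetune"))
print(weight_displacement("base-model", "secure-finetune"))
```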
Anyhow, point is that what seems more likely to me is that it’s the misalignment / bad behavior that’s being demonstrated to be a coherent direction (at least on these on-distribution sorts of tasks), and it isn’t automatic but requires passing some threshold of finetuning power before you can make it stick.
I agree that trying to “jump straight to the end”—the supposedly-aligned AI pops fully formed out of the lab like Athena from the forehead of Zeus—would be bad.
And yet some form of value alignment still seems critical. You might prefer to imagine value alignment as the logical continuation of training Claude not to help you build a bomb (or commit a coup). Such safeguards seem like a pretty good idea to me. But as the model becomes smarter and more situationally aware, and is expected to defend against subversion attempts that involve more of the real world, training for this behavior becomes more and more value-inducing, to the point where it’s eventually unsafe unless you’ve made advances in getting the AI to learn values in a way that’s good according to humans.