I think you’re missing the primary theory of change for all of these techniques, which I would say is particularly compatible with your “follow-the-trying” approach.
While all of these are often analyzed from the perspective of “suppose you have a potentially-misaligned powerful AI; here’s what would happen”, I view that as an analysis tool, not the primary theory of change.
The theory of change that I most buy is this: as you train your model, while it is developing the “trying”, you would like it to develop good “trying” and not bad “trying”. One way to make this more likely is to notice when bad “trying” develops and penalize it, with the hope that this leads to good “trying”.
This is illustrated in the theory-of-change diagram below where, to put it in your terminology:
- Each of the clouds (red or blue) consists of models that are “trying”.
- The grey models outside of clouds are models that are not “trying” or are still developing “trying”.
- The “deception rewarded” point occurs when a model that is developing “trying” does something bad due to instrumental / deceptive reasoning.
- “Apply alignment technique” means that you use debate / RRM / ELK instead of vanilla RLHF, which allows you to notice the model doing something bad and penalize it instead of rewarding it (a toy sketch of this loop follows the list).
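To make that last bullet concrete, here is a minimal sketch of the kind of training loop I have in mind. It is an illustration of the theory of change, not any actual implementation: `reward_model` is a stand-in for the learned reward in vanilla RLHF, and `overseer_flags_deception` is a hypothetical stand-in for whatever extra signal debate / RRM / ELK would provide.

```python
# Minimal sketch (hypothetical names throughout): vanilla RLHF, except that
# episodes the overseer flags as deceptive are penalized instead of rewarded.

def finetune_step(policy, prompts, reward_model, overseer_flags_deception,
                  deception_penalty=-10.0):
    """One RL update in which flagged-deceptive behaviour is selected against."""
    episodes = [policy.rollout(prompt) for prompt in prompts]
    rewards = []
    for episode in episodes:
        if overseer_flags_deception(episode):
            # "Notice it doing something bad and penalize it instead of
            # rewarding it": the overseer's flag overrides the reward model.
            rewards.append(deception_penalty)
        else:
            # Otherwise reward as in vanilla RLHF.
            rewards.append(reward_model.score(episode))
    policy.update(episodes, rewards)  # e.g. a PPO step in a real setup
    return rewards
```

The hope is that running this while the “trying” is still forming selects for good “trying”, rather than merely teaching an already-formed bad “trying” to hide.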
Some potential objections + responses:
Objection: But the model will be “trying” right after pretraining, before you’ve even done any finetuning!
Response: I don’t think this is obvious, but if that is the case, that just means you should also be doing alignment work during pretraining.
Objection: But all of these techniques are considering models that already have all their concepts baked in, rather than developing them on the fly!
Response: I agree that’s what we’re thinking about now, and I agree that eventually we will need to think about models that develop concepts on the fly. But I think the overall theory of change here would still apply in that setting, even if we need to change the techniques somewhat to accommodate this new kind of capability.
Thanks, that helps! You’re working under a different development model than me, but that’s fine.
It seems to me that the real key ingredient in this story is where you propose to update the model based on motivation and not just behavior—“penalize it instead of rewarding it” if the outputs are “due to instrumental / deceptive reasoning”. That’s great. Definitely what we want to do. I want to zoom in on that part.
You write that “debate / RRM / ELK” are supposed to “allow you to notice” instrumental / deceptive reasoning. Of these three, I buy the ELK story—ELK is sorta an interpretability technique, so it seems plausible that ELK is relevant to noticing deceptive motivations (even if the ELK literature is not really talking about that too much at this stage, per Paul’s comment). But what about debate & RRM? I’m more confused about why you brought those up in this context. Traditionally, those techniques are focused on what the model is outputting, not what the model’s underlying motivations are. But I haven’t read all the literature. Am I missing something?
(We can give the debaters / the reward model a printout of model activations alongside the model’s behavioral outputs. But I’m not sure what the next step of the story is, after that. How do the debaters / reward model learn to skillfully interpret the model activations to extract underlying motivations?)
> Traditionally, those techniques are focused on what the model is outputting, not what the model’s underlying motivations are. But I haven’t read all the literature. Am I missing something?
It’s confusing to me as well, perhaps because different people (or even the same person at different times) emphasize different things within the same approach, but here’s one post where someone said, “It is important that the overseer both knows which action the distilled AI wants to take as well as why it takes that action.”
I’m not claiming that you figure out whether the model’s underlying motivations are bad. (Or, reading back what I wrote, I did say that but it’s not what I meant, sorry about that.) I’m saying that when the model’s underlying motivations are bad, it may take some bad action. If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.
It’s plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it’s also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.
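As a toy illustration of that last claim, here is a small Bayesian calculation with entirely made-up numbers. It deliberately ignores the failure mode in the first sentence by assuming a bad-motivation model attempts bad actions at a fixed rate, and just asks how much a clean training record should update you toward good motivations as your catch rate improves.

```python
# Toy Bayesian illustration; every number below is made up.
# Setup: 50/50 prior on good vs bad motivations. A bad-motivation model
# attempts a bad action in 10% of episodes, and each attempt is caught
# (and selected against) with probability `catch_rate`. A good-motivation
# model never attempts bad actions.

def posterior_good(n_episodes, catch_rate, prior_bad=0.5, attempt_rate=0.1):
    """P(good motivations | no bad actions were caught during training)."""
    # A bad-motivation model gets through an episode uncaught if it either
    # doesn't attempt anything bad, or attempts and evades detection:
    # 1 - attempt_rate + attempt_rate * (1 - catch_rate) = 1 - attempt_rate * catch_rate.
    p_clean_given_bad = (1 - attempt_rate * catch_rate) ** n_episodes
    p_clean_given_good = 1.0
    prior_good = 1 - prior_bad
    return (prior_good * p_clean_given_good) / (
        prior_good * p_clean_given_good + prior_bad * p_clean_given_bad
    )

for catch_rate in [0.0, 0.5, 0.9, 0.99]:
    p = posterior_good(n_episodes=20, catch_rate=catch_rate)
    print(f"catch_rate={catch_rate:.2f}: P(good | clean record) = {p:.3f}")
```

The same setup also shows where the caveat bites: if a bad-motivation model can drive its attempt rate toward zero whenever it might be caught, the clean record stops being evidence of anything.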
> If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.
> It’s plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it’s also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.
but, but, standard counterargument: imperfect proxies, Goodharting, magnification of error, adversarial amplification, etc. etc. etc.?
(It feels weird that this point consistently comes up in discussions of this type, considering how basic a disagreement it really is, but lots of things really do seem to me to come back to it over and over again.)
Indeed, I am confused about why people think Goodharting is effectively 100% likely to happen and also to lead to all the humans dying. That seems incredibly extreme. None of the examples people give of Goodharting lead to all the humans dying.
(Yes, I’m aware that the arguments are more sophisticated than that and “previous examples of Goodharting didn’t lead to extinction” isn’t a rebuttal to them, but that response does capture some of my attitude towards the more sophisticated arguments, something like “that’s a wildly strong conclusion you’ve drawn from a pretty handwavy and speculative argument”.)
Ultimately I think you just want to compare various kinds of models and ask how likely they are to arise (assuming you are staying within the “scaled-up neural networks as AGI” paradigm). Some models you could consider:
1. The idealized aligned model, which does whatever it thinks is best for humans.
2. The savvy aligned model, which wants to help humans but knows that it should play into human biases (e.g. by being sycophantic) in order to get high reward and not be selected against by gradient descent.
3. The deceptively aligned model, which wants some misaligned goal (say paperclips), but knows that it should behave well until it can execute a treacherous turn.
4. The bag-of-heuristics model, which (like a human) has a mix of various motivations and mostly executes past strategies that have worked out well, imitating many of them from broader culture without a great understanding of why they work; this tends to lead to high reward without extreme consequentialism.
(Really I think every model is going to be (4) until significantly past human-level, though models will be on a spectrum of how close they are to (2) or (3).)
Plausibly you don’t get (1) because it doesn’t get particularly high reward relative to the others. But (2), (3) and (4) all seem like they could get similarly high reward. I think you could reasonably say that Goodharting is the reason you get (2), (3), or (4) rather than (1). But then amongst (2), (3) and (4), probably only (3) causes an existential catastrophe through misalignment.
You could then consider other factors like simplicity or training dynamics to say which of (2), (3) and (4) are likely to arise, but (a) this is no longer about Goodharting, (b) it seems incredibly hard to make arguments about simplicity / training dynamics that I’d actually trust, (c) the arguments often push in opposite directions (e.g. shard theory vs how likely is deceptive alignment), (d) a lot of these arguments also depend on capability levels, which introduces another variable into the mix (now allowing for arguments like this one).
The argument I’m making above is one about training dynamics. Specifically, the claim is that if a model is on a path towards (3), it will probably take some bad actions early on (attempts at deception that fail), and if you successfully penalize those, that plausibly switches the model towards (2) or (4).
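For what it’s worth, here is a deliberately crude random-walk toy of that claim. Every mechanism and number in it is invented, so treat it as the shape of the argument rather than a model of real training: a model developing its “trying” drifts along a one-dimensional axis toward or away from deceptive alignment, caught deception attempts push it back, and uncaught (rewarded) attempts push it further along.

```python
import random

# Crude toy (all numbers invented): x in [-1, 1] tracks how close the
# developing model is to deceptive alignment, with (3) at +1 and the
# honest outcomes (2)/(4) at or below 0. Caught deception attempts are
# penalized and push x down; uncaught attempts are rewarded and push x up.

def run_training(catch_rate, n_steps=1000, seed=None):
    rng = random.Random(seed)
    x = 0.0  # nascent "trying", not yet in either cloud
    for _ in range(n_steps):
        x += rng.gauss(0.0, 0.02)          # background drift from training
        attempt_prob = 0.05 * max(x, 0.0)  # more deceptive -> more attempts
        if rng.random() < attempt_prob:    # an early, clumsy deception attempt
            if rng.random() < catch_rate:
                x -= 0.3  # caught and penalized: pushed away from (3)
            else:
                x += 0.3  # "deception rewarded": pushed further toward (3)
        x = max(-1.0, min(1.0, x))
    return x > 0.5  # did this run end up (roughly) deceptively aligned?

for catch_rate in [0.0, 0.5, 0.9]:
    outcomes = [run_training(catch_rate, seed=i) for i in range(200)]
    print(f"catch_rate={catch_rate:.1f}: fraction ending deceptive ≈ "
          f"{sum(outcomes) / len(outcomes):.2f}")
```

The toy is built so that the fraction of runs ending in the deceptive cloud falls as the catch rate rises; that is exactly the training-dynamics claim above, made mechanical.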