This is definitely a real concern, and one that I do spend time thinking about. Here are some thoughts:
First, if you want to understand what a model M is doing, it’s definitely going to be necessary to have access to the history/provenance of how/why each different component of the model came to be during the training process. This shouldn’t be that difficult in an informed oversight setting if you do something like the following, though. Imagine some sequence of models M_i and suppose the minimum number of steps for an attack of this form to be introduced is r. Then, at step i+r−1 we can implement oversight by calling Amp(M_i) to inspect the whole sequence M_{i+1}, M_{i+2}, …, M_{i+r−1} and look for any attacks being developed and then retrain from right before that point in such a way that fixes them.
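To make that schedule concrete, here’s a rough sketch of one way it could be implemented (purely illustrative: `train_step`, `amp_inspect`, and the checkpoint handling are hypothetical placeholders, not a description of any real setup):

```python
# Illustrative sketch only: `train_step` advances training by one step and
# `amp_inspect(overseer, candidate)` stands in for Amp(M_i) judging whether a
# later checkpoint looks like it has an attack being developed in it.

def train_with_periodic_oversight(model, train_step, amp_inspect, r, total_steps):
    vetted = [model]                       # vetted[-1] plays the role of M_i
    step = 0
    while step < total_steps:
        overseer = vetted[-1]
        window = []
        for _ in range(r - 1):             # produce M_{i+1}, ..., M_{i+r-1}
            model = train_step(model)
            window.append(model)
            step += 1
        verdicts = [amp_inspect(overseer, m) for m in window]
        if all(verdicts):
            vetted.extend(window)          # nothing suspicious; keep going
        else:
            first_bad = verdicts.index(False)
            # Retrain from right before the first suspicious checkpoint.
            model = window[first_bad - 1] if first_bad > 0 else overseer
            # (A real version would also need to change something about
            # training here, or the same attack may just be rediscovered.)
    return model
```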
Second, however, it does still have to be the case that the overseer isn’t too exploitable. If there is some strategy available to M that causes Amp(M) to strongly believe that M is acceptable, and that strategy is easy to reach without going through anything that looks unacceptable, then this won’t work. However, there are reasons to believe that this condition might not be that hard to satisfy. First, it’s fine if Amp(M) is just uncertain about whether M is acceptable; we can just retrain the model to be more transparent in that case and be alright. Second, many of the natural sorts of attacks in this class are of the form where they exploit some knowledge that M has but that its overseer doesn’t, but in the informed oversight setup, Amp(M) should know everything that M knows, preventing those sorts of attacks from working.
Third, I expect the training process to be quite a lot dumber than the model such that I expect the most concerning attacks on the overseer to come from the model and not the training process. It is conceivable, however, that the model could “hack” its own gradient so as to get the training process to modify it in a particular way, making an attack from the model look like it was coming from the training process. In this case, I would hope to prevent attacks of this form by ensuring that the model is properly myopic such that it isn’t considering plans of this form.
Finally, if you’re concerned about attacks of this form, you can modify your training process to make them less likely. In particular, in the standard setup that I generally think about, oversight is included as part of the model’s loss function such that the model is incentivized to make itself transparent to the overseer. However, there are many different ways you can do this. Rather than have oversight as a penalty term, for example, you could have it as a constraint, which could help avoid Goodharting on the overseer (this is similar in spirit to your suggestion of using a quantilizer). Alternatively, you could not include oversight as part of the loss function at all and instead rely on the overseer manually fixing unacceptable models or retraining until an acceptable model is found, though I expect those sorts of approaches to not be very competitive.
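To illustrate the penalty-vs-constraint distinction, here is a toy sketch (`task_loss`, `oversight_score`, and `update` are hypothetical placeholders, not anything from a real training setup):

```python
# Toy contrast between oversight as a penalty term and oversight as a constraint.
# `task_loss`, `oversight_score` (higher = more acceptable/transparent), and
# `update` are placeholder functions standing in for a real training setup.

PENALTY_WEIGHT = 1.0            # illustrative value
ACCEPTABILITY_THRESHOLD = 0.9   # illustrative value

def penalty_step(model, batch, task_loss, oversight_score, update):
    # Oversight enters the objective itself, so the model is directly
    # optimized to look good to the overseer (the Goodhart risk above).
    loss = task_loss(model, batch) + PENALTY_WEIGHT * (1.0 - oversight_score(model))
    return update(model, loss)

def constrained_step(model, batch, task_loss, oversight_score, update):
    # Oversight is not part of the objective; it only gates which updates are
    # kept, so the training signal itself isn't pushing toward fooling the overseer.
    candidate = update(model, task_loss(model, batch))
    if oversight_score(candidate) >= ACCEPTABILITY_THRESHOLD:
        return candidate
    return model  # reject the update and keep the previous model
```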
Thanks for the reply. Good to know that it’s on someone’s radar already. I have some follow-up questions, which you don’t have to answer all at once, since obviously this is an ongoing research project.
Ah, this is a useful connection to make. BTW I still have an unanswered question about provenance.
Imagine some sequence of models M_i and suppose the minimum number of steps for an attack of this form to be introduced is r.
This part confuses me because intuitively it seems like if r is larger that would make the problem easier, but in your proposal Amp(M_i) would need to inspect a longer sequence of models.
Also, roughly how many calls to M_i do you usually imagine Amp(M_i) as making? What does “look for any attacks” actually consist of here? How can you be sure that whatever Amp(M_i) is doing will catch (or at least become suspicious/uncertain about) all possible attacks (and not have so many false positives that competitiveness is significantly impacted)?
many of the natural sorts of attacks in this class are of the form where they exploit some knowledge that M has but that its overseer doesn’t
Ruling out one class of attacks doesn’t seem to help much… To make a strong argument it seems like you’d need to exhaustively list all possible classes of attack and show how the overseer can catch all of them.
Third, I expect the training process to be quite a lot dumber than the model such that I expect the most concerning attacks on the overseer to come from the model and not the training process.
The training process could be dumb but have access to a lot more compute and end up optimizing harder than the model would. It would take a lot longer to develop each attack, but given enough compute it would eventually find all possible attacks.
Future advances in ML/RL could make the training process smarter and the whole point of IDA is to take advantage of such advances to keep it competitive, so you can’t assume that the training process is dumb.
At the lower levels of IDA both the model and the overseer are pretty dumb, so even a moderately smart training process could be smarter.
(These arguments are really handwavy but hopefully they get my point/concern across.)
Rather than have oversight as a penalty term, for example, you could have it as a constraint, which could help avoid Goodharting on the overseer
Not sure what you have in mind here. If oversight is a constraint, what is providing the reward?
this is similar in spirit to your suggestion of using a quantilizer
Seems like with a quantilizer you’d take a competitiveness hit because the agent would no longer find the best plans (that a comparable RL agent could find). How conservative do we need to be here, to ensure safety? How do we analyze the competitiveness tradeoff, to make sure the competitiveness hit isn’t too big?
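(For reference, the quantilizer idea is roughly: instead of taking the single best-looking plan, sample plans from some trusted base distribution and pick randomly among the top q fraction by estimated value. A minimal sketch, with `sample_base_plan` and `estimated_value` as hypothetical placeholders:)

```python
# Minimal quantilizer sketch: `sample_base_plan` draws from a trusted base
# distribution and `estimated_value` is the (possibly exploitable) value
# estimate; both are hypothetical placeholders.

import random

def quantilize(sample_base_plan, estimated_value, q=0.1, n=1000):
    plans = [sample_base_plan() for _ in range(n)]
    plans.sort(key=estimated_value, reverse=True)
    top = plans[:max(1, int(q * n))]
    return random.choice(top)  # random choice among the top q fraction, not argmax

# Smaller q approaches pure optimization (more competitive but more exploitable);
# larger q stays closer to the base distribution (safer, but a bigger
# competitiveness hit), which is the tradeoff being asked about here.
```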
This part confuses me because intuitively it seems like if r is larger that would make the problem easier, but in your proposal Amp(M_i) would need to inspect a longer sequence of models.
You can avoid having to inspect the whole sequence by doing binary search to find the last model for which oversight works.
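A minimal sketch of that idea (it assumes acceptability only flips once along the sequence, i.e. once oversight starts failing it keeps failing; `passes_oversight` is a hypothetical placeholder):

```python
# Binary search for the last checkpoint that still passes oversight, so Amp
# only needs O(log n) oversight calls instead of inspecting every model in the
# sequence. Assumes acceptability is monotone along the sequence.

def last_passing_checkpoint(checkpoints, passes_oversight):
    lo, hi = 0, len(checkpoints) - 1
    if not passes_oversight(checkpoints[0]):
        return None                       # even the earliest checkpoint fails
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if passes_oversight(checkpoints[mid]):
            lo = mid                      # everything up to mid still passes
        else:
            hi = mid - 1                  # the failure began at or before mid
    return lo                             # index of the last acceptable model
```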
Also, roughly how many calls to M_i do you usually imagine Amp(M_i) as making? What does “look for any attacks” actually consist of here? How can you be sure that whatever Amp(M_i) is doing will catch (or at least become suspicious/uncertain about) all possible attacks (and not have so many false positives that competitiveness is significantly impacted)?
Well, that’s the hard part; this is still a very open question. I’m currently working on a large post summarizing my thoughts on this question, though, so rather than try to start laying out my thoughts here I think I’ll just defer until that’s out, which should be fairly soon.
Ruling out one class of attacks doesn’t seem to help much… To make a strong argument it seems like you’d need to exhaustively list all possible classes of attack and show how the overseer can catch all of them.
Yep. However, I’m optimistic that this will be possible if we can get a better understanding of what transparent optimization looks like, since then Amp(M) can pursue the strategy of only being certain of the acceptability of models that it can verify are in fact being transparent to it.
The training process could be dumb but have access to a lot more compute and end up optimizing harder than the model would. It would take a lot longer to develop each attack, but given enough compute it would eventually find all possible attacks.
True, but the nice thing about that is that it should be slow and continuous such that if you’re doing oversight during training according to a scheme like the one I described then you should be able to catch it.
Future advances in ML/RL could make the training process smarter and the whole point of IDA is to take advantage of such advances to keep it competitive, so you can’t assume that the training process is dumb.
Well, maybe dumb was the wrong word there. I agree that it’s bad to assume that the training process is dumb, but I think it’s fine to assume that it’ll be slow and continuous such that there’s always some previous model such that an amplified version of it can oversee the current model.
At the lower levels of IDA both the model and the overseer are pretty dumb, so even a moderately smart training process could be smarter.
Yeah, this is where you’re relying on the human more. For early models, I think you’re mostly relying on the human having access to good enough interpretability tools that they can understand simple models without help.
Not sure what you have in mind here. If oversight is a constraint, what is providing the reward?
Ah, this might be a big difference in the way we’re thinking about this problem. I see informed oversight as an inner alignment technique which you can do if you are training your model via the outer alignment technique of amplification. Specifically:
Let M be the model and Amp be your amplification operation, which in the pure supervised case just means letting a human answer the question given access to the model. Define train to be some procedure for training M to minimize
$$L_M = \mathbb{E}_{x \in X}\left[\,\lVert M(x) - \mathrm{Amp}(M)(x) \rVert\,\right].$$
Then, let M_{i+1} = train(M_i) and do informed oversight by letting Amp(M_i) inspect M_i, …, M_{i+r−1} as above. In this picture, there are two distinct training signals: the supervised training signal through train which is telling M to approximate Amp(M), and the oversight training signal, which is trying to get M to be transparent and to verify that it is in fact doing the right thing.
(In practice, you’ll probably want to incorporate oversight into the loss function so you get a better gradient at each step, though that’s difficult because oversight is likely to be very expensive. However, you could do something like train an approximation to oversight that you use in your loss function in-between actual oversight calls. Regardless, I think this is mostly an implementation detail and generally focus on the above picture.)
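To put the pieces of this picture together, here’s a rough sketch (purely illustrative; `amp`, `true_oversight`, `fit_oversight_proxy`, and `sgd_step` are hypothetical placeholders, and model outputs are treated as scalars for simplicity):

```python
# Illustrative sketch of the two training signals: a supervised distillation
# loss pushing M(x) toward Amp(M)(x), plus a cheap learned proxy for oversight
# used in between (expensive) true oversight calls. All named functions are
# hypothetical placeholders; outputs are treated as scalars for simplicity.

def distillation_loss(model, amp, batch):
    # L_M = E_x[ ||M(x) - Amp(M)(x)|| ], approximated on a minibatch.
    return sum(abs(model(x) - amp(model, x)) for x in batch) / len(batch)

def train(model, batches, amp, true_oversight, fit_oversight_proxy, sgd_step, r):
    proxy = fit_oversight_proxy(true_oversight, model)  # cheap stand-in for oversight
    for i, batch in enumerate(batches):
        loss = distillation_loss(model, amp, batch) + (1.0 - proxy(model))
        model = sgd_step(model, loss)                   # supervised signal
        if (i + 1) % (r - 1) == 0:                      # periodic true oversight
            if not true_oversight(model):
                raise RuntimeError("oversight flagged the model; roll back and retrain")
            proxy = fit_oversight_proxy(true_oversight, model)  # refresh the proxy
    return model
```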
Seems like with a quantilizer you’d take a competitiveness hit because the agent would no longer find the best plans (that a comparable RL agent could find). How conservative do we need to be here, to ensure safety? How do we analyze the competitiveness tradeoff, to make sure the competitiveness hit isn’t too big?
I think this is related to what I just said above. The way I was thinking about this was that you could train your model to maximize its accuracy at predicting Amp(M)’s behavior, subject to the constraint that it always be transparently acceptable to Amp(M) during oversight. Thus, competitiveness is mostly going to be a function of 1) how competitive HCH is, since that’s the thing you’re approximating, and 2) how much overhead oversight adds.
I agree that it’s bad to assume that the training process is dumb, but I think it’s fine to assume that it’ll be slow and continuous such that there’s always some previous model such that an amplified version of it can oversee the current model.
Why do you think it’s fine to assume this? (Do you think this is unlikely to change in the future, or just that this is the limit of the scope of the problem that you’re working on?)
Yeah, this is where you’re relying on the human more. For early models, I think you’re mostly relying on the human having access to good enough interpretability tools that they can understand simple models without help.
in the pure supervised case just means letting a human answer the question given access to the model
This looks like a big disconnect between us. The thing that touched off this discussion was Ought’s switch from Factored Cognition to Factored Evaluation, and Rohin’s explanation: “In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal.”
I think if we’re using SL for the question answering part and only using RL for “oversight” (“trying to get M to be transparent and to verify that it is in fact doing the right thing”) then I’m a lot more optimistic since we only have to worry about security / reward gaming problems in the latter part, and we can do things like making it a constraint and doing quantilization without worrying about competitiveness. But I think Paul and Ought’s plan is to use RL for both. In that case it doesn’t help much to make “oversight” a constraint since the security / reward gaming problem in the “answer evaluation” part would still be there. And the problem just generally seems a lot harder because there could be so many different kinds of flaws in the “answer evaluation” part that could be exploited.
Why do you think it’s fine to assume this? (Do you think this is unlikely to change in the future, or just that this is the limit of the scope of the problem that you’re working on?)
I think I would be fairly surprised if future ML techniques weren’t smooth in this way, so I think it’s a pretty reasonable assumption.
This seems to be assuming High Bandwidth Overseer. What about LBO?
A low-bandwidth overseer seems unlikely to be competitive to me. Though it’d be nice if it worked, I think you’ll probably want to solve the problem of weird hacky inputs via something like filtered-HCH instead. That being said, I expect the human to drop out of the process fairly quickly—it’s mostly only useful in the beginning before the model learns how to do decompositions properly—at some point you’ll want to switch to implementing Amp(M) as M consulting M rather than H consulting M.
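To spell out that last point, a toy sketch of the Amp operator, where the human only does the top-level decomposition early on (`decompose`, `answer`, and `recombine` are a hypothetical placeholder interface):

```python
# Toy sketch: Amp(M) as a decompose-and-recombine operator where the human H
# only does the top-level decomposition early in training, and M takes over
# that role later. `decompose`, `answer`, and `recombine` are a hypothetical
# placeholder interface, not a real API.

def amplify(question, model, human=None):
    decomposer = human if human is not None else model  # H consulting M early,
    subquestions = decomposer.decompose(question)       # M consulting M later
    subanswers = [model.answer(q) for q in subquestions]
    return decomposer.recombine(question, subanswers)
```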
This looks like a big disconnect between us. The thing that touched off this discussion was Ought’s switch from Factored Cognition to Factored Evaluation, and Rohin’s explanation: “In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal.”
I think if we’re using SL for the question answering part and only using RL for “oversight” (“trying to get M to be transparent and to verify that it is in fact doing the right thing”) then I’m a lot more optimistic since we only have to worry about security / reward gaming problems in the latter part, and we can do things like making it a constraint and doing quantilization without worrying about competitiveness. But I think Paul and Ought’s plan is to use RL for both. In that case it doesn’t help much to make “oversight” a constraint since the security / reward gaming problem in the “answer evaluation” part would still be there. And the problem just generally seems a lot harder because there could be so many different kinds of flaws in the “answer evaluation” part that could be exploited.
I mostly agree with this and it is a disagreement I have with Paul in that I am more skeptical of relaxing the supervised setting. That being said, you definitely can still make oversight a constraint even if you’re optimizing an RL signal and I do think it helps, since it gives you a way to separately verify that the system is actually being transparent. The idea in this sort of setting would be that, if your system achieves high performance on the RL signal, then it must be outputting answers which the amplified human likes—but then the concern is that it might be tricking the human or something. But if you can use oversight to look inside the model and verify that it’s actually being transparent, then you can rule out that possibility. By making the transparency part a constraint rather than an objective, it might help prevent the model from gaming the transparency part, which I expect to be the most important part, and that in turn could help you detect if there was any gaming of the RL signal going on.
For the record, though, I don’t currently think that making the transparency part a constraint is a good idea. First, because I expect transparency to be hard enough that you’ll want to be able to benefit from having a strong gradient towards it. And second, because I don’t think it actually helps prevent gaming very much: even if your training process doesn’t explicitly incentivize gaming, I expect that by default many mesa-optimizers will have objectives that benefit from it. Thus, what you really want is a general solution for preventing your mesa-optimizer from ever doing anything like that, which I expect to be something like corrigibility or myopia, rather than just trying to rely on your training process not incentivizing it.
I think I would be fairly surprised if future ML techniques weren’t smooth in this way, so I think it’s a pretty reasonable assumption.
This is kind of tangential at this point, but I’m not so sure about this. Humans can sometimes optimize things without being slow and continuous, so there must be algorithms that can do this, which could be invented directly or themselves produced via (dumber) ML. As another intuition pump, suppose the algorithm is just gradient descent with some added pattern recognizers that can say “hey, I see where this is going, let’s jump directly there.”
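As a toy version of that intuition pump (entirely illustrative; nothing more than ordinary gradient descent with a crude extrapolation rule bolted on):

```python
# Toy illustration: gradient descent plus a crude "I see where this is going"
# rule that takes a large extrapolated jump whenever the last few gradients
# have all pointed in nearly the same direction. Purely an intuition pump.

import numpy as np

def descend_with_jumps(grad, x, lr=0.01, steps=200, window=5, jump=50.0):
    recent = []
    for _ in range(steps):
        g = grad(x)
        x = x - lr * g
        recent.append(g)
        if len(recent) >= window:
            last = recent[-window:]
            mean_dir = np.mean(last, axis=0)
            norm = np.linalg.norm(mean_dir)
            aligned = norm > 0 and all(
                np.dot(gi, mean_dir) >= 0.99 * np.linalg.norm(gi) * norm
                for gi in last
            )
            if aligned:                     # "let's jump directly there"
                x = x - jump * lr * mean_dir
                recent.clear()
    return x

# e.g. descend_with_jumps(lambda x: 2.0 * x, np.array([10.0]))
```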
A low-bandwidth overseer seems unlikely to be competitive to me.
Has this been written up anywhere, and is it something that Paul agrees with? (I think last time he talked about HBO vs LBO, he was still 50/50 on them.)
I mostly agree with this and it is a disagreement I have with Paul in that I am more skeptical of relaxing the supervised setting.
Ah ok, so your earlier comments were addressing a different and easier problem than the one I have in mind, and that’s (mostly) why you sounded more optimistic than me.
Do you have a good sense of why Paul disagrees with you, and if so can you explain?
(I haven’t digested your mechanistic corrigibility post yet. May have more to say after I do that.)