I may have asked this already somewhere, but do you know if there’s a notion of “outer aligned” that is applicable to oracles/predictors in general, as opposed to trying to approximate/predict HCH specifically? Basically the problem is that I don’t know what “aligned” or “trying to do what we want” could mean in the general case. Is “outer alignment” meant to be applicable in the general case?
This post talks about outer alignment of the loss function. Do you think it also makes sense to talk about outer alignment of the training process as a whole, so that, for example, if there is a security hole in the hardware or software environment and the model takes advantage of the security hole to hack its loss/reward, then we’d call that an “outer alignment failure”? Or would it make more sense to use different terminology for that?
Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want.
So technically, one should say that a loss function is outer aligned at optimum with respect to some model class, right?
Also, related to Ofer’s comment, can you clarify whether this definition intends the loss function to look only at the model’s input/output behavior, or whether it can also take into account other information about the model?
HCH is just a bunch of humans after all and if you can instruct them not to do dumb things like instantiate arbitrary Turing machines
I believe the point about Turing machines was that, given a Low Bandwidth Overseer (LBO), it’s not clear how to get HCH/IA to do complex tasks without making it instantiate arbitrary Turing machines. But other issues arise with a High Bandwidth Overseer (HBO), as William Saunders wrote in the above-linked post:
The reason for this system [LBO] being introduced is wanting to avoid security issues as the system scales. The fear is that there would be an “attack” on the system: an input that could be shown to an overseer that would cause the overseer to become corrupted and try to sabotage the system. This could be some kind of misleading philosophical argument, some form of blackmail, a human adversarial example, etc. If an input like this exists, then as soon as the first agent is corrupted, it can try to spread the attack to other agents. The first agent could be corrupted either by chance, or through an attack being included in the input.
I understand you don’t want to go into details about whether theoretical HCH is aligned or not here, but I still want to flag that “instruct them not to do dumb things like instantiate arbitrary Turing machines” seems rather misleading. I’m also curious whether you have HBO or LBO in mind for this post.
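To make the recursion concrete, here is a minimal toy sketch of the HCH structure under discussion, with a message-length cap as a crude stand-in for the LBO restriction; `toy_human` and `decompose` are hypothetical placeholders, not anything proposed in this post or the linked one:

```python
# Toy sketch of HCH as a recursive tree of question-answerers.
# toy_human and decompose are hypothetical stand-ins for a real human overseer;
# only the recursive consulting structure and the bandwidth cap matter here.

MAX_MESSAGE_LEN = 64  # crude stand-in for the low-bandwidth (LBO) restriction


def toy_human(question: str, subanswers: list[str]) -> str:
    # A real H would deliberate; this placeholder just combines what it was given.
    return f"answer to {question!r} given {subanswers}"


def decompose(question: str) -> list[str]:
    # A real H would pick informative sub-questions; these are placeholders.
    return [f"first half of {question}", f"second half of {question}"]


def hch(question: str, depth: int) -> str:
    # Each node is a human who may consult copies of HCH on sub-questions.
    if len(question) > MAX_MESSAGE_LEN:
        raise ValueError("message exceeds the overseer's allowed bandwidth")
    if depth == 0:
        return toy_human(question, [])
    subanswers = [hch(q, depth - 1) for q in decompose(question)]
    return toy_human(question, subanswers)


print(hch("What should we do about X?", depth=2))
```

Nothing here is a proposal; it is only meant to make concrete how an LBO-style bandwidth restriction limits what each node can be asked.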
Another thing that maybe I didn’t make clear previously:
I believe the point about Turing machines was that, given a Low Bandwidth Overseer (LBO), it’s not clear how to get HCH/IA to do complex tasks without making it instantiate arbitrary Turing machines.
I agree, but if you’re instructing your humans not to instantiate arbitrary Turing machines, then that’s a competitiveness claim, not an alignment claim. I think there are lots of very valid reasons for thinking that HCH is not competitive—I only said I was skeptical of the reasons for thinking it wouldn’t be aligned.
I think there are lots of very valid reasons for thinking that HCH is not competitive—I only said I was skeptical of the reasons for thinking it wouldn’t be aligned.
But if you put aside competitiveness, can’t HCH be trivially aligned? E.g., you could give the humans making up HCH instructions that make it unable to answer anything except simple arithmetic questions. So it seems that a claim that HCH is aligned is meaningless unless the claim is about being aligned at some level of competitiveness.
That’s a good point. What I really mean is that I think the sort of HCH that you get out of taking actual humans and giving them careful instructions is more likely to be uncompetitive than it is to be unaligned. Also, I think that “HCH for a specific H” is more meaningful than “HCH for a specific level of competitiveness,” since we don’t really know what weird things you might need to do to produce an HCH with a given level of competitiveness.
Is “outer alignment” meant to be applicable in the general case?
I’m not exactly sure what you’re asking here.
Do you think it also makes sense to talk about outer alignment of the training process as a whole, so that, for example, if there is a security hole in the hardware or software environment and the model takes advantage of the security hole to hack its loss/reward, then we’d call that an “outer alignment failure”?
I would call that an outer alignment failure, but only because I would say that the ways in which your loss function can be hacked are part of the specification of your loss function. However, I wouldn’t consider an entire training process to be outer aligned—rather, I would just say that an entire training process is aligned. I generally use outer and inner alignment to refer to different components of aligning the training process: the objective/loss function/environment in the case of outer alignment, and the inductive biases/architecture/optimization procedure in the case of inner alignment. (Note that this is a more general definition than the one used in “Risks from Learned Optimization,” since it makes no mention of mesa-optimizers, though I would still say that mesa-optimization is my primary example of how you could get an inner alignment failure.)
So technically, one should say that a loss function is outer aligned at optimum with respect to some model class, right?
Yes, though in the definition I gave here I just used the model class of all functions, which is obviously too large but has the nice property of making the definition fully general.
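Spelled out as a rough sketch (my notation: $\mathcal{M}$ for the model class, $\mathcal{L}$ for the loss), the intuitive definition above becomes something like:

$$\mathcal{L} \text{ is outer aligned at optimum w.r.t. } \mathcal{M} \;\Longleftrightarrow\; \forall\, M \in \operatorname*{arg\,min}_{M' \in \mathcal{M}} \mathcal{L}(M'),\quad M \text{ is at least trying to do what we want,}$$

with the version in the post recovered by taking $\mathcal{M}$ to be the class of all functions.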
Also, related to Ofer’s comment, can you clarify whether this definition intends the loss function to look only at the model’s input/output behavior, or whether it can also take into account other information about the model?
I would include all possible input/output channels in the domain/codomain of the model when interpreted as a function.
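One way to read that formally (a gloss, not a quote from the post): collect every input channel into a set $X$ and every output channel into a set $Y$, so that

$$M : X \to Y, \qquad \mathcal{L} : (X \to Y) \to \mathbb{R},$$

i.e., the loss is a functional on the function space, so it can depend on the model only through its input/output behavior across all of those channels, not through its weights or other internals.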
I’m also curious whether you have HBO or LBO in mind for this post.
I generally think you need HBO and am skeptical that LBO can actually do very much.