That the specific ways people give feedback isn’t very relevant. It seems like the core thing determining the failure modes to me, e.g. if you just show the people who give feedback the source code of the program, then the failure mode will be that often the program will just immediately crash or maybe not even compile. Meanwhile if you show people the running program then that cannot be a failure mode.
If you agree that generally the failure modes are determined by the feedback process but somehow deceptive misalignment is an exception to the rule that the feedback process determines the failures then I don’t see the justification for that and would like that addressed explicitly.
Deceptive alignment argues that even if you gave a reward signal that resulted in the model appearing to be aligned and competent, it could develop a proxy goal instead and actively trick you into thinking that it is aligned so it can escape later and seize power. I’m explicitly not addressing other failure modes in this post.
What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don’t know how to code? Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?
I’m explicitly not addressing other failure modes in this post.
Yes, I know, I gave the other failure modes as an example. The thing that confuses me is that you are saying that the (IMO) central piece in the AI algorithm doesn’t really matter for the purposes of your post.
What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don’t know how to code?
Yes
Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?
It’s not meant as an example of deceptive misalignment, it’s meant as an example of how alignment failures by-default depend absolutely massively on the way you train your AI. Like if you train your AI in a different way, you get different failures. So it seems like a strange prior to me to assume that you will get the same results wrt deceptive alignment regardless of how you train it.
That the specific ways people give feedback isn’t very relevant. It seems like the core thing determining the failure modes to me, e.g. if you just show the people who give feedback the source code of the program, then the failure mode will be that often the program will just immediately crash or maybe not even compile. Meanwhile if you show people the running program then that cannot be a failure mode.
If you agree that generally the failure modes are determined by the feedback process but somehow deceptive misalignment is an exception to the rule that the feedback process determines the failures then I don’t see the justification for that and would like that addressed explicitly.
Deceptive alignment argues that even if you gave a reward signal that resulted in the model appearing to be aligned and competent, it could develop a proxy goal instead and actively trick you into thinking that it is aligned so it can escape later and seize power. I’m explicitly not addressing other failure modes in this post.
What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don’t know how to code? Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?
Yes, I know, I gave the other failure modes as an example. The thing that confuses me is that you are saying that the (IMO) central piece in the AI algorithm doesn’t really matter for the purposes of your post.
Yes
It’s not meant as an example of deceptive misalignment, it’s meant as an example of how alignment failures by-default depend absolutely massively on the way you train your AI. Like if you train your AI in a different way, you get different failures. So it seems like a strange prior to me to assume that you will get the same results wrt deceptive alignment regardless of how you train it.