I’m assuming prosaic development of TAI, using a training process like human feedback on diverse tasks (HFDT). The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations. This high-level training setup is already the default training process for text models such as ChatGPT, and this will likely continue because of the flexibility and strong performance it provides. I also expect unsupervised pre-training to be an important part of TAI development.
Presumably some of the tasks it might get feedback on are tasks like marketing, running a company, red-teaming computer security, bioengineering, writing textbooks, etc.? However I do not know what exact training/feedback setup you have in mind for these tasks. Could you expand?
Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
It’s not important what the tasks are, as long as the model is learning to complete diverse tasks by following directions.
I’m not so much asking the question of what the tasks are, and instead asking what exactly the setup would be.
For example, if I understand the paper Cotra linked to correctly, they directly showed the raters what the model’s output was and asked them to rate it. Is this also the feedback mode you are assuming in your post?
For example in order to train an AI to do advanced software development, would you show unspecialized workers in India how the model describes it would edit the code? If not, what feedback signal are you assuming?
I don’t think that the specific ways people give feedback is very relevant. This post is about deceptive misalignment, which is really about inner misalignment. Also, I’m assuming that this a process that enables TAI to emerge, especially the first time, and asking people who don’t know about a topic to give feedback probably won’t be the strategy that gets us there. Does that answer your question?
Yes but then I disagree with the assumptions underlying your post and expect things that are based on your post to be derailed by the errors that have been introduced.
That the specific ways people give feedback isn’t very relevant. It seems like the core thing determining the failure modes to me, e.g. if you just show the people who give feedback the source code of the program, then the failure mode will be that often the program will just immediately crash or maybe not even compile. Meanwhile if you show people the running program then that cannot be a failure mode.
If you agree that generally the failure modes are determined by the feedback process but somehow deceptive misalignment is an exception to the rule that the feedback process determines the failures then I don’t see the justification for that and would like that addressed explicitly.
Deceptive alignment argues that even if you gave a reward signal that resulted in the model appearing to be aligned and competent, it could develop a proxy goal instead and actively trick you into thinking that it is aligned so it can escape later and seize power. I’m explicitly not addressing other failure modes in this post.
What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don’t know how to code? Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?
I’m explicitly not addressing other failure modes in this post.
Yes, I know, I gave the other failure modes as an example. The thing that confuses me is that you are saying that the (IMO) central piece in the AI algorithm doesn’t really matter for the purposes of your post.
What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don’t know how to code?
Yes
Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?
It’s not meant as an example of deceptive misalignment, it’s meant as an example of how alignment failures by-default depend absolutely massively on the way you train your AI. Like if you train your AI in a different way, you get different failures. So it seems like a strange prior to me to assume that you will get the same results wrt deceptive alignment regardless of how you train it.
Presumably some of the tasks it might get feedback on are tasks like marketing, running a company, red-teaming computer security, bioengineering, writing textbooks, etc.? However I do not know what exact training/feedback setup you have in mind for these tasks. Could you expand?
From Ajeya Cotra’s post that I linked to:
It’s not important what the tasks are, as long as the model is learning to complete diverse tasks by following directions.
I’m not so much asking the question of what the tasks are, and instead asking what exactly the setup would be.
For example, if I understand the paper Cotra linked to correctly, they directly showed the raters what the model’s output was and asked them to rate it. Is this also the feedback mode you are assuming in your post?
For example in order to train an AI to do advanced software development, would you show unspecialized workers in India how the model describes it would edit the code? If not, what feedback signal are you assuming?
I don’t think that the specific ways people give feedback is very relevant. This post is about deceptive misalignment, which is really about inner misalignment. Also, I’m assuming that this a process that enables TAI to emerge, especially the first time, and asking people who don’t know about a topic to give feedback probably won’t be the strategy that gets us there. Does that answer your question?
Yes but then I disagree with the assumptions underlying your post and expect things that are based on your post to be derailed by the errors that have been introduced.
Which assumptions are wrong? Why?
That the specific ways people give feedback isn’t very relevant. It seems like the core thing determining the failure modes to me, e.g. if you just show the people who give feedback the source code of the program, then the failure mode will be that often the program will just immediately crash or maybe not even compile. Meanwhile if you show people the running program then that cannot be a failure mode.
If you agree that generally the failure modes are determined by the feedback process but somehow deceptive misalignment is an exception to the rule that the feedback process determines the failures then I don’t see the justification for that and would like that addressed explicitly.
Deceptive alignment argues that even if you gave a reward signal that resulted in the model appearing to be aligned and competent, it could develop a proxy goal instead and actively trick you into thinking that it is aligned so it can escape later and seize power. I’m explicitly not addressing other failure modes in this post.
What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don’t know how to code? Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?
Yes, I know, I gave the other failure modes as an example. The thing that confuses me is that you are saying that the (IMO) central piece in the AI algorithm doesn’t really matter for the purposes of your post.
Yes
It’s not meant as an example of deceptive misalignment, it’s meant as an example of how alignment failures by-default depend absolutely massively on the way you train your AI. Like if you train your AI in a different way, you get different failures. So it seems like a strange prior to me to assume that you will get the same results wrt deceptive alignment regardless of how you train it.