tailcalled comments on Deceptive Alignment is <1% Likely by Default

tailcalled 18 Apr 2023 7:51 UTC
3 points
1

I’m assuming prosaic development of TAI, using a training process like human feedback on diverse tasks (HFDT). The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations. This high-level training setup is already the default training process for text models such as ChatGPT, and this will likely continue because of the flexibility and strong performance it provides. I also expect unsupervised pre-training to be an important part of TAI development.

Presumably some of the tasks it might get feedback on are tasks like marketing, running a company, red-teaming computer security, bioengineering, writing textbooks, etc.? However I do not know what exact training/feedback setup you have in mind for these tasks. Could you expand?
- DavidW 18 Apr 2023 11:59 UTC
  0 points
  −1
  Parent
  From Ajeya Cotra’s post that I linked to:
  Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
  It’s not important what the tasks are, as long as the model is learning to complete diverse tasks by following directions.
  - tailcalled 18 Apr 2023 12:13 UTC
    3 points
    0
    Parent
    I’m not so much asking the question of what the tasks are, and instead asking what exactly the setup would be.
    
    For example, if I understand the paper Cotra linked to correctly, they directly showed the raters what the model’s output was and asked them to rate it. Is this also the feedback mode you are assuming in your post?
    
    For example in order to train an AI to do advanced software development, would you show unspecialized workers in India how the model describes it would edit the code? If not, what feedback signal are you assuming?
    - DavidW 19 Apr 2023 14:18 UTC
      0 points
      −1
      Parent
      I don’t think that the specific ways people give feedback is very relevant. This post is about deceptive misalignment, which is really about inner misalignment. Also, I’m assuming that this a process that enables TAI to emerge, especially the first time, and asking people who don’t know about a topic to give feedback probably won’t be the strategy that gets us there. Does that answer your question?
      - tailcalled 19 Apr 2023 14:24 UTC
        3 points
        1
        Parent
        
        Does that answer your question?
        
        Yes but then I disagree with the assumptions underlying your post and expect things that are based on your post to be derailed by the errors that have been introduced.
        DavidW 19 Apr 2023 14:30 UTC
        1 point
        0
        Parent
        Which assumptions are wrong? Why?
        tailcalled 19 Apr 2023 15:09 UTC
        3 points
        1
        Parent
        That the specific ways people give feedback isn’t very relevant. It seems like the core thing determining the failure modes to me, e.g. if you just show the people who give feedback the source code of the program, then the failure mode will be that often the program will just immediately crash or maybe not even compile. Meanwhile if you show people the running program then that cannot be a failure mode.
        
        If you agree that generally the failure modes are determined by the feedback process but somehow deceptive misalignment is an exception to the rule that the feedback process determines the failures then I don’t see the justification for that and would like that addressed explicitly.
        DavidW 19 Apr 2023 15:24 UTC
        1 point
        0
        Parent
        Deceptive alignment argues that even if you gave a reward signal that resulted in the model appearing to be aligned and competent, it could develop a proxy goal instead and actively trick you into thinking that it is aligned so it can escape later and seize power. I’m explicitly not addressing other failure modes in this post.
        What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don’t know how to code? Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?
        tailcalled 19 Apr 2023 21:12 UTC
        2 points
        0
        Parent
        I’m explicitly not addressing other failure modes in this post.
        Yes, I know, I gave the other failure modes as an example. The thing that confuses me is that you are saying that the (IMO) central piece in the AI algorithm doesn’t really matter for the purposes of your post.
        What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don’t know how to code?
        Yes
        Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?
        It’s not meant as an example of deceptive misalignment, it’s meant as an example of how alignment failures by-default depend absolutely massively on the way you train your AI. Like if you train your AI in a different way, you get different failures. So it seems like a strange prior to me to assume that you will get the same results wrt deceptive alignment regardless of how you train it.