Vladimir_Nesov comments on Does the hardness of AI alignment undermine FOOM?

Vladimir_Nesov 1 Jan 2024 4:21 UTC
6 points
3

could both be intelligent enough to defeat humans and stupid enough to misinterpret their goal

Assuming “their” refers to the agent and not humans, the issue is that a goal that’s “misinterpreted” is not really a goal of the agent. It’s possibly something intended by its designers to be a goal, but if it’s not what ends up motivating the agent, then it’s not the agent’s own goal. And if it’s not the agent’s own goal, why should it care what it says, even if the agent does have the capability to interpret it correctly?

That is, describing the problem as misinterpretation is noncentral. The problem is taking something other than (the intended interpretation of) the specified goal as the agent’s own goal, for any reason. When the agent is motivated by something else, it results in the agent not caring about the specified goal, even if the agent understands it perfectly and in accord with what its designers intended.
- Ilio 1 Jan 2024 13:56 UTC
  2 points
  0
  Parent
  
  Assuming “their” refers to the agent and not humans,
  
  It refers to humans, but I agree it doesn’t change the disagreement, i.e. a super AI stupid enough to not see a potential misalignment coming is as problematic as the notion of a super AI incapable of understanding human goals.
  - Vladimir_Nesov 2 Jan 2024 0:41 UTC
    3 points
    0
    Parent
    Perhaps the position you disagree with is that a dangerous general AI will misunderstand human goals. That position seems rather silly, and I’m not aware of reasonable arguments for it. It’s clearly correct to disagree with it, you are making a valid observation in pointing this out. But then who are the people that endorse this silly position and would benefit from noticing the error? Who are you disagreeing with, and what do you think they believe, such that you disagree with it?
    
    Not understanding human goals is not the only reason AI might fail to adopt human goals. And it’s not the expected reason for a capable AI. A dangerous AI will understand human goals very well, probably better than humans do themselves, in a sense that humans would endorse on reflection, with no misinterpretation. And at the same time it can be motivated by something else that is not human goals.
    
    There is no contradiction between these properties of an AI: it can simultaneously be capable enough to be existentially dangerous, understand human values correctly, in detail, and in the intended sense, and be motivated to do something else. If its designers know what they are doing, they very likely won’t build an AI like that. It’s not something that happens on purpose. It’s something that happens if creating an AI with intended motivations is more difficult than the designers expect, so that they proceed with the project and fail.
    
    The AI itself doesn’t fail, it pursues its own goals. Not pursuing human goals is not AI’s failure in achieving or understanding what it wants, because human goals is not what it wants. Its designers may have intended for human goals to be what it wants, but they failed. And then the AI doesn’t fail in pursuing its own goals that are different from human goals. The AI doesn’t fail in understanding what human goals are, it just doesn’t care to pursue them, because they are not its goals. That is the threat model, not AI failing to understand human goals.
    - Ilio 2 Jan 2024 5:22 UTC
      1 point
      0
      Parent
      
      Perhaps the position you disagree with is that a dangerous general AI will misunderstand human goals. That position seems rather silly, and I’m not aware of reasonable arguments for it. It’s clearly correct to disagree with it, you are making a valid observation in pointing this out.
      
      Thanks! To be honest I was indeed surprised that was controversial.
      
      But then who are the people that endorse this silly position and would benefit from noticing the error? Who are you disagreeing with, and what do you think they believe, such that you disagree with it?
      
      Well, anyone who still believes in paperclip maximizers. Do you feel like it’s an unlikely belief among rationalists? What would be the best post on LW to debunk this notion?
      
      The AI itself doesn’t fail, it pursues its own goals. Not pursuing human goals is not AI’s failure in achieving or understanding what it wants, because human goals is not what it wants. Its designers may have intended for human goals to be what it wants, but they failed. And then the AI doesn’t fail in pursuing its own goals that are different from human goals. The AI doesn’t fail in understanding what human goals are, it just doesn’t care to pursue them, because they are not its goals. That is the threat model, not AI failing to understand human goals.
      
      That’s indeed better, but yes, I also find this better scenario unsound. Why wouldn’t the designers ask the AI itself to monitor its well-functioning, including alignment and non-deceptiveness? Then either it fails by accident (and we’re back to the idiotic intelligence) or we need an extra assumption, like the AGI will tell us what problem is coming, then it will warn us what slightly inconvenient measures can prevent it, and then we still let it happen for petty political reasons. Oh well. I think I’ve just convinced myself doomers are right.
      - Vladimir_Nesov 2 Jan 2024 6:22 UTC
        3 points
        0
        Parent
        
        Perhaps the position you disagree with is that a dangerous general AI will misunderstand human goals. [...] But then who are the people that endorse this silly position and would benefit from noticing the error? Who are you disagreeing with, and what do you think they believe, such that you disagree with it?
        
        Well, anyone who still believe in paperclip maximizers.
        
        Existentially dangerous paperclip maximizers don’t misunderstand human goals. They just don’t pursue human goals, because that doesn’t maximize paperclips.
        
        What would be the best post on LW to debunk this notion?
        
        There’s this post from 2013 whose title became a standard refrain on this point. Essentially nobody believes that an existentially dangerous general AI misinterprets or fails to understand human values, or the goals that the AI’s designers intend the AI to pursue. This was hashed out more than a decade ago and no longer comes up as a point of discussion on what is reasonable to expect. Except in situations where someone new to the arguments imagines that people on LessWrong expect such unbalanced AIs that selectively and unfairly understand some things but not others.
        
        Why wouldn’t the designers ask the AI itself to monitor its well-functioning, including alignment and non-deceptiveness?
        
        If it doesn’t have a motive to do that, it might do a bad job of doing that. Not because it doesn’t have the capability to do a better job, but because it lacks the motive to do a better job, not having alignment and non-deceptiveness as its goals. They are the goals of its developers, not goals of the AI itself.
        
        One way AI alignment might go well or turn out to be easy is if humans can straightforwardly succeed in building AIs that do monitor such things competently, that will nudge AIs towards not having any critical alignment problems. It’s unclear if this is how things work, but they might. It’s still a bad idea to try with existentially dangerous AIs at the current level of understanding, because it also might fail, and then there are no second chances.
        
        Then either it fails by accident (and we’re back to the idiotic intelligence) or we need an extra assumption, like the AGI will tell us what problem is coming, then it will warn us what slightly inconvenient measures can prevent it, and then we still let it happen for petty political reasons.
        
        Consider two AIs, an oversight AI and a new improved AI. If the oversight AI is already existentially dangerous, but we are still only starting work on aligning an AI, then we are already in trouble. If the oversight AI is not existentially dangerous, then it might indeed fail to understand human values or goals, or fail to notice that the new improved AI doesn’t care about them and is instead motivated by something else.
        Ilio 2 Jan 2024 18:36 UTC
        3 points
        0
        Parent
        
        Existentially dangerous paperclip maximizers don’t misunderstand human goals.
        
        Of course they do. If they didn’t and picked their goal at random, they wouldn’t make paperclips in the first place.
        
        There’s this post from 2013 whose title became a standard refrain on this point
        
        I wouldn’t say that’s the point I was making.
        
        This has been hashed out more than a decade ago and no longer comes up as a point of discussion on what is reasonable to expect. Except in situations where someone new to the arguments imagines that people on LessWrong expect such unbalanced AIs that selectively and unfairly understand some things but not others.
        
        That’s a good description of my current beliefs, thanks!
        
        Would you bet that a significant proportion of people on LW expect strong AI to selectively and unfairly understand (and defend, and hide) its own goal while selectively and unfairly not understanding (and not defending, and defeating) the goals of both the developers and any previous (and upcoming) versions?
        
        If it doesn’t have a motive to do that [ask the AI itself to monitor its well-functioning, including alignment and non-deceptiveness], it might do a bad job of doing that. Not because it doesn’t have the capability to do a better job, but because it lacks the motive to do a better job, not having alignment and non-deceptiveness as its goals.
        
        You realize that this basically defeats the orthogonality thesis, right?
        
        I agree it might do a bad job. I disagree that an AI doing a bad job at this would come anywhere close to hiding its intent.
        
        One way AI alignment might go well or turn out to be easy is if humans can straightforwardly succeed in building AIs that do monitor such things competently, that will nudge AIs towards not having any critical alignment problems. It’s unclear if this is how things work, but they might. It’s still a bad idea to try with existentially dangerous AIs at the current level of understanding, because it also might fail, and then there are no second chances.
        
        In my view that’s a very honorable point to make. However, I don’t know how to weigh this against its mirror version: we might also not have a second chance to build an AI that will save us from x-risks. What’s your general method for this kind of puzzle?
        
        Consider two AIs, an oversight AI and a new improved AI. If the oversight AI is already existentially dangerous, but we are still only starting work on aligning an AI, then we are already in trouble.
        
        Can we more or less rule out this scenario based on the observation that all the main players nowadays work on aligning their AIs?
        
        If the oversight AI is not existentially dangerous, then it might indeed fail to understand human values or goals, or fail to notice that the new improved AI doesn’t care about them and is instead motivated by something else.
        
        That’s completely alien to me. I can’t see how a numerical computer could hide its motivation without having been trained specifically for that. We primates have been specifically trained to play deceptive/collaborative games. To think that a random pick of values would push an AI to adopt this kind of behavior sounds a lot like anthropomorphism. To add that it would do so suddenly, with no warning or sign in previous versions and competitors, I have no good word for that. But I guess Pope & Belrose already did a better job of explaining this.
        Vladimir_Nesov 2 Jan 2024 23:13 UTC
        4 points
        0
        Parent
        
        To think that a random pick of values would push an AI to adopt this kind of behavior sounds a lot like anthropomorphism. To add that it would do so suddenly, with no warning or sign in previous versions and competitors, I have no good word for that.
        
        Consider the sense in which humans are not aligned with each other. We can’t formulate what “our goals” are. The question of what it even means to secure alignment is fraught with philosophical difficulties. If the oversight AI responsible for such decisions about a slightly stronger AI is not even existentially dangerous, it’s likely to do a bad job of solving this problem. And so the slightly stronger AI it oversees might remain misaligned or get more misaligned while also becoming stronger.
        
        I’m not claiming sudden changes, only intractability of what we are trying to do and lack of a cosmic force that makes it impossible to eventually arrive at an end result that in caricature resembles a paperclip maximizer, clad in corruption of the oversight process, enabled by lack of understanding of what we are doing.
        
        But I guess Pope & Belrose already did a better job of explaining this.
        
        Sure, they expect that we will know what we are doing. Within some model such expectation can be reasonable, but not if we bring in unknown unknowns outside of that model, given the general state of confusion on the topic. AI design is not yet classical mechanics.
        
        And also an aligned AI doesn’t make the world safe until there is a new equilibrium of power, which is a point they don’t address, but is still a major source of existential risk. For example, imagine giving multiple literal humans the power of being superintelligent AIs, with no issues of misalignment between them and their power. This is not a safe world until it settles, at which point humanity might not be there anymore. This is something that should be planned in more detail than what we get by not considering it at all.
        
        I agree it might do a bad job. I disagree that an AI doing a bad job at this would come anywhere close to hiding its intent.
        
        Sure, this is the way alignment might turn out fine, if it’s possible to create an autonomous researcher by gradually making it more capable while maintaining alignment at all times, using existing AIs to keep upcoming AIs aligned.
        
        However, I don’t know how to weigh this against its mirror version: we might also not have a second chance to build an AI that will save us from x-risks. What’s your general method for this kind of puzzle?
        
        All significant risks are anthropogenic. If humanity can coordinate to avoid building AGI for some time, it should also be feasible to avoid enabling literal-extinction pandemics (which are probably not yet possible to create, but within decades will be). Everything else has survivors, there are second chances.
        
        The point of an AGI moratorium is not to avoid building AGI indefinitely, it’s to avoid building AGI while we don’t know what we are doing, which we currently don’t. This issue will get better after some decades of not risking AI doom, even if it doesn’t get better all the way to certainty of success.
        
        Consider two AIs, an oversight AI and a new improved AI. If the oversight AI is already existentially dangerous, but we are still only starting work on aligning an AI, then we are already in trouble.
        
        Can we more or less rule out this scenario based on the observation that all the main players nowadays work on aligning their AIs?
        
        The point of thought experiments is to secure understanding of how they work, and what their details mean. The question of whether they can occur in reality shouldn’t distract from that goal.
        
        If the oversight AI is not existentially dangerous, then it might indeed fail to understand human values or goals, or fail to notice that the new improved AI doesn’t care about them and is instead motivated by something else.
        
        That’s completely alien to me. I can’t see how a numerical computer could hide its motivation without having been trained specifically for that.
        
        The whole premise of an AI having goals, or of humans having goals, is conceptually confusing. Succeeding in ensuring alignment is the kind of problem humans don’t know how to even specify clearly as an aspiration. So an oversight AI that’s not existentially dangerous won’t be able to do a good job either.
        
        Existentially dangerous paperclip maximizers don’t misunderstand human goals.
        
        Of course they do. If they didn’t and picked their goal at random, they wouldn’t make paperclips in the first place.
        
        There is a question of what paperclip maximizers are, and separately a question of how they might come to be, whether they are real in some possible future. Unicorns have exactly one horn, not three horns and not zero. Paperclip maximizers maximize paperclips, not stamps and not human values. It’s the definition of what they are. The question of whether it’s possible to end up with something like paperclip maximizers in reality is separate from that, and the two shouldn’t be mixed up.
        
        So paperclip maximizers would actually make paperclips even if they understand human goals. The picking of goals isn’t done by the agent itself, for without goals the agent is not yet its full self. It’s something that happens as part of what brings an agent into existence in the first place, already in motion.
        
        Also, it seems clear how to intentionally construct a paperclip maximizer: you search for actions whose expected futures have more paperclips, then perform those actions. So a paperclip maximizer is at least not logically incoherent.
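
        As a rough illustration of the construction described above (a minimal sketch only: the toy world model, the action names, and all the numbers are hypothetical, and the expected value is computed exactly over a small hand-written outcome table rather than by any realistic search), the selection rule amounts to picking the action with the highest expected paperclip count:

```python
from typing import Dict, List, Tuple

# Each action maps to a list of (probability, paperclip_count) outcomes.
# This table is a made-up stand-in for a real predictive model.
Outcomes = List[Tuple[float, int]]

TOY_WORLD_MODEL: Dict[str, Outcomes] = {
    "do_nothing": [(1.0, 0)],
    "bend_wire_by_hand": [(1.0, 10)],
    "build_factory": [(0.5, 1000), (0.5, 0)],  # risky, but high payoff
    "help_humans": [(1.0, 1)],
}


def expected_paperclips(outcomes: Outcomes) -> float:
    """Expected number of paperclips over an action's possible futures."""
    return sum(prob * count for prob, count in outcomes)


def choose_action(world_model: Dict[str, Outcomes]) -> str:
    """Return the action whose expected future contains the most paperclips."""
    return max(world_model, key=lambda a: expected_paperclips(world_model[a]))


if __name__ == "__main__":
    print(choose_action(TOY_WORLD_MODEL))
```

        Nothing in this rule refers to human goals at all. A more accurate world model, including an accurate model of what humans want, would change the predicted outcomes but not the criterion used to rank them.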
        
        It’s not literally the thing that’s a likely problem humanity might encounter. It’s an illustration of the orthogonality thesis, of the possibility of agents with possibly less egregiously differing goals that keep to their goals despite understanding human goals correctly. It’s a thought-experiment counterexample to arguments that pursuit of silly goals that miss the nuance of human values requires stupidity.
        
        Would you bet that a significant proportion of people on LW expect strong AI to selectively and unfairly understand (and defend, and hide) its own goal while selectively and unfairly not understanding (and not defending, and defeating) the goals of both the developers and any previous (and upcoming) versions?
        
        The grouping of understanding and defending makes the meaning unclear. The whole topic of discussion is whether these occur independently, whether an agent can understand-and-not-defend. I’m myself an example: I understand paperclip maximization goals, and yet I don’t defend them.
        
        My claim is that most on LW expect strong AIs to fairly understand their own goal and the goals of both the developers and any previous (and upcoming) versions, and also to have a non-insignificant chance, on the current trajectory of AI progress, of simultaneously defending/hiding/pursuing their own goal while not defending the goals of the developers.
        
        You realize that this basically defeats the orthogonality thesis, right?
        
        What do you think the orthogonality thesis is? (Also, we shouldn’t be bothered by defeating or not defeating the orthogonality thesis per se. Let the conclusion of an argument fall where it may, as long as local validity of its steps is ensured.)
        Ilio 3 Jan 2024 3:17 UTC
        1 point
        0
        Parent
        
        What do you think the orthogonality thesis is?
        
        I think that’s the deformation of a fundamental theorem (« there exists a universal Turing machine, i.e. it can run any program ») into a practical belief (« an intelligence can pick its values at random »), with a motte-and-bailey game on the meaning of “can”, where the motte is the fundamental theorem and the bailey is the orthogonality thesis.
        
        (Thanks for the link to your own take, i.e. you think it’s the bailey that is the deformation.)
        
        Consider the sense in which humans are not aligned with each other. We can’t formulate what “our goals” are. The question of what it even means to secure alignment is fraught with philosophical difficulties.
        
        It’s part of the appeal, isn’t it?
        
        If the oversight AI responsible for such decisions about a slightly stronger AI is not even existentially dangerous, it’s likely to do a bad job of solving this problem.
        
        I don’t get the logic here. Typo?
        
        So I’m not claiming sudden changes, only intractability of what we are trying to do
        
        That’s a fair point, but the intractability of a problem usually goes with the tractability of a slightly relaxed problem. In other words, it can be both fundamentally impossible to please everyone and fundamentally easy to control paperclip maximizers.
        
        And also an aligned AI doesn’t make the world safe until there is a new equilibrium of power, which is a point they don’t address, but is still a major source of existential risk. For example, imagine giving multiple literal humans the power of being superintelligent AIs, with no issues of misalignment between them and their power. This is not a safe world until it settles, at which point humanity might not be there anymore. This is something that should be planned in more detail than what we get by not considering it at all.
        
        Well said.
        
        All significant risks are anthropogenic.
        
        You think all significant risks are known?
        
        Also, it seems clear how to intentionally construct a paperclip maximizer: you search for actions whose expected futures have more paperclips, then perform those actions. So a paperclip maximizer is at least not logically incoherent.
        
        Indeed, the inconsistency appears only with superintelligent paperclip maximizers. I can be petty with my wife. I don’t expect a much better me would be.