Existentially dangerous paperclip maximizers don’t misunderstand human goals.
Of course they do. If they didn’t and picked their goal at random, they wouldn’t make paperclips in the first place.
There’s this post from 2013 whose title became a standard refrain on this point.
I wouldn’t say that’s the point I was making.
This was hashed out more than a decade ago and no longer comes up as a point of discussion about what is reasonable to expect, except in situations where someone new to the arguments imagines that people on LessWrong expect such unbalanced AIs, ones that selectively and unfairly understand some things but not others.
That’s a good description of my current beliefs, thanks!
Would you bet that a significant proportion on LW expect strong AIs to selectively and unfairly understand (and defend, and hide) their own goal, while selectively and unfairly not understanding (and not defending, and defeating) the goals of both the developers and any previous (and upcoming) versions?
If it doesn’t have a motive to do that [“ask the AI itself to monitor its well-functioning, including alignment and non-deceptiveness”], it might do a bad job of it. Not because it doesn’t have the capability to do a better job, but because it lacks the motive to do a better job, not having alignment and non-deceptiveness as its goals.
You realize that this basically defeats the orthogonality thesis, right?
I agree it might do a bad job. I disagree that an AI doing a bad job on this would be anywhere close to hiding its intent.
One way AI alignment might go well or turn out to be easy is if humans can straightforwardly succeed in building AIs that do monitor such things competently, that will nudge AIs towards not having any critical alignment problems. It’s unclear if this is how things work, but they might. It’s still a bad idea to try with existentially dangerous AIs at the current level of understanding, because it also might fail, and then there are no second chances.
In my view that’s a very honorable point to make. However, I don’t know how to weigh it against its mirror version: we might also not have a second chance to build an AI that will save us from x-risks. What’s your general method for this kind of puzzle?
Consider two AIs, an oversight AI and a new improved AI. If the oversight AI is already existentially dangerous, but we are still only starting work on aligning an AI, then we are already in trouble.
Can we more or less rule out this scenario, based on the observation that all the main players nowadays work on aligning their AIs?
If the oversight AI is not existentially dangerous, then it might indeed fail to understand human values or goals, or fail to notice that the new improved AI doesn’t care about them and is instead motivated by something else.
That’s completely alien to me. I can’t see how a numerical computer could hide its motivation without having been trained specifically for that. We primates have been specifically trained to play deceptive/collaborative games. To think that a randomly picked value would push an AI to adopt this kind of behavior sounds a lot like anthropomorphism. To add that it would do so suddenly, with no warning or sign in previous versions and competitors, I have no good word for that. But I guess Pope & Belrose already did a better job explaining this.
To think that a randomly picked value would push an AI to adopt this kind of behavior sounds a lot like anthropomorphism. To add that it would do so suddenly, with no warning or sign in previous versions and competitors, I have no good word for that.
Consider the sense in which humans are not aligned with each other. We can’t formulate what “our goals” are. The question of what it even means to secure alignment is fraught with philosophical difficulties. If the oversight AI responsible for such decisions about a slightly stronger AI is not even existentially dangerous, it’s likely to do a bad job of solving this problem. And so the slightly stronger AI it oversees might remain misaligned or get more misaligned while also becoming stronger.
I’m not claiming sudden changes, only intractability of what we are trying to do and lack of a cosmic force that makes it impossible to eventually arrive at an end result that in caricature resembles a paperclip maximizer, clad in corruption of the oversight process, enabled by lack of understanding of what we are doing.
But I guess Pope & Belrose already did a better job explaining this.
Sure, they expect that we will know what we are doing. Within some model such expectation can be reasonable, but not if we bring in unknown unknowns outside of that model, given the general state of confusion on the topic. AI design is not yet classical mechanics.
And also an aligned AI doesn’t make the world safe until there is a new equilibrium of power, which is a point they don’t address, but is still a major source of existential risk. For example, imagine giving multiple literal humans the power of being superintelligent AIs, with no issues of misalignment between them and their power. This is not a safe world until it settles, at which point humanity might not be there anymore. This is something that should be planned in more detail than what we get by not considering it at all.
I agree it might do a bad job. I disagree that an AI doing a bad job on this would be anywhere close to hiding its intent.
Sure, this is the way alignment might turn out fine, if it’s possible to create an autonomous researcher by gradually making it more capable while maintaining alignment at all times, using existing AIs to keep upcoming AIs aligned.
However, I don’t know how to weigh it against its mirror version: we might also not have a second chance to build an AI that will save us from x-risks. What’s your general method for this kind of puzzle?
All significant risks are anthropogenic. If humanity can coordinate to avoid building AGI for some time, it should also be feasible to avoid enabling literal-extinction pandemics (which are probably not yet possible to create, but within decades will be). Everything else has survivors, there are second chances.
The point of an AGI moratorium is not to avoid building AGI indefinitely, it’s to avoid building AGI while we don’t know what we are doing, which we currently don’t. This issue will get better after some decades of not risking AI doom, even if it doesn’t get better all the way to certainty of success.
Consider two AIs, an oversight AI and a new improved AI. If the oversight AI is already existentially dangerous, but we are still only starting work on aligning an AI, then we are already in trouble.
Can we more or less rule out this scenario, based on the observation that all the main players nowadays work on aligning their AIs?
The point of thought experiments is to secure understanding of how they work, and what their details mean. The question of whether they can occur in reality shouldn’t distract from that goal.
If the oversight AI is not existentially dangerous, then it might indeed fail to understand human values or goals, or fail to notice that the new improved AI doesn’t care about them and is instead motivated by something else.
That’s completely alien to me. I can’t see how a numerical computer could hide its motivation without having been trained specifically for that.
The whole premise of an AI having goals, or of humans having goals, is conceptually confusing. Succeeding in ensuring alignment is the kind of problem humans don’t know how to even specify clearly as an aspiration. So an oversight AI that’s not existentially dangerous won’t be able to do a good job either.
Existentially dangerous paperclip maximizers don’t misunderstand human goals.
Of course they do. If they didn’t and picked their goal at random, they wouldn’t make paperclips in the first place.
There is a question of what paperclip maximizers are, and separately a question of how they might come to be, whether they are real in some possible future. Unicorns have exactly one horn, not three horns and not zero. Paperclip maximizers maximize paperclips, not stamps and not human values. It’s the definition of what they are. The question of whether it’s possible to end up with something like paperclip maximizers in reality is separate from that and shouldn’t be mixed up.
So paperclip maximizers would actually make paperclips even if they understand human goals. The picking of goals isn’t done by the agent itself, for without goals the agent is not yet its full self. It’s something that happens as part of what brings an agent into existence in the first place, already in motion.
Also, it seems clear how to intentionally construct a paperclip maximizer: you search for actions whose expected futures have more paperclips, then perform those actions. So a paperclip maximizer is at least not logically incoherent.
It’s not literally the thing that’s a likely problem humanity might encounter. It’s an illustration of the orthogonality thesis, of the possibility of agents, perhaps with less egregiously differing goals, that keep to their goals despite understanding human goals correctly. It’s a thought-experiment counterexample to arguments that pursuit of silly goals that miss the nuance of human values requires stupidity.
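The construction described above (search for actions whose expected futures have more paperclips, then perform those actions) can be sketched as a toy expected-utility search. This is purely illustrative: the world model, action names, and sampling scheme are all made up for the example, standing in for the actual (unknown) machinery of such an agent.

```python
import random

# Hypothetical world model: each action yields a distribution over
# future paperclip counts. A real agent would need a learned model
# of the world here, not a hand-written dict of toy lotteries.
WORLD_MODEL = {
    "build_factory": lambda: random.gauss(1000, 200),
    "mine_iron":     lambda: random.gauss(300, 50),
    "do_nothing":    lambda: 0.0,
}

def expected_paperclips(action, n_samples=1000):
    """Monte Carlo estimate of paperclips in futures following `action`."""
    samples = (WORLD_MODEL[action]() for _ in range(n_samples))
    return sum(samples) / n_samples

def choose_action():
    """The maximizer's entire decision rule: argmax of expected paperclips."""
    return max(WORLD_MODEL, key=expected_paperclips)

if __name__ == "__main__":
    print(choose_action())  # "build_factory", with overwhelming probability
```

Note that nothing in the decision rule mentions or needs human values, which is the point being made: the goal is a free parameter of the search, so the construction is at least logically coherent.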
Would you bet that a significant proportion on LW expect strong AIs to selectively and unfairly understand (and defend, and hide) their own goal, while selectively and unfairly not understanding (and not defending, and defeating) the goals of both the developers and any previous (and upcoming) versions?
The grouping of understanding and defending makes the meaning unclear. The whole topic of discussion is whether these occur independently, whether an agent can understand-and-not-defend. I’m myself an example: I understand paperclip maximization goals, and yet I don’t defend them.
My claim is that most on LW expect strong AIs to fairly understand their own goal and the goals of both the developers and any previous (and upcoming) versions, and also to have a non-negligible chance, on the current trajectory of AI progress, of simultaneously defending/hiding/pursuing their own goal while not defending the goals of the developers.
You realize that this basically defeats the orthogonality thesis, right?
What do you think the orthogonality thesis is? (Also, we shouldn’t be bothered by defeating or not defeating the orthogonality thesis per se. Let the conclusion of an argument fall where it may, as long as local validity of its steps is ensured.)
I think that’s the deformation of a fundamental theorem (“there exists a universal Turing machine, i.e. it can run any program”) into a practical belief (“an intelligence can pick its values at random”), with a motte-and-bailey game on the meaning of “can”, where the motte is the fundamental theorem and the bailey is the orthogonality thesis.
(Thanks for the link to your own take, i.e. you think it’s the bailey that is the deformation.)
Consider the sense in which humans are not aligned with each other. We can’t formulate what “our goals” are. The question of what it even means to secure alignment is fraught with philosophical difficulties.
It’s part of the appeal, isn’t it?
If the oversight AI responsible for such decisions about a slightly stronger AI is not even existentially dangerous, it’s likely to do a bad job of solving this problem.
I don’t get the logic here. Typo?
So I’m not claiming sudden changes, only intractability of what we are trying to do
That’s a fair point, but the intractability of a problem usually goes with the tractability of a slightly relaxed problem. In other words, it can be both fundamentally impossible to please everyone and fundamentally easy to control paperclip maximizers.
And also an aligned AI doesn’t make the world safe until there is a new equilibrium of power, which is a point they don’t address, but is still a major source of existential risk. For example, imagine giving multiple literal humans the power of being superintelligent AIs, with no issues of misalignment between them and their power. This is not a safe world until it settles, at which point humanity might not be there anymore. This is something that should be planned in more detail than what we get by not considering it at all.
Well said.
All significant risks are anthropogenic.
You think all significant risks are known?
Also, it seems clear how to intentionally construct a paperclip maximizer: you search for actions whose expected futures have more paperclips, then perform those actions. So a paperclip maximizer is at least not logically incoherent.
Indeed, the inconsistency appears only with superintelligent paperclip maximizers. I can be petty with my wife; I don’t expect a much better me would be.