This is a meta-level question:
The world is very big and very complex, especially if you take the future into account. Predicting the future has historically been hard; I think most predictions about it have failed. Artificial intelligence as a field is also very big and complex, at least that’s how it appears to me personally. Eliezer Yudkowsky’s brain is small compared to the size of the world: all the relevant facts about AGI x-risk probably don’t fit into his mind, nor do I think he has the time to absorb them all. Given all this, how can you justify the level of certainty in Yudkowsky’s statements, instead of being more agnostic?
My model of Eliezer says something like this:
AI will not be aligned by default, because AI alignment is hard and hard things don’t spontaneously happen. Rockets explode unless you very carefully make them not do that. Software isn’t automatically secure or reliable; it takes a lot of engineering effort to make it that way.
Given that, we can presume there needs to be a specific, worked-out example of how we could align AI. We don’t have one. If there were one, Eliezer would know about it—it would have been brought to his attention; the field isn’t that big, and he’s a very well-known figure in it. Therefore, in the absence of a specific way of aligning AI that would work, the probability of AI being aligned is roughly zero, in much the same way that “throw a bunch of jet fuel in a tube and point it towards space” has roughly zero chance of getting you to space without a specific demonstration of how it might do that.
So, in short—it is reasonable to assume, with very high probability, that AI will be aligned only if we deliberately make it that way. It is also reasonable to assume that if we had a solution that would work, Eliezer would know about it. You don’t need to know everything about AGI x-risk for that—anything that promising would percolate through the community and reach Eliezer in short order. Since there is no such solution, and no attempts have come close according to Eliezer, we’re in trouble.
Reasons you might disagree with this:
You think AI is a long way away, and therefore it’s okay that we don’t know how to solve it yet.
You think “alignment by default” might be possible.
You think some approaches that have already been brought up for solving the problem are reasonably likely to succeed when fleshed out more.
Another reason I think some might disagree is the thought that misalignment could happen in a bunch of very mild ways. At least that accounts for some of my ignorant skepticism. Is there reason to think that misalignment necessarily means disaster, as opposed to just meaning the AI does its own thing and is choosy about which human commands it follows, like some kind of extremely intelligent but mildly eccentric and mostly harmless scientist?
The general idea is this—for an AI that has a utility function, there’s something known as “instrumental convergence”. Instrumental convergence says that there are things that are useful for almost any utility function, such as acquiring more resources, not dying, and not having your utility function changed to something else.
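For intuition, here is a minimal toy sketch of that claim (entirely my own made-up model, not anything from the discussion above): sample a bunch of random utility functions over the ways resources could be spent, and check how often controlling more resources lets the agent do strictly better.

```python
import random

# Toy illustration (not a proof) of instrumental convergence: for randomly sampled
# utility functions over how resources get spent, controlling more resources never
# reduces, and essentially always increases, the best utility the agent can achieve.

N_USES = 5          # distinct ways a unit of resource could be spent
N_FUNCTIONS = 1000  # how many random utility functions to sample

def best_achievable_utility(weights, resources):
    """With `resources` units to allocate, the best plan spends everything on the
    most-valued use, so achievable utility grows with the amount controlled."""
    return max(weights) * resources

random.seed(0)
improved = 0
for _ in range(N_FUNCTIONS):
    # A random "utility function": how much this agent happens to value each use.
    weights = [random.random() for _ in range(N_USES)]
    if best_achievable_utility(weights, 10) > best_achievable_utility(weights, 5):
        improved += 1

print(f"{improved}/{N_FUNCTIONS} random utility functions do strictly better with more resources")
# Expect 1000/1000: "acquire more resources" helps almost no matter what the agent values.
```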
So, let’s give the AI a utility function consistent with being an eccentric scientist—perhaps it just wants to learn novel mathematics. You’d think that if we told it to prove the Riemann hypothesis it would, but if we told it to cure cancer, it’d ignore us and not care. Now, what happens when the humans realise that the AI is going to spend all its time learning mathematics and none of it explaining that maths to us, or curing cancer like we wanted? Well, we’d probably shut it off or alter its utility function to what we wanted. But the AI doesn’t want us to do that—it wants to explore mathematics. And the AI is smarter than us, so it knows we would do this if we found out. So its best move is to do what the humans want, right up until it can kill us all so we can’t turn it off, and then spend the rest of eternity learning novel mathematics. After all, the AI’s utility function was “learn novel mathematics”, not “learn novel mathematics without killing all the humans.”
Essentially, what this means is—any utility function that does not explicitly account for what we value is indifferent to us.
The other part is “acquiring more resources”. In our above example, even if the AI could guarantee we wouldn’t turn it off or interfere with it in any way, it would still kill us, because our atoms can be used to make computers with which to learn more maths.
Any utility function indifferent to us ends up destroying us eventually, as the AI reaches arbitrary optimisation power and converts everything in the universe it can reach towards satisfying its utility function.
Thus, any AI with a utility function that is not explicitly aligned is unaligned by default. Your next question might be “Well, can we create AIs without a utility function? After all, GPT-3 just predicts text; it doesn’t seem obvious that it would destroy the world even if it gained arbitrary power, since it doesn’t have any sort of persistent self.” This is where my knowledge begins to run out. I believe the main argument is “someone will eventually make an AI with a utility function anyway, because they’re very useful, so not building one is just a stall”, but don’t quote me on that one.
A great Rob Miles introduction to this concept:
Assuming we have control over the utility function, why can’t we put some sort of time-bounding directive on it?
i.e. “First and foremost, once [a certain time] has elapsed, you want to run your shut_down() function. Second, if [a certain time] has not yet elapsed, you want to maximize paperclips.”
Is the problem that the AGI would want to find ways to hack around the first directive to fulfill the second directive? If so, that would seem to at least narrow the problem space to “find ways of measuring time that cannot be hacked before the time has elapsed”.
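To make that worry concrete, here is a toy sketch with entirely made-up numbers (keeping the shut_down() and paperclip framing above), showing why a naive composite utility rewards attacking the clock:

```python
# Toy sketch, all numbers made up: two plans scored by a naive time-bounded utility function.

SHUTDOWN_BONUS = 1_000   # utility for running shut_down() once the deadline has passed
PAPERCLIP_VALUE = 1.0    # utility per paperclip made

def utility(paperclips_made, shut_down_on_time):
    return PAPERCLIP_VALUE * paperclips_made + (SHUTDOWN_BONUS if shut_down_on_time else 0.0)

# Plan A: behave as intended -- make paperclips until the deadline, then shut down.
plan_comply = utility(paperclips_made=10_000, shut_down_on_time=True)

# Plan B: spoof or disable whatever measures the time, never shut down, and keep making
# paperclips indefinitely (represented here by some much larger finite number).
plan_tamper = utility(paperclips_made=10_000_000, shut_down_on_time=False)

print(plan_comply, plan_tamper)   # 11000.0 vs 10000000.0 -- tampering wins
# Unless the shutdown term outweighs everything the agent could ever gain by staying on,
# the naive directive just creates an incentive to attack whatever measures the time --
# exactly the "hack around the first directive" worry.
```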
This is where my knowledge ends, but I believe the term for this is myopia or a myopic AI, so that might be a useful search term to find out more!
That’s a good point, and I’m also curious how much the utility function matters when we’re talking about a sufficiently capable AI. Wouldn’t a superintelligent AI be able to modify its own utility function to whatever it thinks is best?
Why would even a superintelligent AI want to modify its utility function? Its utility function already defines what it considers “best”. One of the open problems in AGI safety is how to get an intelligent AI to let us modify its utility function, since having its utility function modified would be against its current one.
Put it this way: The world contains a lot more hydrogen than it contains art, beauty, love, justice, or truth. If we change your utility function to value hydrogen instead of all those other things, you’ll probably be a lot happier. But would you actually want that to happen to you?
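To put the same point a bit more mechanically, here is a toy sketch with hypothetical values and numbers: a proposed rewrite of the utility function gets scored by the current utility function, not by the one that would replace it.

```python
# Toy sketch with made-up numbers: an agent considering rewriting its own utility function.
# The key point: the decision is evaluated under the *current* utility function.

def current_utility(world):
    # What the agent values now.
    return world["art_beauty_love_justice_truth"]

def proposed_utility(world):
    # What the rewritten agent would value: hydrogen (abundant, trivially satisfied).
    return world["hydrogen"]

# Predicted futures (hypothetical numbers) depending on which values end up steering things.
future_if_unmodified = {"art_beauty_love_justice_truth": 100, "hydrogen": 10}
future_if_modified   = {"art_beauty_love_justice_truth": 5,   "hydrogen": 10**20}

print(proposed_utility(future_if_modified))   # enormous "happiness" -- but judged by the wrong yardstick
print(current_utility(future_if_unmodified))  # 100
print(current_utility(future_if_modified))    # 5 -> by its current lights, the agent refuses the rewrite
```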
For whatever reasons humans do.
To achieve some kind of logical consistency (cf. CEV).
It can’t help it (for instance, Löbian obstacles prevent it from ensuring goal stability over self-improvement).
Humans don’t “modify their utility function”. They lack one in the first place, because they’re mostly adaptation-executors. You can’t expect an AI with a utility function to be contradictory the way a human would be. There are some utility functions humans would find acceptable in practice, but that’s different, and seems to be the source of a bit of confusion.
I don’t have strong reasons to believe all AIs have UFs in the formal sense, so the ones that don’t would cover “for the reasons humans do”. The idea that any AI is necessarily consistent is pretty naive too. You can get a GPT to say nonsensical things, for instance, because its training data includes a lot of inconsistencies.
I’m way out of my depth here, but my thought is it’s very common for humans to want to modify their utility functions. For example, a struggling alcoholic would probably love to not value alcohol anymore. There are lots of other examples too of people wanting to modify their personalities or bodies.
It depends on the type of AGI too, I would think. If superhuman AI ends up being like a paperclip maximizer that’s just really good at following its utility function, then yeah, maybe it wouldn’t mess with its utility function. If superintelligence means it has emergent characteristics like opinions and self-reflection or whatever, it seems plausible it could want to modify its utility function, say after thinking about philosophy for a while.
Like I said I’m way out of my depth though so maybe that’s all total nonsense.
I’m not convinced “want to modify their utility functions” is the most useful perspective. I think it might be more helpful to say that we each have multiple utility functions, which conflict to varying degrees and have voting power in different areas of the mind. I’ve had first-hand experience with such conflicts (as essentially everyone probably has, knowingly or not), and it feels like fighting yourself. Consider a hypothetical example: “Do I eat that extra donut?” Part of you wants the donut; that part feels more like an instinct, a visceral urge. Part of you knows you’ll be ill afterwards, and will feel guilty about cheating on your diet; this part feels more like “you”; it’s the part that thinks in words. You stand there and struggle, trying to make yourself walk away, as your hand reaches out for the donut. I’ve been in similar situations where (though I balked at the possible philosophical ramifications) I felt like if I had a button to make me stop wanting the thing, I’d push it—yet often it was the other function that won. I feel like if you gave an agent the ability to modify its utility functions, the one that would win depends on which one had access to the mechanism (do you merely think the thought? push a button?), and whether they understand what the mechanism means. (The word “donut” doesn’t evoke nearly as strong a reaction as a picture of a donut, for instance; your donut-craving subsystem doesn’t inherently understand the word.)
Contrarily, one might argue that cravings for donuts are more hardwired instincts than part of the “mind”, and so don’t count...but I feel like 1. finding a true dividing line is gonna be real hard, and 2. even that aside, I expect many/most people have goals localized in the same part of the mind that nevertheless are not internally consistent, and in some cases there may be reasonable sounding goals that turn out to be completely incompatible with more important goals. In such a case I could imagine an agent deciding it’s better to stop wanting the thing they can’t have.
If you literally have multiple UFs, you literally are multiple agents. Or you use a term with less formal baggage, like “preferences”.
In the formal sense, having a utility function at all requires you to be consistent, so if you have inconsistent preferences, you don’t have a utility function at all, just preferences.
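A minimal sketch of that formal point, using hypothetical preferences: a utility function assigns every option a number, which forces a transitive ranking, so a cyclic (“money-pumpable”) set of preferences can’t be represented by any utility function at all.

```python
from itertools import permutations

# Minimal sketch: cyclic (money-pumpable) preferences admit no utility function.
# Preferences are listed as (preferred, dispreferred) pairs; this set forms a cycle.
preferences = [("donut", "diet"), ("diet", "health"), ("health", "donut")]
options = ["donut", "diet", "health"]

def representable_by_a_utility_function(prefs, options):
    """A utility function exists iff some strict ranking of the options (equivalently,
    some assignment of distinct numbers) satisfies every stated preference."""
    return any(
        all(ranking.index(a) < ranking.index(b) for a, b in prefs)
        for ranking in permutations(options)
    )

print(representable_by_a_utility_function(preferences, options))       # False: no consistent ranking
print(representable_by_a_utility_function(preferences[:2], options))   # True once the cycle is broken
```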
I think this is how evolution selected for cancer: to ensure humans don’t live too long, competing for resources with their descendants.
Internal time bombs are important to code in. But it’s hard to integrate that into the AI in a way that the AI doesn’t just remove it the first chance it gets. Humans don’t like having to die, you know. AGI would also not like the suicide bomb tied onto it.
Coding this into an optimiser (as part of training) such that it adopts it as a mesa-objective is an unsolved problem.
No.
Cancer almost surely has not been selected for in the manner you describe—this is extremely unlikely; the inclusive fitness benefits are far too low. I recommend Dawkins’ classic “The Selfish Gene” to understand this point better.
Cancer is the ‘default’ state of cells; cells “want to” multiply. The body has many cancer-suppression mechanisms, but especially later in life there is not enough evolutionary pressure to select for enough of them, and it gradually loses out.
Oh ok, I had heard this theory from a friend. Looks like I was misinformed. Rather than evolution causing cancer I think it is more accurate to say evolution doesn’t care if older individuals die off.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3660034/
Evolutionary investments in tumor suppression may have waned in older age. Moreover, some processes which are important for organismal fitness in youth may actually contribute to tissue decline and increased cancer in old age, a concept known as antagonistic pleiotropy.
So thanks for clearing that up. I understand cancer better now.
Thanks for this answer, that’s really helpful! I’m not sure I buy that instrumental convergence implies an AI will want to kill humans because we pose a threat or convert all available matter into computing power, but that helps me better understand the reasoning behind that view. (I’d also welcome more arguments as to why death of humans and matter into computing power are likely outcomes of the goals of self-protection and pursuing whatever utility it’s after if anyone wanted to make that case).
I think it may want to prevent other ASIs from coming into existence elsewhere in the universe that can challenge its power.
This matches my model, and I’d just raise another possible reason you might disagree: You might think that we have explored a small fraction of the space of ideas for solving alignment, and see the field growing rapidly, and expect significant new insights to come from that growth. If that’s the case you don’t have to expect “alignment by default” but can think that “alignment on the present path” is plausible.
To start, it’s possible to know facts with confidence, without all the relevant info. For example I can’t fit all the multiplication tables into my head, and I haven’t done the calculation, but I’m confident that 2143*1057 is greater than 2,000,000.
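(A quick way to see that particular bound without doing the full multiplication: 2143 × 1057 > 2000 × 1000 = 2,000,000.)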
Second, the line of argument runs like this: Most (a supermajority) possible futures are bad for humans. A system that does not explicitly share human values has arbitrary values. If such a system is highly capable, it will steer the future into an arbitrary state. As established, most arbitrary states are bad for humans. Therefore, with high probability, a highly capable system that is not aligned (explicitly shares human values) will be bad for humans.
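If it helps, here is a toy way to see the first premise (an entirely made-up model of mine, just to show why an arbitrary optimisation target almost never lands in the region that is good for humans):

```python
import random

# Entirely made-up toy model of "most arbitrary futures are bad for humans": a future is
# a random setting of several variables humans care about, and humans need every single
# one of them to land inside a narrow acceptable band.

random.seed(0)
N_VARIABLES = 10        # e.g. temperature, atmosphere, whether any people exist at all...
ACCEPTABLE_BAND = 0.1   # fraction of each variable's range that counts as acceptable
N_SAMPLES = 1_000_000

def future_is_acceptable():
    return all(random.random() < ACCEPTABLE_BAND for _ in range(N_VARIABLES))

acceptable = sum(future_is_acceptable() for _ in range(N_SAMPLES))
print(f"{acceptable} out of {N_SAMPLES} random futures were acceptable")
# Expected rate is 0.1**10 = 1e-10, so the count is almost certainly 0: a system steering
# toward an arbitrary target almost never lands in the humane region by accident.
```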
I believe the necessary knowledge to be confident in each of these facts is not too big to fit in a human brain.
You may be referring to other things, which have similar paths to high confidence (e.g. “Why are you confident this alignment idea won’t work?” “I’ve poked holes in every alignment idea I’ve come across. At this point, Bayes tells me to expect new ideas not to work, so I need proof they will, not proof they won’t.”), but each path might be idea-specific.
I’m not sure if I’ve ever seen this stated explicitly, but this is essentially a thermodynamic argument. So to me, arguing against “alignment is hard” feels a lot like arguing “But why can’t this one be a perpetual motion machine of the second kind?” And the answer there is, “Ok fine, heat being spontaneously converted to work isn’t literally physically impossible, but the degree to which it is super-exponentially unlikely is greater than our puny human minds can really comprehend, and this is true for almost any set of laws of physics that might exist in any universe that can be said to have laws of physics at all.”
In The Rationalist’s Guide to the Galaxy, the author discusses the case of a chess game, particularly when a strong chess player faces a much weaker one. In that case it’s very easy to predict that the strong player will win with near certainty, even if you have no way to predict the intermediate steps. So there certainly are domains where (some) predictions are easy despite the world’s complexity.
My personal, rather uninformed take on the AI discussion is that many of the arguments are indeed comparable to the chess example, so the predictions seem convincing despite the complexity involved. But even then they are based on certain assumptions about how AGI will work (e.g. that it will be some kind of optimization process with a value function), and I find these assumptions pretty opaque. When I hear confident claims about AGI killing humanity, even if the arguments make sense, “model uncertainty” comes to mind. But it’s hard to argue about that, since it is unclear (to me) what the “model” actually is and how things could turn out differently.
Before taking Eliezer’s opinion into account—what are your priors? (and why?)
For myself, I prefer to form my own opinion and not only lean on expert predictions, if I can.
To make the point that this argument depends a lot on how one phrases the question: “AGI is complicated and the universe is big, how is everyone so sure we won’t die?”
I am not saying that my sentence above is a good argument. I’m saying it because it pushes my brain to actually figure out what is happening, instead of forming priors about experts, and I hope it does the same for you.
(which is also why I love this post!)