Thanks for the thoughtful answer. After thinking about this stuff for a while, I think I’ve updated a bit towards thinking (among other things) that alignment might not be quite as difficult as I previously believed.[1]
I still have a bunch of disagreements and questions, some of which I’ve written below. I’d be curious to hear your thoughts on them, if you feel like writing them up.
I.
I think there’s something like an almost-catch-22 in alignment. A simplified caricature of this catch-22:
In order to safely create powerful AI, we need it to be aligned.
But in order to align an AI, it needs to have certain capabilities (it needs to be able to understand humans, and/or to reason about its own alignedness, or etc.).
But an AI with such capabilities would probably already be dangerous.
Looking at the paragraphs quoted below,
I now realize that I have an additional assumption that I didn’t explicitly put in the post, which is something like… alignment and capabilities may be transmitted simultaneously. [...]
[...] a successive series of more sophisticated AI systems that get gradually better at understanding human preferences and being aligned with them (the way we went from GPT-1 to GPT-3.5)
I’m pattern-matching that as proposing that the almost-catch-22 is solvable by iteratively
1.) incrementing the AI’s capabilities a little bit
2.) using those improved capabilities to improve the AI’s alignedness (to the extent possible); goto (1.)
Does that sound like a reasonable description of what you were saying? If yes, I’m guessing that you believe sharp left turns are (very) unlikely?
I’m currently quite confused/uncertain about what happens (notably: whether something like sharp left turns happen) when training various kinds of (simulator-like) AIs to very high capabilities. But I feel kinda pessimistic about it being practically possible to implement and iteratively interleave steps (1.)-(2.) so as to produce a powerful aligned AGI. If you feel optimistic about it, I’m curious what makes you optimistic.
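For concreteness, here’s a minimal toy sketch of the interleaved loop I have in mind when I say “steps (1.)-(2.)”. Everything in it is a made-up stand-in (capability and alignedness obviously aren’t single numbers); the only point is the loop structure, and the implicit invariant that alignedness has to keep pace with capability at every step:

```python
# Toy stand-ins: treating "capability" and "alignedness" as single numbers is a
# gross simplification; this only illustrates the intended loop structure.
capability = 1.0
alignedness = 1.0
CAPABILITY_TARGET = 10.0   # hypothetical "powerful AGI" level
MAX_GAP = 0.5              # how far alignedness may lag before things get dangerous

while capability < CAPABILITY_TARGET:
    # (1.) increment the AI's capabilities a little bit
    capability += 0.5
    # (2.) use those improved capabilities to improve its alignedness,
    #      to the extent currently possible
    alignedness = min(capability, alignedness + 0.5)
    # The hoped-for invariant; my pessimism is about whether step (2.) can
    # actually be made to hold it at high capability levels.
    assert capability - alignedness <= MAX_GAP

print(f"capability={capability}, alignedness={alignedness}")
```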
II.
I don’t know how exactly we get from the part where the AI is modeling the human to the part where it actively wants to fulfill the human’s preferences [...]
I think {the stuff that paragraph points at} might contain a crux or two.
One possible crux:
[...] that drive shows up in human infants and animals, so I’d guess that it wouldn’t be very complicated. [2]
Whereas I think it would probably be quite difficult/complicated to specify a good version of D := {(a drive to) fulfill human H’s preferences, in a way H would reflectively endorse}. Partly because humans are inconsistent, manipulable, and possessed of rather limited powers of comprehension; I’d expect the outcome of an AI optimizing for D to depend a lot on things like
just what kind of reflective process the AI was simulating H to be performing
in what specific ways/order did the AI fulfill H’s various preferences
what information/experiences the AI caused H to have, while going about fulfilling H’s preferences.
I suspect one thing that might be giving people the intuition that {specifying a good version of D is easy} is the fact that {humans, dogs, and other agents on which people’s intuitions about PF (preference fulfillment) are based} are very weak optimizers; even if such an agent had a flawed version of D, the outcome would still probably be pretty OK. But if you subject a flawed version of D to extremely powerful optimization pressure, I’d expect the outcome to be worse, possibly catastrophic.
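To gesture at that numerically, here’s a toy simulation (entirely made up by me, not from anyone’s post) in which the “flawed D” is a proxy that only sees an action’s visible benefit, while the true value also subtracts a hidden cost that grows superlinearly. A weak optimizer that only looks at a handful of candidate actions lands in the pretty-OK regime; the same flawed proxy under much stronger selection lands in negative territory:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_value(r):
    # What we actually care about: the visible benefit r minus a hidden cost
    # that grows superlinearly in r. (Both terms are arbitrary toy choices.)
    return r - 0.1 * r**2

def avg_outcome(num_candidates, trials=100):
    """Average true value of the proxy-best candidate, where the flawed proxy
    is just the visible benefit r (so the proxy-best candidate is the max r)."""
    outcomes = []
    for _ in range(trials):
        r = rng.exponential(1.0, size=num_candidates)  # candidate actions
        outcomes.append(true_value(r.max()))
    return float(np.mean(outcomes))

for n in (4, 1_000, 100_000):
    print(f"{n:>7} candidates searched -> average true value {avg_outcome(n):+.2f}")
```

With this setup the first two averages come out positive (not far from the best achievable +2.5), while the last comes out clearly negative; the exact numbers are meaningless, the point is just the sign flip once the selection pressure on the flawed proxy gets strong enough.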
And then there’s the issue of {how do you go about actually “programming” a good version of D in the AI’s ontology, and load that program into the AI, as the AI gains capabilities?}; see (I.).
Maybe something like it being rewarding if the predictive model finds actions that the adult/human seems to approve of.
I think this would run into all the classic problems arising from {rewarding proxies to what we actually care about}, no? Assuming that D := {understand human H’s preferences-under-reflection (whatever that even means), and try to fulfill those preferences} is in fact somewhat complex (note that crux again!), it seems very unlikely that training an AI with a reward function like R := {positive reward whenever H signals approval} would generalize to D. There seem to be much, much simpler (a priori more likely) goals Y to which the AI could generalize from R. E.g. something like Y := {maximize number of human-like entities that are expressing intense approval}.
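To illustrate the “a priori more likely” part with a deliberately crude toy calculation (the description lengths below are numbers I invented purely to show the shape of the argument): if you put a simplicity prior over candidate goals, and all of the candidates fit the observed approval data equally well, the posterior concentrates almost entirely on the simpler candidates rather than on D:

```python
# Invented description lengths (in bits) for three goals that would all fit the
# training data "H signalled approval" equally well; the specific numbers are
# for illustration only, what matters is that D is much more complex.
goal_complexity_bits = {
    "D: fulfill H's preferences-under-reflection": 200,
    "Y: maximize human-like entities expressing approval": 60,
    "R-literal: maximize the approval signal itself": 40,
}

# Simplicity prior: P(goal) proportional to 2^(-description length). Since all
# candidates fit the data equally well, the posterior is just the normalized prior.
unnormalized = {goal: 2.0 ** -bits for goal, bits in goal_complexity_bits.items()}
total = sum(unnormalized.values())
for goal, weight in unnormalized.items():
    print(f"{goal:55s} posterior ~ {weight / total:.2e}")
```

Obviously real training dynamics aren’t a clean Bayesian update over a three-element hypothesis space; this is only meant to make the “simpler goals soak up almost all of the prior mass” intuition concrete.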
III.
Side note: I’m weirded out by all the references to humans, raising human children, etc.
I think that kind of stuff is probably not practically relevant/useful for alignment; and that trying to align an ASI by… {making it “human-like” in some way} is likely to fail, and probably has poor hyperexistential separation to boot.
I don’t know if anyone’s interested in discussing that, though, so maybe it’s better if I don’t waste more space on it here?
Side note: The reasoning step {drive X shows up in animals} → {X is probably simple} seems wrong to me. Like, Evolution is stupid, yes, but it’s had a lot of time to construct prosocial animals’ genomes; those genomes (and especially the brains they produce in interaction with huge amounts of sensory data) contain quite a lot of information/complexity.
[1] To put some ass-numbers on it, I think I’m going from something like
85% death (or worse) from {un/mis}aligned AI
10% some humans align AI and become corrupted, Weak Dystopia ensues
5% AI is aligned by humans with something like security mindset and cosmopolitan values, things go very well
to something like
80% death (or worse) from {un/mis}aligned AI
13% some humans align AI and become corrupted, Weak Dystopia ensues
7% AI is aligned by humans with something like security mindset and cosmopolitan values, things go very well
I’m pattern-matching that as proposing that the almost-catch-22 is solvable by iteratively
1.) incrementing the AI’s capabilities a little bit
2.) using those improved capabilities to improve the AI’s alignedness (to the extent possible); goto (1.)
Does that sound like a reasonable description of what you were saying?
I think it might be at least a reasonable first approximation, yeah.
If yes, I’m guessing that you believe sharp left turns are (very) unlikely?
I wouldn’t be so confident as to say they’re very unlikely, but I’m also not convinced that they’re very likely. I don’t have the energy to do a comprehensive analysis of the post right now, but here are some disagreements that I have with it:
“The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn’t make the resulting humans optimize mentally for IGF...” I suspect this analogy is misleading; I touched upon some of the reasons why in this comment. (I have a partially finished draft of a post whose thesis goes along the lines of “genetic fitness isn’t what evolution selects for; fitness is a measure of how strongly evolution selects for some other trait”, but I need to check my reasoning / finish it.)
The post suggests that we might get an AI’s alignment properties up to some level, but that at some point its capabilities shoot up to where those alignment properties aren’t enough to prevent us from all being killed. I think that if the preference fulfillment hypothesis is right, then “don’t kill the people whose preferences you’re trying to fulfill” is probably going to be a relatively basic alignment property (it’s impossible to fulfill the preferences of someone who is dead). So hopefully we should be able to lock that in before the AI’s capabilities reach the sharp left turn. (Though it’s still possible that the AI makes some mistake more subtle than killing everyone outright, such as modifying people’s brains to make their preferences easier to optimize.)
“sliding down the capabilities well is liable to break a bunch of your existing alignment properties [...] things in the capabilities well have instrumental incentives that cut against your alignment patches”. This seems to assume that the AI has been built with a motivation system that does not primarily optimize for something like alignment; rather, the alignment has been achieved by “patches” on top of the existing motivation system. But if the AI’s sole motivation is just fulfilling human preferences, then the alignment doesn’t take the form of patches that are trying to combat its actual motivation.
I’d expect the outcome of an AI optimizing for D to depend a lot on things like
just what kind of reflective process the AI was simulating H to be performing
in what specific ways/order did the AI fulfill H’s various preferences
what information/experiences the AI caused H to have, while going about fulfilling H’s preferences.
Agree; these are the kinds of things that I mentioned in my original post as still being unsolved problems and highly culturally contingent. Though it seems worth noting that if there aren’t objective right or wrong answers to these questions, then the AI can’t really get them wrong, either. Plausibly different approaches to fulfilling our preferences could lead to very different outcomes… but maybe we would be happy with any outcome we ultimately ended up at, since “are humans happy with the outcome” is probably going to be a major criterion in any preference fulfillment process.
I’m not certain of this. Maybe there are versions of D that the AI might end up on where the humans are doing the equivalent of suffering horribly on the inside while pretending to be okay on the outside, and that looks to the AI like their preferences are being fulfilled. But I’d guess that to be less likely than most versions of D genuinely caring about our happiness.
And then there’s the issue of {how do you go about actually “programming” a good version of D in the AI’s ontology, and load that program into the AI, as the AI gains capabilities?}; see (I.).
My reply to Steven (the part about how I think preference fulfillment might be “natural”) might be relevant here.
I think this would run into all the classic problems arising from {rewarding proxies to what we actually care about}, no?
It certainly sounds like it… but then somehow humans do manage to genuinely come to care about what makes other humans happy, so there seems to be some component (that might be “natural” in the sense that I described to Steven) that helps us avoid it.
Side note: I’m weirded out by all the references to humans, raising human children, etc. I think that kind of stuff is probably not practically relevant/useful for alignment;
About five to ten years ago, I would have shared that view. But more recently I’ve been shifting toward the view that maybe AIs are going to look and work relatively similarly to humans. Because if there’s a solution X for intelligence that is relatively easy to discover, then it might be that both early-stage AI researchers and evolution will hit upon that solution first: exactly because it’s the easiest solution to discover. And also, humans are the one example we have of intelligence that’s at least somewhat aligned with the human species, so that seems to be the place where we should be looking for solutions. See also “Humans provide an untapped wealth of evidence about alignment”, which I largely agree with.
I agree with some parts of what (I think) you’re saying; but I think I disagree with a lot of it. My thoughts here are still blurry/confused, though; will need to digest this stuff further. Thanks!