A not-entirely-aligned AI could still be valuable and helpful. It’s not inevitable that such entities will turn into omnicidal maniacs.
I think we know this; it’s just that most not-entirely-aligned AIs will.
Plenty of ‘failed utopia’-type outcomes that aren’t exactly what we would ideally want would still be pretty great, but the chances of hitting them by accident are very low.
“Plenty of ‘failed utopia’-type outcomes that aren’t exactly what we would ideally want would still be pretty great, but the chances of hitting them by accident are very low.”
I’m assuming you’ve read Eliezer’s post “Failed Utopia 4-2”, since you use the expression? I’ve actually been thinking a lot about that, and about how that specific “failed utopia” wasn’t really that bad. In fact it was much better than the current world: disease and aging (and, I’m assuming, violence too) all got solved, at the cost of all families being separated for a few decades, which is a pretty good trade if you ask me. It makes me wonder whether there’s some utility function for an unaligned AI that could lead to some kind of nice future, like “don’t kill anyone and don’t cause harm/suffering to anyone”. In stories about genies the wishes are always very ambiguous, so a “wish” stated negatively (don’t do this) might leave less room for ambiguity than one stated positively (do that).
But this is all assuming it will even be possible to give utility functions to advanced AI, which I’ve heard some people say won’t be the case.
This also plays into Stuart Russell’s view. His approach seems much simpler than alignment: in short, not letting the advanced AI know its final objective. It makes me wonder whether there could be solutions to the advanced-AI problem that are more tractable than intractable alignment.
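To make that a bit more concrete for myself (this is just my own toy reconstruction of the idea, not Russell’s actual formalism): if the AI is genuinely uncertain about the human’s objective, then letting a human veto its action can be worth more to it than just acting, because the human’s veto carries information about that objective. A made-up numeric sketch of that “off-switch”-style argument I’ve seen associated with his group:

```python
# Toy sketch of the "uncertainty about the objective" idea (off-switch-game style).
# The numbers and the three-action setup are illustrative assumptions, not Russell's model.
import numpy as np

rng = np.random.default_rng(0)

# The robot doesn't know the true utility U of its proposed action;
# it only has a belief about it (here: a prior centred slightly below zero).
prior_samples = rng.normal(loc=-0.1, scale=1.0, size=100_000)

# Option 1: just act. Expected utility is E[U] under the robot's belief.
act = prior_samples.mean()

# Option 2: switch itself off. Utility 0 by construction.
switch_off = 0.0

# Option 3: defer to a human who will allow the action iff U > 0
# (assuming the human knows U and acts in their own interest).
defer = np.maximum(prior_samples, 0.0).mean()

print(f"act:        {act:+.3f}")
print(f"switch off: {switch_off:+.3f}")
print(f"defer:      {defer:+.3f}")  # E[max(U, 0)] >= max(E[U], 0): deferring weakly dominates
```

Of course this leans entirely on the agent staying genuinely uncertain and on the human being modelled as roughly rational, which I gather is where a lot of the open problems are.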
assuming you’ve read Eliezer’s post “Failed Utopia 4-2”
Indeed I have, although I don’t remember the details. I think it’s an example of things going very well indeed, just not quite perfectly. Certainly if I could press a button today to cause that future, I would.
assuming violence too got all solved
I do hope not! I love violence and would hope that Mars is an utter bloodbath. Of course I would like my shattered fragments to knit themselves back together quickly enough that I can go drinking with my enemies and congratulate them on their victory before sloping home to my catgirl-slave-harem. And of course it would be no fun at all if we hadn’t solved hangovers, I would like to be fresh and enthusiastic for tomorrow’s bloodbath. Or maybe cricket. Or maths olympiad.
Venus probably works differently.
don’t kill anyone and don’t cause harm/suffering to anyone
The problem with this one is that the AI’s optimal move is to cease to exist.
And that’s already relying on being able to say what ‘kill someone’ means in a sufficiently clear way that it will satisfy computer programmers, which is much harder than satisfying philosophers or lawyers.
For instance, when Captain Kirk transports down to the Planet-of-Hats, did he just die when he was disassembled, and then get reborn? Do we need to know how the transporter works to say?
But this is even assuming that it will be possible to give utility functions to advanced AI, which I’ve heard some people say it won’t.
I actually wonder whether it’s possible for a rational agent not to have a utility function, since it can notice loops in its desires and eliminate them. For instance, I like to buy cigars and I like to smoke them, and at the end of this little loop I’ve got less money and less health, and I’d like to go back to the previous state, where I was healthier and had the price of a packet of fags.
That loop means that I don’t have a utility function, but if I could modify my own mind I’d happily get rid of the loop.
I think that means that any mind that notices it has circular preferences has the possibility of getting rid of them, and so it will eventually turn itself into a utility-function-type rational agent.
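(To spell out why the loop rules out a utility function: any utility function would have to assign strictly increasing numbers all the way round the cycle, which is impossible. A throwaway sketch with made-up labels, just to show the shape of the argument:)

```python
# Toy sketch: circular preferences admit no utility function.
# The state labels and the preference relation are invented for illustration.
from itertools import permutations

# "x is strictly preferred to y" pairs, containing the cigar loop.
prefers = [
    ("holding_cigars", "holding_cash"),   # I like buying cigars: trade cash for cigars
    ("having_smoked", "holding_cigars"),  # I like smoking them
    ("holding_cash", "having_smoked"),    # ...and afterwards I'd rather have the cash (and health) back
]

def has_consistent_utility(prefers):
    """Return True iff some assignment of numbers respects every strict preference."""
    states = {s for pair in prefers for s in pair}
    # A consistent numeric assignment exists iff the states can be totally ordered
    # in agreement with the preferences, i.e. iff the preference graph is acyclic.
    for order in permutations(states):
        rank = {s: i for i, s in enumerate(order)}
        if all(rank[a] > rank[b] for a, b in prefers):
            return True
    return False

print(has_consistent_utility(prefers))  # False: the loop blocks any utility assignment
```

Which is also why an agent with such preferences can be money-pumped: charge it a penny per step and walk it round the loop forever.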
The problem is to give the damned things the utility function you actually want them to have, rather than something cobbled together out of whatever program they started off as.
This also plays into Stuart Russell’s view. His approach seems much more simple than alignment, it’s just in short not letting the advanced AI know its final objective. It makes me think whether there could be solutions to the advanced AI problem that would be more tractable than the intractable alignment.
Stuart Russell is a very clever man, and if his approach to finessing the alignment problem can be made to work then that’s the best news ever, go Stuart!
But I am a little sceptical because it does seem short on details, and the main worry is that before he can get anywhere, some fool is going to create an unaligned AI, and then we are all dead.
Perhaps it’s not that difficult after all.
It’s still possible that there are a couple of clever hacks that will fix everything, and people are still looking, so there’s hope. What’s changed recently is that it’s suddenly looking like AI is really not very hard at all.
We already knew that it wasn’t, because of evolution, but it’s scarier when you see the thing that obviously has to be true start looking like it might actually be true.
Whereas alignment looks harder and harder the more we learn about it.
So now the problem is not ‘can this be done if we think of some clever hacks’, it’s ‘can this be done before this other thing that’s really easy and that people are spending trillions on’. Like a couple of weirdo misfits trying to work out a nuclear-bomb-proof umbrella in a garage in Hiroshima while the American government is already halfway through the Manhattan Project.
Eliezer is also a very clever man, and he started out really optimistic, and he and lots of other people have been thinking about this quite hard for a long time, and now he is not so optimistic.
I think that that must be because a lot of the obvious routes to not destroying the entire universe are blocked, and I am pretty sure that the author of ‘Failed Utopia 4-2’ has at least considered the possibility that it might not be so bad if we only get it 99%-right.
“I love violence and would hope that Mars is an utter bloodbath.”
The problem is that biological violence hurts like hell. Even most athletes live with chronic pain; imagine most warriors. Naturally we could solve the pain part, but then it wouldn’t be the violence I’m referring to. It would be videogame violence, which I’m OK with, since it doesn’t cause pain or injury or death. But don’t worry, I still got the joke!
“‘don’t kill anyone and don’t cause harm/suffering to anyone’
The problem with this one is that the AI’s optimal move is to cease to exist.”
I’ve thought about that as well. Big-brain idea: perhaps the first AGI’s utility function could be to act in the real world as minimally as possible, maybe with the sole goal of preventing other people from developing AGI, and to keep doing that until we solve alignment? Of course this policing-the-world part would already be prone to a lot of ambiguity and sophistry, but again, if we program ‘do not’s (do not let anyone else build AGI, plus do not kill anyone, plus do not cause suffering, etc.) instead of ‘do’s, it could lead to a lot less ambiguity and sophistry by drastically curtailing the maneuver space. (Not that I’m saying it would be easy.) As opposed to something like “cure cancer” or “build an Eiffel tower”.
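(Here’s roughly the shape of what I mean, as a deliberately crude sketch: the candidate plans and the predicate checks below are made-up placeholders, and writing the real predicates is obviously the actual hard part. The point is just that ‘do not’s act as filters that carve away the action space, rather than as targets to maximise.)

```python
# Crude sketch of "do not" constraints as filters over candidate plans.
# The plans and predicate functions are placeholders; writing real versions
# of kills_someone / causes_suffering / lets_agi_be_built is the actual hard problem.

def kills_someone(plan: str) -> bool:
    return "kill" in plan  # placeholder check, not a real predicate

def causes_suffering(plan: str) -> bool:
    return "coerce" in plan  # placeholder check

def lets_agi_be_built(plan: str) -> bool:
    return "ignore labs" in plan  # placeholder check

CONSTRAINTS = [kills_someone, causes_suffering, lets_agi_be_built]

def permissible(plan: str) -> bool:
    """A plan is allowed only if it violates none of the 'do not' constraints."""
    return not any(violates(plan) for violates in CONSTRAINTS)

candidate_plans = [
    "monitor chip fabs and quietly disrupt AGI training runs",
    "coerce every lab into shutting down",
    "ignore labs and optimise something else",
    "do nothing at all",
]

for plan in candidate_plans:
    print(f"{plan!r}: {'allowed' if permissible(plan) else 'forbidden'}")
```

Whether the space that’s left contains anything we’d actually want is, of course, a separate question.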
“And that’s already relying on being able to say what ‘kill someone’ means in a sufficiently clear way that it will satisfy computer programmers”
I don’t think so. When the brain irreversibly stops, you’re dead. It’s clear. This plays into my suspicion that we keep underestimating the intelligence of a superintelligence. I think even current AIs could be made to discern whether a person is dead or alive, perhaps already better than we can.
“For instance, when Captain Kirk transports down to the Planet-of-Hats, did he just die when he was disassembled, and then get reborn? Do we need to know how the transporter works to say?”
Maybe don’t teletransport anyone until we’ve figured that out? There the problem is teletransportation itself, not the AGI recognizing death at least as well as we do. (But I’d venture that it could even solve that philosophical problem, since it’s smarter than us.)
“Stuart Russell is a very clever man, and if his approach to finessing the alignment problem can be made to work then that’s the best news ever, go Stuart!
But I am a little sceptical because it does seem short on details, and the main worry is that before he can get anywhere, some fool is going to create an unaligned AI, and then we are all dead.”
I gotta admit that I completely agree.
“Whereas alignment looks harder and harder the more we learn about it.”
I’ll admit I’m not all that convinced by most of what I’ve said here. I’m still much closer to the side of “control is super difficult and we’re all gonna die (or worse)”. But I keep thinking about these things, to see whether maybe there’s a “way out”. Maybe we in this community have built up a bias that “only the mega-difficult value alignment will work”, when that could be false. Maybe it’s not just “clever hacks”; maybe there are simply more efficient and tractable ways to control advanced AI than intractable value alignment. But again, I’m not even all that convinced myself.
“and I am pretty sure that the author of ‘Failed Utopia 4-2’ has at least considered the possibility that it might not be so bad if we only get it 99%-right.”
Exactly. Again, perhaps there are much more tractable ways than 100% alignment. Or, if we could at least solve worst-case AI safety (that is, prevent s-risks), that would already be a massive win.
do not let anyone else build AGI, plus do not kill anyone, plus do not cause suffering, etc
Your problem here is that a good move for the AI is now to anaesthetize everyone, but keep them alive, although unconscious, until they die naturally.
act in the real world as minimally as possible
I think this might have been one of MIRI’s ideas, but it turns out to be tricky to define what it means. I can’t think what they called it so I can’t find it, but someone will know.
Maybe don’t teletransport anyone until we’ve figured that out?
There may not actually be an answer! I had thought planning for cryonic preservation was a good idea ever since I was a little boy.
But I found that Eliezer’s arguments in favour of cryonics actually worked backwards on me, and caused me to abandon my previous ideas about what death is and whether I care about entities in the future that remember being me or how many of them there are.
Luckily all that’s replaced them is a vast confusion so I do still have a smoke alarm. Otherwise I ignore the whole problem, go on as usual and don’t bother with cryonics because I’m not anticipating making it to the point of natural death anyway.
OR, if we could at least solve worst-case AI safety (that is, prevent s-risk) it would already be a massive win.
Easy! Build a paperclipper; it kills everyone. We don’t even need to bother doing this ourselves: plenty of well-funded clever people are working very hard on it on our behalf.
When the brain irreversibly stops you’re dead. It’s clear.
Your problem here is ‘irreversible’, and ‘stops’. How about just slowing it down a really lot?
The problem is that biological violence hurts like hell.
No problem there, I loved rugby and cricket, and they hurt a lot. I’m no masochist! Overcoming the fear and pain and playing anyway is part of the point. What I don’t like is irreversible damage. I have various lifelong injuries (mostly from rugby and cricket...), and various effects of aging preventing me from playing, but if they could be fixed I’d be straight back out there.
But cricket and rugby are no substitute for war, which is what they’re trying to be. And on Mars all injuries heal roughly at the point the pubs open.
have built a bias that “only the mega difficult value alignment will work”
I don’t think so. I think we’d settle for “anything that does better than everyone’s dead”. The problem is that most of the problems look fundamental. If you can do even slightly better than “everyone’s dead”, you can probably solve the whole thing (and build a friendly paperclipper that fills the universe with awesomeness).
So if you do end up coming up with something even slightly better than “everyone’s dead”, do let us know.
I think a lot of the obvious ideas have been thought of before, but even then there might still be mileage in making top-level posts about ideas here and letting people take pot-shots at them.
There may well be a nice clear obvious solution to the alignment problem which will make everyone feel a bit silly in retrospect.
It would be ever so undignified if we didn’t think of it because we were convinced we’d already tried everything.