Mostly the newer posts that you’re reading are not aiming to come up with The One True Encoding Of Human Values, which is why people don’t talk about these problems in relation to them. Rather, the hope is to create an AI system that does the specific things we ask of it, but ensures that we remain in control (see discussion of “corrigibility”). Such an AI system need not know The One True Encoding Of Human Values,
I don’t understand why anybody would want anything that involved leaving humans in control, unless there were absolutely no alternative whatsoever.
I’m not joking or being hyperbolic; I genuinely don’t get it. A lot of people seem to think that humans being in control is obviously good, but it seems really, really obvious to me that it’s a likely path to horrible outcomes.
Humans haven’t had access to all that much power for all that long, and we’ve already managed to create a number of conditions that look unstable and likely to go bad in catastrophic ways.
We’re on a climate slide to who-knows-where. The rest of the environment isn’t looking that good either. We’ve managed to avoid large-scale nuclear war for like 75 whole years after developing the capability, but that’s not remotely long enough to call “stable”. Those same 75 years have seen some reduction in war in general, but that looks like it’s turning around as the political system evolves. Most human governments (and other institutions) are distinctly suboptimal on a bunch of axes, including willingness to take crazy risks, and, although you can argue that they’ve gotten better in maybe the last 100 to 150 years, a large number of them now seem to have stopped getting better and started getting worse. Humans in general are systematically rotten to each other, and most of the advancement we’ve gotten against that seems to come from probably unsustainable institutional tricks that limit anybody’s ability to get the decisive upper hand.
If you gave humans control over more power, then why wouldn’t you expect all of that to get even worse? And even if you could find a way to make such a situation stably not-so-bad, how would you manage the transition, where some humans would have more power than others, and all humans, including the currently advantaged ones, would feel threatened?
It seems to me that the obvious assumption is that humans being in control is bad. And trying to think out the mechanics of actual scenarios hasn’t done anything to change that belief. How can anybody believe otherwise?
There’s a difference between “AI putting humans in control is bad”, and “AI putting humans in control is better than other options we seem to have for alignment.” For many people, it may be as you mentioned:
I don’t understand why anybody would want anything that involved leaving humans in control, unless there were absolutely no alternative whatsoever.
(I’m somewhat less pessimistic than you are, I think, but I agree it could go pretty damn poorly, for many ways the AI could “leave us in control.”)
What TurnTrout said. What’s the alternative to which you’re comparing?
I don’t have an alternative, and no, I’m not very happy about that. I definitely don’t know how to build a friendly AI. But, on the other hand, I don’t see how “corrigibility” could work either, so in that sense they’re on an equal footing. Nobody seems to have any real idea how to achieve either one, so why would you want to emphasize the one that seems less likely to lead to a non-sucky world?
Anyway, what I’m reacting to is this sense I get that some people assume that keeping humans in charge is good, and that humans not being in charge is in itself an unacceptable outcome, or at least weighs very heavily against the desirability of an outcome. I don’t know if I’ve seen very many people say that, but I see lots of things that seem to assume it. Things people write seem to start out with “If we want to make sure humans are still in charge, then...”, like that’s the primary goal. And I do not think it should be a primary goal. Not even a goal at all, actually.
Nobody seems to have any real idea how to achieve either one
I think that’s not true and we in fact have a much better idea of how to achieve corrigibility / intent alignment. (Not going to defend that here. You could see my comment here, though that one only argues why it might be easier rather than providing a method.)
Others will disagree with me on this.
humans not being in charge is in itself an unacceptable outcome, or at least weighs very heavily against the desirability of an outcome
The usual argument I’d give is “if humans aren’t in charge, then we can’t course correct if something goes wrong”. It’s instrumental, not terminal. If we ended up in a world like this where humans were not in charge, that seems like it could be okay depending on the details.
Another possibility is the Posthuman Technocapital Singularity: everything goes in the same approximate direction, there are a lot of competing agents but without sharp destabilization or power concentration, and Moloch wins. Probably wins, idk.
https://docs.osmarks.net/hypha/posthuman_technocapital_singularity
I consider the Arbital article on CEV the best reference for the topic. It says:
CEV is rather complicated and meta and hence not intended as something you’d do with the first AI you ever tried to build. CEV might be something that everyone inside a project agreed was an acceptable mutual target for their second AI. (The first AI should probably be a Task AGI.)
So MIRI doesn’t focus on CEV, etc. because the world hasn’t nailed down step one yet. We’re extremely worried that humanity’s on track to fail step one; and it doesn’t matter how well we do on step two if we don’t pull off step one. That doesn’t mean that stopping at step one and never shooting for anything more ambitious would be acceptable; by default I’d consider that an existential catastrophe in its own right.
Yeah, CEV itself seemed like a long shot. But my thought process was that maintaining human control wouldn’t be enough for step one, both because I think it’s not enough at the limit and because the human component might inherently be a limiting factor that makes it not very competitive. The more I thought about it, though, the weaker that assumption of inherentness seemed. So I agree that the most this post could be saying is that the timeline gap between something like Task AGI and figuring out step two is short, which I expect isn’t very groundbreaking.
Also, there’s no proof that CEV would work. Maybe values are incoherent.
The Arbital article is no help.
Asking what everyone would want* if they knew what the AI knew, and doing what they’d all predictably agree on, is just about the least jerky thing you can do.
How do we know that they would agree? That just begs the question. Saying that you shouldn’t be “jerky”, i.e. selfish, doesn’t tell you what kind of unselfishness to have instead. Clearly, the left and the right don’t agree on the best kind of altruism: laying down your life to stop the spread of socialism, versus sacrificing your income to implement socialism.
So if I’m understanding it correctly, it’s that maintaining human control is the best option we can formally work toward? The One True Encoding of Human Values would most likely be a more reliable system if we could create it, but that it’s a much harder problem, and not strictly necessary for a good end outcome?
The One True Encoding of Human Values would most likely be a more reliable system if we could create it
My best guess is that this is a confused claim and I have trouble saying “yes” or “no” to it, but I do agree with the spirit of it.
but that it’s a much harder problem, and not strictly necessary for a good end outcome?
Yes, in the same way that if you’re worried about mosquitoes giving you malaria, you start by getting a mosquito net or moving to a place without mosquitoes, you don’t immediately try to kill all mosquitoes in the world.