A conversation I’ve had several times in recent weeks is with people who argue that we can create human-level intelligences (who are safe, because they’re only human-level) and somehow use them to solve alignment or perform a pivotal act, or something like “just stop there”.
And I think recursive self-improvement is one answer to why human-level AIs are not safe. Actual humans already attempt to improve their intelligence (e.g. with nootropics); it’s just hard with our architecture. I expect unaligned human-level AIs to try the same thing and have much more success, because optimizing code and silicon hardware is easier than optimizing flesh brains.
I agree that human-level AIs will definitely try the same thing, but it’s not obvious to me that it will actually be much easier for them. Current machine learning techniques produce models that are hard to optimize for basically the same reasons that brains are; AIs will be easier to optimize for various reasons but I don’t think it will be nearly as extreme as this sentence makes it sound.
I naively expect the option of “take whatever model constitutes your mind and run it on faster hardware and/or duplicate it” to be relatively easy and likely to lead to fairly extreme gains.
I agree we can duplicate models once we’ve trained them; this seems like the strongest argument here.
What do you mean by “run on faster hardware”? Faster than what?
Faster than biological brains, by 6 orders of magnitude.
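A rough back-of-envelope for where a figure like that comes from, using commonly cited numbers that are my own assumptions rather than anything stated in this thread (neurons sustaining spike rates on the order of 100–1000 Hz, silicon clocks around 1 GHz):

```python
# Back-of-envelope only: the rates below are rough, commonly cited figures,
# not measurements from this discussion.
neuron_rate_hz = 1e2      # typical sustained neuron firing rate (~100-1000 Hz)
silicon_clock_hz = 1e9    # typical processor clock (~1 GHz)

ratio = silicon_clock_hz / neuron_rate_hz
print(f"serial speed ratio ~ {ratio:.0e}")  # ~1e+07, i.e. 6-7 orders of magnitude
```

Raw clock rate is not the same thing as a matching speedup in useful serial thought, so this is best read as an upper bound on the hardware side of the comparison.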
Ruby isn’t saying that computers have faster clock speeds than biological brains (which is definitely true), he’s claiming something like “after we have human-level AI, AIs will be able to get rapidly more powerful by running on faster hardware”; the speed increase is relative to some other computers, so the speed difference between brains and computers isn’t relevant.
Also, running faster and duplicating yourself keeps the model human-level in an important sense. A lot of threat models run through the model doing things that humans can’t understand even given a lot of time, and so those threat models require something stronger than just this.
I think clever duplication of human intelligence is plenty sufficient for general superhuman capacity in the important sense (where I mean something like ‘it has capacities that would be extinction-causing if (it believes) minimizing its loss function is achieved by turning off humanity (which could turn it off / start other (proto-)AGIs)’).
For one, I don’t think humanity is that robust in the status quo, and two, a team of internally aligned (because they’re copies) human-level intelligences capable of graduate-level biology seems plenty existentially scary.
Other issues with “just stop at human-level” include:
We don’t actually know how to usefully measure or upper-bound the capability of an AGI. Relying on past trends in ‘how much performance tends to scale with compute’ seems extremely unreliable and dangerous to me when you first hit AGI. And it becomes completely unreliable once the system is potentially modeling its operators and adjusting its visible performance in attempts to influence operators’ beliefs.
AI will never have exactly the same skill profile as humans. At a given overall capability level, an AI might be subhuman in many ways, superhuman in many others, and roughly par-human in still others. Safety in that case will depend on the specific skills the AI does or doesn’t have.
Usefulness/relevance will also depend on the specific skills the AI has. Some “human-level AIs” may be useless for pivotal acts, even if you know how to perfectly align them.
I endorse “don’t crank your first AGI systems up to maximum”—cranking up to maximum seems obviously suicidal to me. Limiting capabilities is absolutely essential.
But I don’t think this solves the problem on its own, and I think achieving this will be more complicated and precarious than the phrasing “human-level AI” might suggest.
Seems to me that optimizing flesh brains is easier than optimizing code and silicon hardware. It’s so easy, evolution can do it despite being very dumb.
Roughly speaking, the part that makes it easy is that the effects of flesh brains are additive with respect to the variables one might modify (standing genetic variation), whereas the effects of hardware and software are highly nonlinear with respect to the variables one might modify (circuit connectivity(?) and code characters).
We haven’t made much progress on optimizing humans, but that’s less because optimizing humans is hard and more because humans prefer using the resources that could’ve been used for optimizing humans for self-preservation instead.
Why the disagree vote?
For example, if a human says “I’d like to make a similar brain as mine, but with 80% more neurons per cortical minicolumn”, there’s no way to actually do that, at least not without spending decades or centuries on basic bio-engineering research.
By contrast, if an ANN-based AGI says “I’d like to make a similar ANN as mine, but with 80% more neurons per layer”, they can actually do that experiment immediately.
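As a concrete illustration of how cheap that kind of experiment is on the software side, here is a minimal sketch using a toy PyTorch MLP; the layer widths, dimensions, and the 1.8× scaling factor are illustrative assumptions, not anything specified in the thread:

```python
# Minimal sketch (assumes PyTorch is available); all sizes are illustrative.
import torch.nn as nn

def make_mlp(hidden_widths, in_dim=128, out_dim=10):
    """Build a feed-forward ReLU network with the given hidden-layer widths."""
    layers, prev = [], in_dim
    for width in hidden_widths:
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)

def n_params(model):
    return sum(p.numel() for p in model.parameters())

base_widths = [256, 256, 256]                        # hypothetical "current mind"
wider_widths = [int(w * 1.8) for w in base_widths]   # 80% more neurons per layer

base_model = make_mlp(base_widths)
wider_model = make_mlp(wider_widths)

# The wider variant is specified and instantiated immediately; only
# (re)training and evaluating it costs anything.
print(n_params(base_model), n_params(wider_model))
```

The wider network still has to be retrained and evaluated, but specifying and instantiating it is a one-line change rather than a decades-long bio-engineering program.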
First, some types of software can be largely additive with respect to their variables, e.g. neural nets; that’s basically why SGD works. Second, software has lots of other huge advantages, like rapid iteration times, copyability, and inspectability of intermediate states.
Hm, maybe there are two reasons why human-level AIs are safe:
1. A bunch of our alignment techniques work better when the overseer can understand what the AIs are doing (given enough time). This means that human-level AIs are actually aligned.
2. Even if the human-level AIs misbehave, they’re just human-level, so they can’t take over the world.
Under model (1), it’s totally ok that self-improvement is an option, because we’ll be able to train our AIs to not do that.
Under model (2), there are definitely some concerning scenarios here where the AIs e.g. escape onto the internet, use their code to get resources, duplicate themselves a bunch of times, and set up a competing AI development project. That project might have an advantage in some ways, since it can care less about paying alignment taxes.
I unconfidently suspect that human-level AIs won’t have a much easier time with the alignment problem than we expect to have.
Agree it’s not clear. Some reasons why they might:
If training environments’ inductive biases point firmly towards some specific (non-human) values, then maybe the misaligned AIs can just train bigger and better AI systems using environments similar to the ones they were trained in, and hope that those AIs will end up with similar values.
Maybe values can differ a bit, and cosmopolitanism or decision theory can carry the rest of the way. Just like Paul says he’d be pretty happy with intelligent life that came from a similar distribution to the one our civilization came from.
Humans might need to use a bunch of human labor to oversee all their human-level AIs. The HLAIs can skip this, insofar as they can trust copies of themselves. And when training even smarter AI, it’s a nice benefit to have cheap, copyable, trustworthy human-level overseers.
Maybe you can somehow gradually increase the capabilities of your HLAIs in a way that preserves their values.
(You have a lot of high-quality labor at this point, which really helps for interpretability and making improvements through other ways than gradient descent.)
I don’t think human-level AIs are safe, but I also think it’s pretty clear they’re not so dangerous that it’s impossible to use them without destroying the world. We can probably prevent them from being able to modify themselves, if we are sufficiently careful.
“A human-level AI will recursively self-improve to superintelligence if we let it” isn’t really that solid an argument here, I think.