For all tasks A-I, most programs that we can imagine writing to do the task will need to search through various actions and evaluate the consequences. The only one of those tasks we currently know how to solve with a program is A, and chess programs do indeed use search and evaluation.
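As a concrete illustration of that "search through actions and evaluate the consequences" pattern, here is a minimal depth-limited negamax sketch. The legal_actions, apply, and evaluate callbacks are hypothetical stand-ins for a real game implementation, not code from any particular engine.

```python
# Minimal sketch of the "search actions, evaluate consequences" pattern
# (depth-limited negamax). legal_actions, apply, and evaluate are
# hypothetical callbacks supplied by whatever game is being played.

def negamax(state, depth, legal_actions, apply, evaluate):
    """Return (best_score, best_action) from the current player's point of view."""
    actions = legal_actions(state)
    if depth == 0 or not actions:
        return evaluate(state), None          # evaluate scores the position for the player to move
    best_score, best_action = float("-inf"), None
    for action in actions:
        child = apply(state, action)          # consequence of taking this action
        score, _ = negamax(child, depth - 1, legal_actions, apply, evaluate)
        score = -score                        # the opponent's best outcome is our worst
        if score > best_score:
            best_score, best_action = score, action
    return best_score, best_action
```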
I’d guess that whether something can be done safely is mostly a function of how easy it is, and how isolated it is from the real world. The Riemann hypothesis seems pretty tricky, but it’s isolated from the real world, so it can probably be solved by a safe system. Chess is isolated and easy. Starting a billion dollar company is very entangled with the real world, and very tricky. So we probably couldn’t do it without a dangerous system.
This all makes sense, except for the bit where you draw the line at a certain level of “tricky and entangled with world”. Why isn’t it the case that danger only arises for the first AIs that can do tasks half as tricky? Twice as tricky? Ten times as tricky?
Consider what would happen if you had to solve your list of problems and didn’t inherently care about human values. To what extent would you do ‘unfriendly’ things via consequentialism? How hard would you need to be constrained to stop doing that? Would it matter if you could also do far trickier things by using consequentialism and general power-seeking actions?
The reason, as I understand it, that a chess-playing AI does things the way we want it to is that we constrain the search space it can use, because we can fully describe that space, and we never give it any means of using other approaches; for now that box is robust.
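A sketch of what that box amounts to in practice, under the assumption of a hypothetical harness: the engine's only output channel is a choice from the list of legal moves that the harness itself enumerates, so there is no route by which it can act on the world any other way. Every name below (engine.pick, enumerate_legal_moves, make_move) is illustrative, not a real API.

```python
# Sketch of the "box": the engine's only output channel is a choice from the
# legal-move list that the harness enumerates. All names here are hypothetical.

def play_one_move(board, engine, enumerate_legal_moves, make_move):
    legal = enumerate_legal_moves(board)   # the harness defines the entire action space
    choice = engine.pick(board, legal)     # the engine can only rank the options it was given
    if choice not in legal:                # anything outside the box is rejected outright
        raise ValueError("engine proposed a move outside its allowed action space")
    return make_move(board, choice)
```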
But if someone gave you or me the same task, we wouldn’t learn chess, we would buy a copy of Stockfish, or if it was a harder task (e.g. be better than AlphaZero) we’d go acquire resources using consequentialism. And it’s reasonable to think that if we gave a fully generic but powerful future AI the task of being the best at chess, at some point it’s going to figure out that the way to do that is to acquire resources via consequentialism, and potentially to kill or destroy all its potential opponents. Winner.
Same with the poem or the hypothesis: I’m not going to be so foolish as to attack the problem directly unless it’s already pretty easy for me. And in order to get an AI to write a poem that good, I find it plausible that the path is less monkeys on a typewriter and more resource acquisition, so that it can understand the world well enough to do that. As a programmer of an AI, right now, the path is exactly that: ‘build an AI that gets me enough additional funding to potentially get something good enough to write that kind of poem,’ etc.
Another approach, and more directly a response to your question here, is to ask, which is easier for you/the-AI: Solving the problem head-on using only known-safe tactics and existing resources, or seeking power via consequentialism?
Yes, at some amount of endowment, I already have enough resources relative to the problem at hand and see a path to a solution, so I don’t bother looking elsewhere and just solve it, same as a human. But mostly no for anything really worth doing, which is the issue?
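A toy way to make that comparison concrete: treat "solve head-on" and "seek power first" as two strategies whose costs scale differently in task difficulty relative to the agent's existing endowment. The functional forms and every number below are illustrative assumptions, not estimates of anything.

```python
# Toy model of the comparison above: as task difficulty grows relative to the
# agent's endowment, at some point "acquire more resources first" beats
# "solve it head-on". All numbers and functional forms are made up.

def cost_direct(difficulty, endowment):
    # Direct solving gets rapidly harder once the task outstrips what you already have.
    return (difficulty / endowment) ** 2

def cost_power_seek(difficulty, endowment, overhead=10.0, multiplier=20.0):
    # Pay a fixed overhead to multiply your resources, then solve directly.
    return overhead + cost_direct(difficulty, endowment * multiplier)

for difficulty in [1, 10, 100, 1000]:
    d = cost_direct(difficulty, endowment=5.0)
    p = cost_power_seek(difficulty, endowment=5.0)
    better = "direct" if d <= p else "power-seek"
    print(f"difficulty={difficulty:5}: direct={d:10.1f}  power-seek={p:10.1f}  -> {better}")
```

In this toy setup the easy tasks favour solving directly and the hard ones favour acquiring resources first, which is the shape of the worry in the comment above.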
I agree with basically your whole comment. But it doesn’t seem like you’re engaging with the frame I’m using. I’m trying to figure out how agentic the first AI that can do task X is, for a range of X (with the hope that the first AI that can do X is not very agentic, for some X that is a pivotal task). The claim that a highly agentic, highly intelligent AI will likely do undesirable things when presented with task X is very little evidence about this, because a highly agentic, highly intelligent AI will likely do undesirable things when presented with almost any task.
Thank you, that is clarifying, together with your note to Scott on ACX about wanting it to ‘lack a motivational system.’ I want to see if I have this right before I give another shot at answering your actual question.
So as I understand your question now, what you’re asking is: will the first AI that can do (ideally pivotal) task X be of Type A (general, planning, motivational, agentic, models the world, intelligent, etc.) or Type B (basic, pattern-matching, narrow, dumb, domain-specific, constrained, boxed, etc.).
I almost accidentally labeled A/B as G/N there, and I’m not sure whether that’s a fair labeling, so I want to see how close the mapping is (i.e. general AI and narrow AI as usually understood). If not, is there a key difference?
Instead of “dumb” or “narrow” I’d say “having a strong comparative advantage in X (versus humans)”. E.g. imagine watching evolution and asking “will the first animals that take over the world be able to solve the Riemann hypothesis”, and the answer is no because human intelligence, while general, is still pointed more at civilisation-building-style tasks than mathematics.
Similarly, I don’t expect any AI which can do a bunch of groundbreaking science to be “narrow” by our current standards, but I do hope that it has a strong comparative disadvantage at taking-over-world-style tasks, compared with doing-science-style tasks.
And that’s related to agency, because what we mean by agency is not far off “having a comparative advantage in taking-over-world-style tasks”.
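To pin down the economics sense of "comparative advantage" being leaned on here, a toy calculation with made-up capability scores: a system can be absolutely better than humans at everything and still have a comparative disadvantage at takeover-style tasks.

```python
# Toy illustration of "comparative advantage", with invented capability scores
# (higher = better at the task). Absolute advantage at both tasks is compatible
# with a comparative DISadvantage at one of them.

capabilities = {
    "humans": {"science": 1.0,  "take_over_world": 1.0},   # baseline
    "ai":     {"science": 50.0, "take_over_world": 2.0},   # hypothetical system
}

for task in ("science", "take_over_world"):
    ratio = capabilities["ai"][task] / capabilities["humans"][task]
    print(f"AI/human ratio on {task}: {ratio:.0f}x")

# The AI is absolutely better at both tasks, but its edge is 50x on science and
# only 2x on taking over the world: a comparative disadvantage at the latter,
# which is the property being hoped for in the comment above.
```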
Now, I expect that at some point, this line of reasoning stops being useful, because your systems are general enough and agentic enough that, even if their comparative advantage isn’t taking over the world, they can pretty easily do that anyway. But the question is whether this line of reasoning is still useful for the first systems which can do pivotal task X. Eliezer thinks no, because he considers intelligence and agency to be very strongly linked. I’m less sure, because evolution has optimised humans really hard for agency, so I’d be surprised if you couldn’t beat us at a bunch of intellectual tasks while being much less agentic than us.
Side note: I meant “pattern-matching” as a gesture towards “the bit of general intelligence that doesn’t require agency” (although in hindsight I can see how this is confusing, I’ve just made an edit on the ACX comment).
“will the first animals that take over the world be able to solve the Riemann hypothesis”, and the answer is no because human intelligence, while general, is still pointed more at civilisation-building-style tasks than mathematics.
Pardon the semantics, but I think the question you want to use here is “will the first animals that take over the world have already solved the Riemann hypothesis”. IMO humans do have the ability (“can”) to solve the Riemann hypothesis, and the point you’re making is just about the ordering in which we’ve done things.
Yes, sorry, you’re right; edited.
I’m not sure this is the most useful way to think about it, either, because it includes the possibility that we didn’t solve the Riemann hypothesis first just because we weren’t really interested in it, not because of any kind of inherent difficulty to the problem or our suitability to solving it earlier. I think you’d want to consider:
1. alternative histories where solving the Riemann hypothesis was a (or the) main goal for humanity, and
2. alternative histories where world takeover was a (or the) main goal for humanity (our own actual history might be close enough),
and ask if we solve the Riemann hypothesis at earlier average times in worlds like 1 than we take over the world in worlds like 2.
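A toy Monte Carlo rendering of that comparison, purely to show its shape; both distributions below are invented placeholders, not estimates of how long either project would actually take.

```python
# Toy comparison of average completion times across two sets of hypothetical
# worlds. The lognormal parameters are completely made up for illustration.

import random

random.seed(0)

def years_to_rh_proof():
    # Hypothetical: years to a proof in type-1 worlds (RH is humanity's main goal).
    return random.lognormvariate(5.0, 1.0)

def years_to_takeover():
    # Hypothetical: years to world takeover in type-2 worlds (takeover is the main goal).
    return random.lognormvariate(4.0, 1.0)

n = 100_000
mean_rh = sum(years_to_rh_proof() for _ in range(n)) / n
mean_takeover = sum(years_to_takeover() for _ in range(n)) / n
print(f"mean years to RH proof (type-1 worlds):  {mean_rh:8.1f}")
print(f"mean years to takeover (type-2 worlds):  {mean_takeover:8.1f}")
print("RH comes first on average" if mean_rh < mean_takeover else "takeover comes first on average")
```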
We might also be able to imagine species that could take over the world but seem to have no hope of ever solving the Riemann hypothesis, and I think we want to distinguish that from just happening to not solve it first. Depending on what you mean by “taking over the world”, other animals may have done so before us, too, e.g. arthropods. Or even plants or other forms of life, more than or before any group of animals, even all animals combined.
No one actually knows the exact task-difficulty threshold, but the intuition is that once a task is hard enough, any AI capable of completing the task is also capable of thinking of strategies that involve betraying its human creators. However, even if I don’t know the exact threshold, I can think of examples that should definitely be above the line. Starting a billion dollar company seems pretty difficult, but it could maybe be achieved by a special-purpose algorithm that just plays the stock market really well. But if we add a few more stipulations, like that the company has to make money by building an actual product, in an established industry with lots of competition, then probably that can only be done by a dangerous algorithm. It’s not a very big step from “figuring out how to outwit your competitors” to “realizing that you could outwit humans in general”.
An implicit assumption here is that I’m drawing the line between “safe” and “dangerous” at the point where the algorithm realizes that it could potentially achieve higher utility by betraying us. It’s possible that an algorithm could realize this, but still not be strong enough to “win” against humanity.
The easiest way is probably to build a modestly-sized company doing software and then find a way to destabilize the government and cause hyperinflation.
I think the rule of thumb should be: if your AI could be intentionally deployed to take over the world, it’s highly likely to do so unintentionally.