Suppose we took this whole post and substituted every instance of “cure cancer” with the following:
Version A: “win a chess game against a grandmaster”
Version B: “write a Shakespeare-level poem”
Version C: “solve the Riemann hypothesis”
Version D: “found a billion-dollar company”
Version E: “cure cancer”
Version F: “found a ten-trillion-dollar company”
Version G: “take over the USA”
Version H: “solve the alignment problem”
Version I: “take over the galaxy”
And so on. Now, the argument made in version A of the post clearly doesn’t work, the argument in version B very likely doesn’t work, and I’d guess that the argument in version C doesn’t work either. Suppose I concede, though, that the argument in version I works: that searching for an oracle smart enough to give us a successful plan for taking over the galaxy will very likely lead us to develop an agentic, misaligned AGI. Then that still leaves us with the question: what about versions D, E, F, G and H? The argument is structurally identical in each case—so what is it about “curing cancer” that is so hard that, unlike winning chess or (possibly) solving the Riemann hypothesis, when we train for that we’ll get misaligned agents instead?
We might say: well, for humans, curing cancer requires high levels of agency. But humans are really badly optimised for many types of abstract thinking—hence why we can be beaten at chess so easily. So why can’t we also be beaten at curing cancer by systems less agentic than us?
Eliezer has a bunch of intuitions which tell him where the line of “things we can’t do with non-dangerous systems” should be drawn, which I freely agree I don’t understand (although I will note that it’s suspicious how most people can’t do things on the far side of his line, but Einstein can). But insofar as this post doesn’t consider which side of the line curing cancer is actually on, then I don’t think it’s correctly diagnosed the place where Eliezer and I are bouncing off each other.
For all tasks A-I, most programs that we can imagine writing to do the task will need to search through various actions and evaluate the consequences. The only one of those tasks we currently know how to solve with a program is A, and chess programs do indeed use search and evaluation.
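As a rough illustration of that search-and-evaluate loop, here is a minimal Python sketch run over a tiny hand-built game tree rather than real chess; the tree, the leaf scores, and the names are illustrative assumptions, not any actual engine’s internals.

```python
# Minimal sketch of "search through actions, evaluate the consequences".
# The toy tree, leaf scores, and names below are illustrative assumptions.

TOY_TREE = {
    "start": {"a": "p1", "b": "p2"},        # our candidate moves
    "p1":    {"x": "leaf1", "y": "leaf2"},  # opponent replies
    "p2":    {"x": "leaf3", "y": "leaf4"},
}
LEAF_SCORES = {"leaf1": 1, "leaf2": -1, "leaf3": 0, "leaf4": 2}  # stand-in evaluation


def minimax(node: str, our_turn: bool) -> int:
    """Recursively search every action and evaluate the positions it leads to."""
    if node in LEAF_SCORES:  # terminal position: just evaluate it
        return LEAF_SCORES[node]
    child_scores = [minimax(child, not our_turn) for child in TOY_TREE[node].values()]
    return max(child_scores) if our_turn else min(child_scores)


# Pick the move whose worst-case (opponent-minimised) outcome is best.
best_move = max(TOY_TREE["start"], key=lambda m: minimax(TOY_TREE["start"][m], False))
print("best opening move:", best_move)  # -> "b" (worst case 0 beats worst case -1)
```

Stockfish-class engines are vastly more sophisticated, but the skeleton is the same: enumerate actions, look ahead, score the resulting positions.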
I’d guess that whether something can be done safely is mostly a function of how easy it is, and how isolated it is from the real world. The Riemann hypothesis seems pretty tricky, but it’s isolated from the real world, so it can probably be solved by a safe system. Chess is isolated and easy. Starting a billion-dollar company is very entangled with the real world, and very tricky. So we probably couldn’t do it without a dangerous system.
This all makes sense, except for the bit where you draw the line at a certain level of “tricky and entangled with world”. Why isn’t it the case that danger only arises for the first AIs that can do tasks half as tricky? Twice as tricky? Ten times as tricky?
Consider what happens if you had to solve your list of problems and didn’t inherently care about human values. To what extent would you do ‘unfriendly’ things via consequentialism? How hard would you need to be constrained to stop doing that? Would it matter if you could also do far trickier things by using consequentialism and general power-seeking actions?
The reason, as I understand it, that a chess-playing AI does things the way we want is that we constrain the search space it can use: because we can fully describe that space, we never have to give the system any means of using other approaches, and for now that box is robust.
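Concretely, the box amounts to something like the following sketch, under the assumption that the system’s only output channel is a choice from an enumerated legal-move list (the move strings and the scoring stub are made-up stand-ins, not any real engine’s interface):

```python
# Sketch: the engine's only output channel is a choice from a fully enumerated
# action space. The move strings and scoring stub are illustrative assumptions.

from typing import Callable, Sequence


def pick_move(legal_moves: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the highest-scoring action from the pre-enumerated legal set.

    However clever `score` is, nothing outside `legal_moves` can ever be
    emitted, so the system's influence on the world is limited to one move.
    """
    if not legal_moves:
        raise ValueError("no legal moves available")
    return max(legal_moves, key=score)


# Toy usage, with a hard-coded evaluator standing in for a strong engine.
legal = ["e2e4", "d2d4", "g1f3"]
print(pick_move(legal, score=lambda m: {"e2e4": 0.3, "d2d4": 0.5, "g1f3": 0.2}[m]))  # d2d4
```

The safety property lives in the interface, not in the cleverness of the evaluator.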
But if someone gave you or me the same task, we wouldn’t learn chess, we would buy a copy of Stockfish, or if it was a harder task (e.g. be better than AlphaZero) we’d go acquire resources using consequentialism. And it’s reasonable to think that if we gave a fully generic but powerful future AI the task of being the best at chess, at some point it’s going to figure out that the way to do that is to acquire resources via consequentialism, and potentially to kill or destroy all its potential opponents. Winner.
Same with the poem or the hypothesis: I’m not going to be so foolish as to attack the problem directly unless it’s already pretty easy for me. And in order to get an AI to write a poem that good, I find it plausible that the path to doing that is less monkeys on a typewriter and more resource acquisition, so I can understand the world well enough to do it. As a programmer of an AI, right now, the path is exactly that: it’s ‘build an AI that gets me enough additional funding to potentially get something good enough to write that kind of poem,’ etc.
Another approach, and more directly a response to your question here, is to ask, which is easier for you/the-AI: Solving the problem head-on using only known-safe tactics and existing resources, or seeking power via consequentialism?
Yes, at some amount of endowment, I already have enough resources relative to the problem at hand and see a path to a solution, so I don’t bother looking elsewhere and just solve it, same as a human. But mostly no for anything really worth doing, which is the issue?
I agree with basically your whole comment. But it doesn’t seem like you’re engaging with the frame I’m using. I’m trying to figure out how agentic the first AI that can do task X is, for a range of X (with the hope that the first AI that can do X is not very agentic, for some X that is a pivotal task). The claim that a highly agentic highly intelligent AI will likely do undesirable things when presented with task X is very little evidence about this, because a highly agentic highly intelligent AI will likely do undesirable things when presented with almost any task.
Thank you, that is clarifying, together with your note to Scott on ACX about wanting it to ‘lack a motivational system.’ I want to see if I have this right before I give another shot at answering your actual question.
So as I understand your question now, what you’re asking is, will the first AI that can do (ideally pivotal) task X be of Type A (general, planning, motivational, agentic, models world, intelligent, etc) or Type B (basic, pattern matching, narrow, dumb, domain specific, constrained, boxed, etc).
I almost accidentally labeled A/B as G/N there, and I’m not sure if that’s a fair labeling system; I want to see how close the mapping is (i.e. general AI and narrow AI as usually understood). If not, is there a key difference?
Instead of “dumb” or “narrow” I’d say “having a strong comparative advantage in X (versus humans)”. E.g. imagine watching evolution and asking “will the first animals that take over the world have already solved the Riemann hypothesis”, and the answer is no because human intelligence, while general, is still pointed more at civilisation-building-style tasks than mathematics.
Similarly, I don’t expect any AI which can do a bunch of groundbreaking science to be “narrow” by our current standards, but I do hope that it has a strong comparative disadvantage at taking-over-world-style tasks, compared with doing-science-style tasks.
And that’s related to agency, because what we mean by agency is not far off “having a comparative advantage in taking-over-world style tasks”.
Now, I expect that at some point, this line of reasoning stops being useful, because your systems are general enough and agentic enough that, even if their comparative advantage isn’t taking over the world, they can pretty easily do that anyway. But the question is whether this line of reasoning is still useful for the first systems which can do pivotal task X. Eliezer thinks no, because he considers intelligence and agency to be very strongly linked. I’m less sure, because evolution has optimised humans really hard to be agentic, so I’d be surprised if you couldn’t beat us at a bunch of intellectual tasks while being much less agentic than us.
Side note: I meant “pattern-matching” as a gesture towards “the bit of general intelligence that doesn’t require agency” (although in hindsight I can see how this is confusing, I’ve just made an edit on the ACX comment).
“will the first animals that take over the world be able to solve the Riemann hypothesis”, and the answer is no because human intelligence, while general, is still pointed more at civilisation-building-style tasks than mathematics.
Pardon the semantics, but I think the question you want to use here is “will the first animals that take over the world have already solved the Riemann hypothesis”. IMO humans do have the ability (“can”) to solve the Riemann hypothesis, and the point you’re making is just about the ordering in which we’ve done things.
Yes, sorry, you’re right; edited.
I’m not sure this is the most useful way to think about it, either, because it includes the possibility that we didn’t solve the Riemann hypothesis first just because we weren’t really interested in it, not because of any inherent difficulty of the problem or any lack of suitability on our part for solving it earlier. I think you’d want to consider:
1. alternative histories where solving the Riemann hypothesis was a (or the) main goal for humanity, and
2. alternative histories where world takeover was a (or the) main goal for humanity (our own actual history might be close enough)
and ask if we solve the Riemann hypothesis at earlier average times in worlds like 1 than we take over the world in worlds like 2.
We might also be able to imagine species that could take over the world but seem to have no hope of ever solving the Riemann hypothesis, and I think we want to distinguish that from just happening to not solve it first. Depending on what you mean by “taking over the world”, other animals may have done so before us, too, e.g. arthropods. Or even plants or other forms of life, more than or before any group of animals, even all animals combined.
No one actually knows the exact task-difficulty threshold, but the intuition is that once a task is hard enough, any AI capable of completing the task is also capable of thinking of strategies that involve betraying its human creators. However, even if I don’t know the exact threshold, I can think of examples that should definitely be above the line. Starting a billion-dollar company seems pretty difficult, but it could maybe be achieved by a special-purpose algorithm that just plays the stock market really well. But if we add a few more stipulations, like that the company has to make money by building an actual product, in an established industry with lots of competition, then probably that can only be done by a dangerous algorithm. It’s not a very big step from “figuring out how to outwit your competitors” to “realizing that you could outwit humans in general”.
An implicit assumption here is that I’m drawing the line between “safe” and “dangerous” at the point where the algorithm realizes that it could potentially achieve higher utility by betraying us. It’s possible that an algorithm could realize this, but still not be strong enough to “win” against humanity.
The easiest way is probably to build a modestly-sized company doing software and then find a way to destabilize the government and cause hyperinflation.
I think the rule of thumb should be: if your AI could be intentionally deployed to take over the world, it’s highly likely to do so unintentionally.
My understanding is that you can’t safely do even A with an arbitrarily powerful optimizer. An arbitrarily powerful optimizer whose reward function is solely “beat the grandmaster” would do everything possible to ensure its reward function is maximised with the highest probability. For instance, it might amass as much compute as possible to ensure that it’s made no errors at all, it might armor its servers to ensure no one switches it off, and of course, it might pharmacologically mess with the grandmaster to inhibit their performance.
The fact that it can be done safely by a weak AI isn’t to say that it’s safe to do with a powerful AI.
For the purposes of this argument, I’m interested in what can be done safely by some AI we can build. If you can solve alignment safely with some AI, then you’re in a good situation. What an arbitrarily powerful optimiser will do isn’t the crux, we all agree that’s dangerous.
I see what you’re getting at. Interesting question.
Yes, it definitely doesn’t work with A or C. It might work with B, because judging whether a poem is Shakespeare-level or not is heavily entangled with human society and culture, and it may turn out that manipulating humans to rave about whatever you wrote (whether it’s actually Shakespeare-level poetry or not) is easier. I expect not, but it’s hard to be sure. I would certainly put C as safer than B.
Everything else is obviously much more dangerous.
That was my intuition as well. A and C are just not entangled with the physical world at all. B is a maybe; it’s a big leap from poetry to taking over the world, but humans are something that has to be modelled and that’s where trouble starts.
Looking at the A substitution, why doesn’t this argument work?
I think by “win a chess game against a grandmaster” you are specifically asking about the game itself. In real life we also have to arrange the game, stay alive until the game, etc. Let’s take all that out of scope; it’s obviously unsafe.
If there were a list of all the possible plans that win a chess game against a grandmaster, ranked by “likely to work”, most of the plans that might work route through “consequentialism” and “acquire resources.”
Now, say you build an oracle AI. You’ve done all the things to try and make it interpretable and honest and such. If you ask it for a plan to win a chess game against a grandmaster, what happens?
Well it definitely doesn’t give you a plan like “If the grandmaster plays e4, you play e5, and then if they play d4, you play f5, …” because that plan is too large. I think the desired outcome is a plan like “open with pawn to d4, observe the board position, then ask for another plan”. Are Oracle AIs allowed to provide self-referential plans?
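Some rough back-of-the-envelope arithmetic for why that full contingency plan is too large; the branching factor and game length below are ballpark assumptions.

```python
# Rough size of a complete "if they play X, you play Y" contingency plan.
# The branching factor and game length are ballpark assumptions, not exact figures.

branching_factor = 35  # typical number of legal moves in a middlegame position
opponent_turns = 40    # a plausible number of opponent moves in one game

distinct_lines = branching_factor ** opponent_turns
print(f"{distinct_lines:.2e}")  # ~5.8e+61 branches the plan would have to cover
```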
Regardless, if I’m an Oracle AI looking for the most likely plan, I’m now very concerned that you’ll have a heart attack, or an attack of arrogance, or otherwise mess up my perfect plan. Unlikely, sure, but I’m searching for the most “likely to work” here. So the actual plan I give you is “ask the grandmaster how his trip to Madrid went, then ask me for another plan”. Then the grandmaster realizes that I know about (say) his affair and will reveal it if he wins, and he attempts to lose as gracefully as possible. So now the outcome is much more robust to events.
I agree that highly agentic versions of the system will complete the tasks better. My claim is just that they’re not necessary to complete the task very well, and so we shouldn’t be confident that selection for completing that task very well will end up producing the highly agentic versions.
That helps, thanks. Raemon says:
The part where alignment is hard is precisely when the thing I’m trying to accomplish is hard. Because then I need a powerful plan, and it’s hard to specify a search for powerful plans that don’t kill everyone.
I now read you as pointing to chess as:
It is “hard to accomplish” from the perspective of human cognition.
It does not require a “powerful”/”agentic” plan.
It’s “easy” to specify a search for a good plan, we already did it.
So maybe alignment is like that.
Yepp. And clearly alignment is much harder than chess, but it seems like an open question whether it’s harder than “kill everyone” (and even if it is, there’s an open question of how much of an advantage we get from doing our best to point the system at the former not the latter).
“Kill everyone” seems like it should be “easy”, because there are so many ways to do it: humans only survive in environments with a specific range of temperatures, pressures, atmospheric contents, availability of human-digestible food, &c.
I agree the argument doesn’t work for A, B, and C, but I think the way it doesn’t work should make you pessimistic about how much we can trust the outputted plans in more complex task domains.
For A, it doesn’t seem certain to me that the AI will only generate plans which involve making a chess move. It has no reason to prefer simpler plans over more complex ones, and it may gain a lot by suggesting that the player, for instance, lobotomize their opponent, or hook it up to some agent AI in a way such that most worlds lead to it playing against a grandmaster with a lobotomy.
If you penalize the AI significantly for thought cycles, it will just output 100 different ways of connecting itself to an agent (or otherwise producing an optimization process which achieves its goal). If you don’t penalize it very much for thought cycles, it will come up with a way to win against its opponent, then add on a bunch of ways to ensure they’re lobotomized before the match.
Most ways of defining these goals seem to lead to most or all action sequences being bad, or having bad components, given too little or too much penalization for thought cycles. So as your ability to foresee the consequences of the actions taken decreases, you should also dramatically decrease the expected value of any particular generated plan. This means that in domains where the agent is actually useful, the only usable action plans are those which are easy to verify without actually executing them. This means D, E, F, G, I, and possibly H (depending on the form the solution takes) all pose astronomical risks.
Another possible solution would be to estimate how many thought cycles it should take to solve the problem, as well as how accurate that estimate needs to be in order to not result in optimisers or lobotomies, then only use solutions in that range.
Edit: the point is that the argument doesn’t work in the simpler cases because it’s easy to confirm the actions lie in an action space which we can verify won’t lead to catastrophic failure. For A you can just make sure the action space is that of chess moves; for C, that of formal proofs.
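For concreteness, here is a sketch of that kind of check for case A, assuming the oracle’s proposed plan arrives as a list of move strings; the UCI-style pattern and the example plans are assumptions, not any real interface.

```python
# Sketch: before executing anything the oracle proposes, confirm every step
# lies in an action space we can verify is harmless -- here, plain board moves.
# The UCI-style pattern and the example plans are illustrative assumptions.

import re

CHESS_MOVE = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")  # e.g. "e2e4", "e7e8q"


def plan_is_verifiably_safe(plan: list[str]) -> bool:
    """Accept a plan only if every step parses as an ordinary chess move."""
    return all(CHESS_MOVE.match(step) for step in plan)


print(plan_is_verifiably_safe(["e2e4", "g1f3"]))                       # True
print(plan_is_verifiably_safe(["e2e4", "email the grandmaster ..."]))  # False: rejected
```

No comparably simple predicate separates safe plans from dangerous ones in the D-H cases, which is the worry above.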
But I think Richard’s point is ‘but we totally built AIs that defeated chess grandmasters without destroying the world. So, clearly it’s possible to use tool AI to do this sort of thing. So… why do you think various domains will reliably output horrible outcomes? If you need to cure cancer, maybe there is an analogous way to cure cancer that just… isn’t trying that hard?’
Richard, is that what you were aiming at?
The reason we can easily make AIs which solve chess without destroying the world is that we can make specialized AIs which can only operate in the abstract environment of chess-board states, and in that environment we can tell them exactly what their goal should be.
If we tell an AGI to generate plans for winning at chess, and it knows about the outside world, then because the state space is astronomically larger, it is astronomically more difficult to tell it what its goal should be, and so any goal we do give it either satisfies corrigibility, and we can tell it “do what I want”, or incompletely captures what we mean by ‘win this chess game’.
For cancer, there may well be a way to solve the problem using a specialized AI, which works in an environment space simple enough that we can completely specify our goal. I assume, though, that we are using a general AI in all the hypothetical versions of the problem, which has the property ‘it’s working in an environment space large enough that we can’t specify what we want it to do’; or, if it doesn’t know a priori that its plans can affect the state of a far larger environment space which can in turn affect the environment space it cares about, it may deduce this, and figure out a way to exploit this feature.
I might be conflating Richard, Paul, and my own guesses here. But I think part of the argument here is about what can happen before AGI that gives us lines of hope to pursue.
Like, my-model-of-Paul wants various tools for amplifying his own thought to (among other things) help think about solving the long-term alignment problem. And the question is whether there are ways of doing that which actually help when trying to solve the sorts of problems Paul wants to solve. We’ve successfully augmented human arithmetic and chess. Are there tools we actually wish we had, that narrow AI meaningfully helps with?
I’m not sure if Richard has a particular strategy in mind, but I assume he’s exploring the broader question of “what useful things can we build that will help navigate x-risk”
The original dialogs were exploring the concept of pivotal acts that could change humanity’s strategic position. Are there AIs that can execute pivotal acts that are more like calculators and Deep Blue than like autonomous moon-base-builders? (I don’t know if Richard actually shares the pivotal act / acute risk period frame, or was just accepting it for sake of argument)
The problem is not whether we call the AI AGI or not; it’s whether we can either 1) fully specify our goals in the environment space it’s able to model (or otherwise not care too deeply about that environment space), or 2) verify that the actions it tells us to take have no disastrous consequences.
To determine whether a tool AI can be used to solve problems Paul wants to solve, or execute pivotal acts, we need to both 1) determine that the environment is small enough for us to accurately express our goal, and 2) ensure the AI is unable to infer the existence of a broader environment.
(meta note: I’m making a lot of very confident statements, and very few are of the form “<statement>, unless <other statement>, then <statement> may not be true”. This means I am almost certainly overconfident, and my model is incomplete, but I’m making the claims anyway so that they can be developed)
This is what I came here to say! I think you point out a crisp reason why some task settings make alignment harder than others, and why we get catastrophically optimized against by some kinds of smart agents but not others (like Deep Blue).