I agree the argument doesn’t work for A, B, and C, but I think the way it fails should make you pessimistic about how much we can trust the plans it outputs in more complex task domains.
For A, it doesn’t seem certain to me that the AI will only generate plans which involve nothing but making chess moves. It has no reason to prefer simpler plans over more complex ones, and it may gain a lot by suggesting that the player, for instance, lobotomize their opponent, or hook it up to some agentic AI in such a way that most resulting worlds have it playing against a lobotomized grandmaster.
If you penalize the AI significantly for thought cycles, it will just output 100 different ways of connecting itself to an agent (or otherwise producing an optimization process which achieves its goal). If you don’t penalize it very much for thought cycles, it will come up with a way to win against its opponent, then add on a bunch of ways to ensure the opponent is lobotomized before the match.
Most ways of defining these goals seem to lead to most or all action sequences being bad, or containing bad components, given too little or too much penalization for thought cycles. So as your ability to foresee the consequences of the actions decreases, you should also dramatically decrease the expected value of any particular generated plan. This means that in domains where the agent is actually useful, the only usable action plans are those which are easy to verify without actually executing them. This means D, E, F, G, I, and possibly H (depending on the form the solution takes) all pose astronomical risks.
Another possible solution would be to estimate how many thought cycles it should take to solve the problem, as well as how accurate that estimate needs to be in order not to result in optimisers or lobotomies, and then only use solutions in that range.
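To make the “accept only plans inside a thought-cycle window” idea concrete, here is a minimal sketch. Everything in it is hypothetical: it assumes each candidate plan comes with a readable thought-cycle count, and that we can produce a sensible estimate and tolerance, which is exactly the hard part.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    actions: list[str]     # the proposed action sequence
    thought_cycles: int    # cycles the planner reports having spent on this plan

def in_cycle_window(plan: Plan, estimated_cycles: int, tolerance: float) -> bool:
    """Accept a plan only if its reported cost falls inside the window we
    estimated a straightforward solution should need."""
    lower = estimated_cycles * (1 - tolerance)
    upper = estimated_cycles * (1 + tolerance)
    return lower <= plan.thought_cycles <= upper

def usable_plans(plans: list[Plan], estimated_cycles: int, tolerance: float) -> list[Plan]:
    """Keep only the generated plans that fall inside the window."""
    return [p for p in plans if in_cycle_window(p, estimated_cycles, tolerance)]
```

The intent of the window is that plans far below the estimate are suspiciously cheap (plausibly offloading the work to some other optimiser), while plans far above it are spending effort on things we didn’t ask for.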
Edit: the point is that the argument doesn’t apply to the simpler cases, because it’s very easy to verify that the actions are in an action space which we can check won’t lead to catastrophic failure. For A you can just make sure the action space is that of chess moves; for C, that of chess proofs.
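As a concrete illustration of how cheap that check is for A, here is a minimal sketch using the python-chess library. It assumes the plan is literally a list of UCI move strings, which is exactly the restriction that makes the verification easy.

```python
from typing import Optional

import chess  # pip install python-chess

def plan_is_only_chess_moves(moves: list[str], board: Optional[chess.Board] = None) -> bool:
    """Return True iff every step of the plan parses as a move that is legal
    from the current position. Anything else in the plan (advice about the
    opponent, instructions to run other programs, ...) simply fails the check."""
    board = board or chess.Board()
    for move_str in moves:
        try:
            move = chess.Move.from_uci(move_str)
        except ValueError:
            return False       # not even a well-formed chess move
        if move not in board.legal_moves:
            return False       # well-formed, but illegal from this position
        board.push(move)       # advance the position before checking the next step
    return True
```

The analogue for C would be a proof checker: reject the plan unless every step checks as a valid proof step.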
But I think Richard’s point is ‘but we totally built AIs that defeated chess grandmasters without destroying the world. So, clearly it’s possible to use tool AI to do this sort of thing. So… why do you think various domains will reliably output horrible outcomes? If you need to cure cancer, maybe there is an analogous way to cure cancer that just… isn’t trying that hard?’

Richard, is that what you were aiming at?
The reason why we can easily make AIs which solve chess without destroying the world is that we can make specialized AIs which only operate in the abstract environment of chess-board states, and in that environment we can tell them exactly what their goal should be.
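To illustrate what “we can tell them exactly what their goal should be” looks like when the environment really is just chess-board states, here is a minimal sketch, again using python-chess. The point is that the goal is stated entirely in terms of the board; nothing outside it is mentioned.

```python
import chess  # pip install python-chess

def goal_achieved(board: chess.Board, our_side: chess.Color = chess.WHITE) -> bool:
    """A complete goal specification over chess-board states: we have
    delivered checkmate. No fact about the wider world appears anywhere
    in the predicate, so there is nothing outside the board to exploit."""
    if not board.is_checkmate():
        return False
    # In a checkmate position, the side to move is the side that lost.
    return board.turn != our_side
```

Once the planner’s state space includes the outside world, there is no comparably short and complete predicate for ‘win this chess game’, which is the contrast drawn below.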
If we tell an AGI to generate plans for winning at chess, and it knows about the outside world, then because the state space is astronomically larger, it is astronomically more difficult to tell it what its goal should be. So any goal we do give it either satisfies corrigibility, in which case we can just tell it “do what I want”, or it incompletely captures what we mean by ‘win this chess game’.
For cancer, there may well be a way to solve the problem using a specialized AI which works in an environment space simple enough that we can completely specify our goal. I assume, though, that in all the hypothetical versions of the problem we are using a general AI, which works in an environment space large enough that we can’t specify what we want it to do. And even if it doesn’t know a priori that its plans can affect the state of a far larger environment space, which in turn can affect the environment space it cares about, it may deduce this and figure out a way to exploit that feature.
I might be conflating Richard, Paul, and my own guesses here. But I think part of the argument here is about what can happen before AGI that gives us lines of hope to pursue.
Like, my-model-of-Paul wants various tools for amplifying his own thought to (among other things) help think about solving the long-term alignment problem. And the question is whether there are ways of doing that which actually help when trying to solve the sorts of problems Paul wants to solve. We’ve successfully augmented human arithmetic and chess. Are there tools we actually wish we had that narrow AI meaningfully helps with?
I’m not sure if Richard has a particular strategy in mind, but I assume he’s exploring the broader question of “what useful things can we build that will help navigate x-risk?”
The original dialogs were exploring the concept of pivotal acts that could change humanity’s strategic position. Are there AIs that can execute pivotal acts that are more like calculators and Deep Blue than like autonomous moon-base-builders? (I don’t know if Richard actually shares the pivotal act / acute risk period frame, or was just accepting it for sake of argument)
The problem is not whether we call the AI an AGI or not; it’s whether we can either 1) fully specify our goals in the environment space it’s able to model (or otherwise not care too deeply about that environment space), or 2) verify that the actions it tells us to take have no disastrous consequences.
To determine whether a tool AI can be used to solve the problems Paul wants to solve, or to execute pivotal acts, we need to both 1) determine that the environment is small enough for us to accurately express our goal, and 2) ensure the AI is unable to infer the existence of a broader environment.
(meta note: I’m making a lot of very confident statements, and very few are of the form “<statement>, unless <other statement>, then <statement> may not be true”. This means I am almost certainly overconfident, and my model is incomplete, but I’m making the claims anyway so that they can be developed)
This is what I came here to say! I think you point out a crisp reason why some task settings make alignment harder than others, and why we get catastrophically optimized against by some kinds of smart agents but not others (like Deep Blue).