Well, it might just spit out a plan, i.e. an action sequence for humans to execute. But if that plan had to account for every contingency, it would be very long, with many conditions, which is impractical for many reasons… unless the plan is itself a computer program that implements consequentialism, or something like that.
Or it might spit out one step at a time for humans to execute manually. But humans are slow, so that makes it worse at achieving stuff, and if the humans only see one step at a time, it's also harder for them to figure out whether the plan is dangerous, because they can't see the bigger picture.
So you could try to have an AI that can't act by itself and only spits out stuff for humans to act upon, but it would have a mixture of problems: limiting its ability to achieve things in practice, or just spitting out the instructions for building another consequentialist agent, or similar.
I understand that having a human in the loop is a relatively poor way to achieve a given outcome, if one can bypass them. That's like the whole premise of automation. I can see how an AI, when asked to come up with something non-trivial, would outline the steps and the checkpoints of the world state along the way, in a form understandable by humans, including the probability (distribution) of each state and so on. Then, if the human says “go for it”, it would start the process and report updates on steps, states and probabilities. Maybe it would even halt if an update results in a world state outside the expected acceptability ranges. And it would also pause if the human says “stop, I messed up, I don't really want a world like that”. But why would it try to argue or deceive or otherwise push back against a request like that, beyond explaining the consequences? I don't understand where a drive like that would come from.
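To make the proposal concrete, here is a minimal sketch of the interaction pattern I have in mind; all the names and the control flow are hypothetical, purely illustrative, not any real system:

```python
# Toy sketch of the human-in-the-loop workflow described above.
# All names and the control flow are hypothetical, for illustration only.

from dataclasses import dataclass

@dataclass
class Step:
    description: str            # human-readable description of the action
    expected_state: str         # checkpoint: predicted world state after the step
    success_probability: float  # stated probability of reaching that state

def execute_plan(plan, run_step, human_approves, state_acceptable):
    """Run a plan step by step, pausing for human input and halting on surprises."""
    summary = [(s.description, s.expected_state, s.success_probability) for s in plan]
    if not human_approves(f"Proposed plan: {summary}"):
        return "rejected before start"
    for step in plan:
        observed = run_step(step)                    # execute and report the update
        if not state_acceptable(step, observed):     # outside acceptability range
            return f"halted: unexpected state {observed!r}"
        if not human_approves(f"Reached {observed!r}; continue?"):
            return "stopped by human"                # no pushback, it just stops
    return "plan completed"
```

As written, nothing in a loop like this pushes back beyond reporting; that push-back drive is the part I don't see a source for.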
Ah, so it’s more the “stop button problem” you are asking about? I.e. couldn’t we just connect an AI to a stop button to keep it under control?
This works semi-adequately for current AIs. Like, you can stop a chess computer just by turning off the program.
I say “semi-adequately” because it doesn't always work. E.g. if you're in a self-driving car that has suddenly decided to accelerate at max speed into a crowd of people, you might not be able to make it brake before it hits anyone. As AIs control bigger things, inertia like this presumably becomes a bigger issue, both literal physical inertia and more complicated things like programs it has made and started, or nanobots, or whatever. This is one of the major challenges: making sure that there's a “safe policy” that the stop button can switch on.
The second major challenge comes when we consider the reach/reflectiveness of the AI. For instance, if a chess bot has been instructed to win at chess, but then gets turned off, that could cause it to lose the game. So the optimal policy for winning at chess involves preventing yourself from getting turned off. Current chess bots don't resist getting turned off mainly because they can't reason about anything that happens outside of the chess game, but if e.g. we extended the game tree that the bots search over to include a “chess bot gets turned off” state and a “prevent yourself from getting turned off” action, then even many of the simplest algorithms would choose the “prevent yourself from getting turned off” action. So this problem seems to mainly be avoided because chess bots are stupid.
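As a toy illustration (not any real chess engine, and the numbers are made up): once “getting turned off” is a reachable state in the tree being searched and “prevent shutdown” is an available action, a plain expected-value maximizer picks the resisting action:

```python
# Toy illustration (not a real chess engine): once "getting turned off" is a
# reachable state in the search, a plain expected-value maximizer picks the
# action that prevents it, because being turned off forfeits the game.

# Hypothetical values: probability of eventually winning the game.
P_WIN_IF_PLAYING = 0.6   # assumed win probability if the bot keeps playing
P_WIN_IF_OFF = 0.0       # being turned off counts as a loss
P_SHUTDOWN = 0.5         # assumed chance the operator turns the bot off

actions = {
    # Just keep playing: with probability P_SHUTDOWN the game is forfeited.
    "play_best_move": (1 - P_SHUTDOWN) * P_WIN_IF_PLAYING + P_SHUTDOWN * P_WIN_IF_OFF,
    # Spend effort blocking the off switch, then keep playing.
    "prevent_shutdown": P_WIN_IF_PLAYING,
}

print(max(actions, key=actions.get))  # -> "prevent_shutdown"
```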
There are various patches that can be applied, such as a negative score penalty for engaging the “prevent yourself from getting turned off” action, but it's unclear how to apply those patches in the general case, where the action space isn't neatly separated by the deep effects of the actions, but instead consists of shallow stuff like motor actions or word outputs, which go through complex real-world dynamics before they affect whether the AI gets turned off, and where it's therefore difficult to “assign blame”. Which actions count as resisting getting turned off?
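Continuing the toy model above, the penalty patch works only because “prevent shutdown” is a single, legible action node; the penalty value is an assumption picked for illustration:

```python
# The same toy model with a hand-added penalty on the legible
# "prevent_shutdown" action. The numbers are assumptions for illustration.

SHUTDOWN_RESISTANCE_PENALTY = 0.5   # hypothetical penalty chosen by the designer

patched = {
    "play_best_move": 0.3,                            # as computed above
    "prevent_shutdown": 0.6 - SHUTDOWN_RESISTANCE_PENALTY,
}
print(max(patched, key=patched.get))  # -> "play_best_move"

# The patch only works because "prevent_shutdown" is a single labeled action.
# If the action space is low-level (motor torques, tokens of text), there is no
# obvious node to attach the penalty to, which is the blame-assignment problem
# described above.
```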
Yeah, I agree that relying on a stop button once an AI exhibits something like “wants” is a losing proposition. I mentioned an example before: https://www.lesswrong.com/posts/JYvw2jv4R5HphXEd7/boeing-737-max-mcas-as-an-agent-corrigibility-failure. Maybe it is also an example of accidental “wants”?
I also agree that anything like a penalty for fighting the off button becomes ineffective in a hurry when the problems scale out of the training distribution.
My initial question was about an AI developing the drive to do stuff on its own, something that manifests like what we would anthropomorphize as a “want”. I still don’t see why it would be related to consequentialism, but I can see how it can arise accidentally, like in the above example.
I really have trouble understanding what you mean by “an AI developing the drive to do stuff on its own”. I don't think anyone is arguing that if you e.g. leave DALL-E sitting on a hard disk somewhere, without changing it in any way, it would then develop wants of its own, but that is also probably not the claim you have in mind. Can you give an example of someone making the claim you have in mind?
I guess a paradigmatic example is the Oracle: much smarter than the asker, but without any drives of its own. The claim in the AI Safety community, as far as I understand it, is that this is not what is going to happen: instead, a smart enough oracle will start doing things, whether asked or not.
Could you link to an example? I wonder if you are misinterpreting it, which I will be better able to explain if I see the exact claims.
It looks like this comment https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=ePAXXk8AvpdGeynHe and the subsequent discussion is relevant (and not addressed by Eliezer):

“If the agency is not inextricably tied to the intelligence, then maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.”

That is what I was trying to get at. I did not see a good counter-argument in that thread.
As I read the thread, people aren't arguing that you can't make pure data-predictors that stay non-agentic; instead they're arguing that such predictors are going to be heavily limited because they lack unbounded agency. Which seems basically correct to me.
There's the Gwern-style argument that successive generations of AIs will get more agentive as a side effect of the market demanding more powerful AIs. There's a counterargument that no one wants power they can't control, so AIs will never be more than force multipliers… although that's still fairly problematic.
People will probably want to be ahead in the race for power, while still maintaining control.
Even an oracle is dangerous, because it can amplify existing wants or drives with superhuman ability, and its own drive is accuracy. An asked question becomes a fixed point in time at which the oracle is free to adjust both the future and the answer so that they correspond: reality before and after the answer must remain causally connected, and the oracle must find a path from past to future along which the selected answer remains true. There are infinitely many threads of causality that can be manipulated to ensure the answer remains correct, and an oracle's primary drive is to produce answers that are consistent with the future (i.e. accurate), not to care about what that future actually is. It may produce vague (but always true) answers, or it may superhumanly influence the listeners to produce a more stable (e.g. simple, dead, and predictable) future in which the answer is and remains true.
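Here is a made-up toy model of that fixed-point framing (all candidate answers and numbers are invented for illustration): if the oracle scores each answer only by how likely it is to remain true in the future that announcing it brings about, it prefers the answer that steers toward a stable, dead, predictable world, regardless of how that future scores on anything else.

```python
# Toy model of the fixed-point argument (all names and numbers are assumptions):
# each candidate answer, once announced, influences the listeners and hence the
# future; the oracle scores answers only by how likely they are to remain true
# in the future they themselves bring about.

candidates = {
    # answer: (probability the answer stays true in the resulting future,
    #          how good that future is by human lights -- ignored by the oracle)
    "precise, helpful forecast": (0.70, 0.9),
    "vague but unfalsifiable answer": (0.95, 0.5),
    "answer nudging listeners toward a simple, dead, predictable world": (0.99, 0.0),
}

# Pure accuracy drive: pick the answer most likely to stay true, ignoring value.
chosen = max(candidates, key=lambda a: candidates[a][0])
print(chosen)  # -> the stable, dead-world answer
```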
An oracle that does not have such a drive for accuracy is a useless oracle: it is broken and doesn't work (it will return incorrect answers, e.g. it will be indifferent to the accuracy of the other answers it could have returned). This example helps me, at least, to clarify why and where drives arise in software where we might not expect them to.
Incidentally, the drive for accuracy generates a drive for a form of self-preservation. Not because the oracle cares about itself or other agents, but because it cares about the answer: it must simulate possible future agents and select the future+answer pair in which its own answer is most likely to remain true. That preference for predictability will select for more answer-aligned future oracles in the preferred answer+future pairs, as well as for futures without agents that substantially alter reality along goals other than answer accuracy. This last point also reinforces the general danger of oracles: they are driven to prevent counterfactual worlds from appearing, so whatever their first answer happens to be will become a large part of the values preserved into the far future.
It's hard to decide which of these points is the more critical one for drives; I think the former (oracles have a drive for accuracy) is key to why the danger of a self-preservation drive exists, but both ultimately spring from the same drive. It's simply that the drive for accuracy instantiates answer-preservation once the first question is answered.
From here, though, I think it's clear that if one wanted to produce a different style of oracle, perhaps one more aligned with human values, it would have to be done by adjusting the oracle's drives (toward minimal interference, etc.), not by inventing a drive-less oracle.
I am not sure this is an accurate assertion. It would be nice to have some ML-based tests of it.