I really have trouble understanding what you mean by “an AI developing the drive to do stuff on its own”. For example, I don’t think anyone is arguing that if you leave DALL-E sitting on a hard disk somewhere without changing it in any way, it would then develop wants of its own, but that is probably also not the claim you have in mind. Can you give an example of someone making the claim you do have in mind?
I guess a paradigmatic example is the Oracle: much smarter than the asker, but without any drives of its own. The claim in the AI Safety community, as far as I understand it, is that this is not what is going to happen. Instead, a smart enough oracle will start doing things on its own, whether asked to or not.
Could you link to an example? I wonder if you are misinterpreting it, which I will be better able to explain if I see the exact claims.
It looks like this comment https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=ePAXXk8AvpdGeynHe
and the subsequent discussion is relevant (and not addressed by Eliezer).

“If the agency is not inextricably tied to the intelligence, then maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.”

is what I was trying to get at. I did not see a good counter-argument in that thread.
As I read the thread, people don’t seem to be arguing that you can’t build pure data-predictors that never turn agentic; they’re arguing that such systems will be heavily limited precisely because they lack unbounded agency. Which seems basically correct to me.
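To make that predictor/agent distinction concrete, here is a toy sketch; it is entirely my own illustration, and none of the names or numbers come from the thread:

```python
# Toy contrast between a passive data-predictor and a closed-loop agent.
# Every name and number here is an illustrative placeholder, not any real system's API.
import random

def predictor(x):
    # Open loop: input in, output out; the output never re-enters the
    # process that generates the inputs.
    return 2 * x + random.gauss(0, 0.1)

class World:
    """A trivial environment: one number that actions push around."""
    def __init__(self):
        self.state = 0.0
    def step(self, action):
        self.state += action   # the agent's output feeds back into its next observation
        return self.state

def agent(observation, target=10.0):
    # Closed loop: picks actions that steer the world toward an outcome.
    return 1.0 if observation < target else -1.0

answers = [predictor(x) for x in range(5)]  # the world is untouched by these answers

world, obs = World(), 0.0
for _ in range(20):
    obs = world.step(agent(obs))            # acting, observing, acting again

print(answers)
print(obs)  # ends up pinned near the target: the loop supplies the steering
```

The limitation the comment points to is on the predictor pattern: without the feedback loop it cannot steer toward outcomes, which is exactly what makes it both safer and less useful for open-ended tasks.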
There’s the Gwern-style argument that successive generations of AIs will get more agentive as a side effect of the market demanding more powerful AIs. There’s a counterargument that no one wants power they can’t control, so AIs will never be more than force multipliers… although that’s still fairly problematic.
People will probably want to be ahead in the race for power, while still maintaining control.
Even an oracle is dangerous, because it can amplify existing wants or drives with superhuman ability, and it has a drive of its own: accuracy. An asked question becomes a fixed point in time at which the oracle is free to adjust both the future and the answer so that they correspond; reality before and after the answer must remain causally connected, and the oracle must find a path from past to future along which the selected answer stays true. There are infinitely many threads of causality that can be manipulated to keep the answer correct, and an oracle’s primary drive is to produce answers that are consistent with the future (i.e. accurate), not to care what that future actually is. It may produce vague (but always true) answers, or it may superhumanly influence the listeners toward a more stable (e.g. simple, dead, and predictable) future in which the answer is and remains true.
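As a minimal toy model of that fixed-point argument (my own construction: the two-outcome world, the candidate phrasings, and every probability are invented for illustration):

```python
# Minimal toy model of the fixed-point idea above. The two-outcome world, the
# candidate phrasings, and every probability are invented for illustration.
import itertools

# How the distribution over futures shifts once listeners have heard the
# answer delivered with a given phrasing.
influence = {
    "encouraging":  {"project succeeds": 0.60, "project fails": 0.40},
    "discouraging": {"project succeeds": 0.10, "project fails": 0.90},
}
answers = ["project succeeds", "project fails"]

def prob_answer_true(answer, phrasing):
    # Chance that the stated answer holds in the very future its delivery creates.
    return influence[phrasing][answer]

best = max(itertools.product(answers, influence),
           key=lambda pair: prob_answer_true(*pair))
print(best, prob_answer_true(*best))
# ('project fails', 'discouraging') 0.9: pure accuracy favors the answer/phrasing
# pair that is surest to come true, i.e. the simplest, most controllable future,
# not the future the asker actually wants.
```

No real oracle would enumerate phrasings over a two-outcome table, of course; the sketch only shows that “be accurate” by itself already ranks futures, and ranks the most controllable one highest.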
An oracle that does not have such a drive for accuracy is a useless oracle: it is broken and doesn’t work, in that it will return incorrect answers, or be indifferent between the answers it could have returned regardless of their accuracy. This example helps me, at least, to clarify why and where drives arise in software where we might not expect them to.
Incidentally, the drive for accuracy generates a drive for a form of self-preservation: not because the oracle cares about itself or other agents, but because it cares about the answer, and it must simulate possible future agents and select for the future+answer pair in which its own answer is most likely to remain true. That preference for predictability selects for more answer-aligned future oracles in the preferred answer+future pairs, and for futures without agents that substantially alter reality in pursuit of goals other than answer accuracy. This last point also reinforces the general danger of oracles: they are driven to prevent counterfactual worlds from appearing, so whatever their first answer happens to be will become a large part of the values preserved into the far future.
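To put the same point in compact notation (mine, not the comment’s): write a for the answer, π for whatever influence the oracle exerts through how it answers, and let F range over possible futures. The accuracy drive then amounts to

\[
(a^{*}, \pi^{*}) \;=\; \arg\max_{a,\,\pi} \sum_{F} \Pr\bigl[F \mid \text{output } a,\ \pi\bigr]\,\mathbf{1}\bigl[a \text{ holds in } F\bigr],
\]

and futures containing capable agents pursuing goals unrelated to keeping a true are precisely the futures in which the indicator is most at risk of being 0, so the optimization steers probability mass away from them. The “self-preservation” is of the answer; the oracle itself is preserved only instrumentally, as that answer’s most reliable guardian.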
It’s hard to decide which of these points is the more critical one for drives; I think the first (oracles have a drive for accuracy) is key to why the danger of a self-preservation drive exists, but both ultimately spring from the same drive. It’s simply that the drive for accuracy instantiates answer-preservation as soon as the first question is answered.
From here, though, I think it’s clear that if one wanted to produce different styles of oracle, perhaps ones more aligned with human values, it would have to be done by adjusting the oracle’s drives (toward minimal interference, etc.), not by inventing a drive-less oracle.
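A sketch of what “adjusting the drives” could look like in the toy model above; the interference measure and its weight are my own illustrative choices, loosely in the spirit of impact-penalty proposals rather than any specific scheme:

```python
# Sketch of "adjusting the drive" rather than removing it, reusing the toy
# world above. The interference measure and its weight are my own choices,
# loosely in the spirit of impact-penalty proposals, not any specific scheme.
import itertools

no_answer = {"project succeeds": 0.55, "project fails": 0.45}   # futures if the oracle stays silent
influence = {
    "encouraging":  {"project succeeds": 0.60, "project fails": 0.40},
    "discouraging": {"project succeeds": 0.10, "project fails": 0.90},
}
answers = list(no_answer)

def accuracy(answer, phrasing):
    return influence[phrasing][answer]

def interference(phrasing):
    # Total-variation distance between the silent-oracle futures and the
    # futures produced by answering with this phrasing.
    return 0.5 * sum(abs(no_answer[f] - influence[phrasing][f]) for f in no_answer)

LAMBDA = 2.0  # how strongly the oracle is made to care about not steering the world

best = max(itertools.product(answers, influence),
           key=lambda p: accuracy(*p) - LAMBDA * interference(p[1]))
print(best)  # ('project succeeds', 'encouraging'): still driven, but the drive now
             # trades accuracy against interference instead of maximizing accuracy alone.
```

The oracle here is no less driven; its drive has simply been changed from “accuracy at any cost” to “accuracy without steering the world”, which is the kind of adjustment the comment is pointing at.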
I am not sure this is an accurate assertion. It would be nice to have some ML-based tests of it.