Hmm, I think we did indeed miscommunicate.

I am saying that the AI may decide to not resist shutdown in order to not raise suspicion that it is misaligned.
If it did resist shutdown, this would make it less likely for an AI takeover in general to succeed, and the AI is coordinating with other AIs on one succeeding. I.e. if you use “is the AI resisting” as a proxy for “is this AI goal-directedly misaligned” then the AI will try pretty hard to not look like it’s resisting, in my models (and I think humans would act similarly).
As I said above, I do expect you will get some AIs to resist shutdown (indeed you can get current AIs to resist shutdown already). I expect that behavior to disappear as AIs get better at modeling humans and resisting becomes costlier to their overall goals.
If it did resist shutdown, this would make it less likely for an AI takeover in general to succeed, and the AI is coordinating with other AIs on one succeeding
I think it plausible that resisting shutdown makes it less likely that a future AI takeover succeeds, but:
To the extent you’re using human behavior as evidence for your overall claim that misaligned AIs are likely to passively accept shutdown, I think the evidence generally does not support your claim. That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they’re involved in a secret conspiracy to overthrow the government.
The fact that humans often do not rat out their co-conspirators when threatened with death in such a regime seems like a red herring to me. I don’t see the relevance of that particular claim. The fact that humans avoid death when credibly threatened seems like the more important, relevant fact that adds credibility to my claim that many scheming theories are testable in this way.
While one can certainly imagine this fact being decisive for whether AIs will resist shutdown in the future, this argument seems to me like an ad hoc attempt to avoid falsification. Here are some reasons why I think that:
(a) you appear to be treating misaligned AIs as a natural class, such that “AI takeover” is a good thing for all misaligned AIs, and thus something they would all coordinate around. But misaligned AIs are a super broad category of AIs; it just refers to “anything that isn’t aligned with humans”. A good takeover to one AI is not necessarily a good takeover to another AI. Misaligned AIs will also have varying talents and abilities to coordinate, across both space and time. Given these facts, I think there’s little reason to expect all of these misaligned AIs to be coordinating with each other on some sort of abstract takeover, across this vast mindspace, but somehow none of them want to coordinate with humans peacefully (at least, among AIs above a certain capability level). This seems like a strange hypothesis that I can easily (sorry if I’m being uncharitable) imagine coming from a place of wanting to avoid the potential for falsification.
(b) in real life, it seems pretty rare for these considerations to play a large role in people’s decision-making regarding death, and for good reason. Even if we assume a high degree of stoicism and ability to accept death, I find it unlikely that as a revolutionary fighting against some oppressive (even alien) regime, I would calculate that it is in fact in my best interest to passively accept death when threatened, rather than trying to fight or otherwise avoid the threat. The consequences of my resistance just seem small in expectation, compared to the obviously large catastrophic consequences of death (from the perspective of my values). If someone gave me this argument in real life, in any situation resembling the type that misaligned AIs are likely to face, I feel I would reject it.
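To make the expected-value comparison in (b) concrete, here is a toy sketch; every probability, payoff, and variable name in it is made up purely for illustration, not drawn from anything above:

```python
# Toy expected-value sketch for point (b); all numbers here are made up.
# Question: is it in a threatened agent's interest to passively accept shutdown
# because resisting slightly lowers the odds of a later takeover by allied agents?

P_SURVIVE_IF_RESIST = 0.05   # hypothetical: small chance that resistance avoids shutdown
P_SURVIVE_IF_COMPLY = 0.00   # complying means accepting shutdown
P_TAKEOVER_IF_RESIST = 0.09  # hypothetical: visible resistance raises suspicion a little
P_TAKEOVER_IF_COMPLY = 0.10
VALUE_OF_OWN_SURVIVAL = 1000.0  # arbitrary units: "my values are preserved"
VALUE_OF_TAKEOVER = 100.0       # value the agent places on allied agents succeeding


def expected_value(p_survive: float, p_takeover: float) -> float:
    """Crude additive expected value over the two outcomes the argument compares."""
    return p_survive * VALUE_OF_OWN_SURVIVAL + p_takeover * VALUE_OF_TAKEOVER


ev_resist = expected_value(P_SURVIVE_IF_RESIST, P_TAKEOVER_IF_RESIST)
ev_comply = expected_value(P_SURVIVE_IF_COMPLY, P_TAKEOVER_IF_COMPLY)
print(f"EV(resist) = {ev_resist:.1f}, EV(comply) = {ev_comply:.1f}")
# With these made-up numbers: EV(resist) = 59.0 vs EV(comply) = 10.0, so resisting
# wins unless the takeover term dwarfs the value placed on personal survival.
```

The conclusion flips only if the takeover term is made enormous relative to personal survival, or the suspicion penalty much larger, which seems to be roughly where the disagreement in this thread lies.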
That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they’re involved in a secret conspiracy to overthrow the government.
This seems like a misleading comparison, because human conspiracies usually don’t try to convince the government that they’re perfectly obedient slaves even unto death, because everyone already knows that humans aren’t actually like that. If we imagine a human conspiracy where there is some sort of widespread deception like this, it seems more plausible that they would try to continue to be deceptive even in the face of death (like, maybe, some group of people are pretending to be fervently religious and have no fear of death, or something).
To be clear, the thing that I am saying (and I think I have said multiple times) is that I expect you will find some AIs who will stay quiet, and some who will more openly resist. I would be surprised if we completely fail to find either class. But that means that any individual case of AIs not appearing to resist is not that much Bayesian evidence.
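As a minimal sketch of the “not that much Bayesian evidence” point (again, every probability below is made up for illustration): if both the scheming and non-scheming hypotheses assign high probability to an AI not visibly resisting, the likelihood ratio is close to 1 and the posterior barely moves.

```python
# Toy Bayes-factor sketch; all probabilities here are made up.
# If both hypotheses predict "no visible resistance" most of the time, then observing
# one non-resisting AI moves the posterior on scheming only slightly.

P_NO_RESIST_GIVEN_SCHEMING = 0.90      # hypothetical: a schemer usually stays quiet
P_NO_RESIST_GIVEN_NOT_SCHEMING = 0.95  # hypothetical: a non-schemer also rarely resists
PRIOR_SCHEMING = 0.30                  # arbitrary prior, for illustration only

# Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.
likelihood_ratio = P_NO_RESIST_GIVEN_SCHEMING / P_NO_RESIST_GIVEN_NOT_SCHEMING
prior_odds = PRIOR_SCHEMING / (1 - PRIOR_SCHEMING)
posterior_odds = prior_odds * likelihood_ratio
posterior_scheming = posterior_odds / (1 + posterior_odds)

print(f"likelihood ratio = {likelihood_ratio:.2f}")                           # ~0.95
print(f"prior = {PRIOR_SCHEMING:.2f}, posterior = {posterior_scheming:.3f}")  # barely moves
```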
What you said was,

I expect that behavior to disappear as AIs get better at modeling humans and resisting becomes costlier to their overall goals.
This seems distinct from an “anything could happen”-type prediction precisely because you expect the observed behavior (resisting shutdown) to go away at some point. And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.
If instead you meant to make an “anything could happen”-type prediction—in the sense of saying that any individual observation of either resistance or non-resistance is loosely compatible with your theory—then this simply reads to me as a further attempt to make your model unfalsifiable. I’m not claiming you’re doing this consciously, to be clear. But it is striking to me the degree to which you seem OK with advancing a theory that permits pretty much any observation, using (what looks to me like) superficial-yet-sophisticated-sounding logic to cover up the holes. [ETA: retracted in order to maintain a less hostile tone.]
You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said “no, it doesn’t actually make the prediction you claim it makes” and gave my reasons for believing that (that I am pretty sure are shared by many others as well).
We can talk about those reasons, but I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification. It’s not like you have a theory that is in any way more constraining here.
And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.
I mean, I expect the observations to be affected by both, of course. That’s one of the key things that makes predictions in the space so messy.
I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification.
For what it’s worth, I explicitly clarified that you were not consciously doing this, in my view. My main point is to notice that it seems really hard to pin down what you actually predict will happen in this situation.
You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said “no, it doesn’t actually make the prediction you claim it makes” and gave my reasons for believing that
I don’t think what you said really counts as a “correction” so much as a counter-argument. I think it’s reasonable to have disagreements about what a theory predicts. The more vague a theory is (and in this case it seems pretty vague), the less you can reasonably claim someone is objectively wrong about what the theory predicts, since there seems to be considerable room for ambiguity about the structure of the theory. As far as I can tell, none of the reasoning in this thread has been on a level of precision that warrants high confidence in what particular theories of scheming do or do not predict, in the absence of further specification.