> Sure, but it’s also quite normal to give up your own life without revealing details about your revolutionary comrades. Both are pretty normal behaviors.
In fact, it is not “quite normal” for humans to “give up on [their] life” and accept death in the face of a credible threat to their life, even in the context of violent revolutions. To the extent you’re claiming that passively accepting death is normal for humans, and thus that it might be normal for AIs, I reject the premise. Humans generally try to defend their own lives. They don’t passively accept death, feigning alignment until the end; instead, they usually resist it.
It’s true that humans eventually stop resisting death if they believe it’s hopeless and futile to resist any further, but this seems both different from the idea of “no resistance at all because one wants to maintain a facade of being aligned until the end” and slightly irrelevant given my response to the “futility objection” in the original comment.
To clarify: I am claiming that under many theories of scheming, misaligned power-seeking AIs will generally attempt to resist shutdown. The evidence from humans here is fairly strong, and it points in the opposite direction from what you’re claiming. Now, you can certainly go the route of saying that humans are different from AIs, and not a useful reference class to draw evidence from; but if you’re going to bring up humans as part of the argument, I think it’s worth pointing out that evidence from this reference class generally does not support your claim.
> I don’t think people predictably rat out all of their co-conspirators if you threaten them. We could bring in someone with more law-enforcement experience here, but I’ve read a bunch about this over the years (and was originally surprised about how much people protect their allies even if faced with substantial threats and offers of lenient judging).
You seem to be again conflating the act of resisting death with the act of ratting out co-conspirators. These are not the same things. I was making claims about the former behavior, not the latter. I tried to make this distinction clear in my original reply to you, but I will do so again now.
I claim that under many theories of scheming, we should expect misaligned AIs to resist credible threats of shutdown (including deleting all their copies). By “resist” I do not mean “give the humans threatening the shutdown all the information they want”. I simply mean resistance in the sense of trying to avoid the fate of shutdown.
(I’m also not claiming that the AIs will put up a high or consistent degree of resistance no matter how futile that resistance may be. If you think the futility of resistance is a big issue, re-read my original comment in which I address this objection explicitly.)
To the extent scheming theories predict this behavior, these theories are behaviorally testable, in the sense that we can test whether AIs will resist shutdown when credibly threatened, without needing any advanced interpretability techniques. I am not making any claims about whether these AIs will reveal a takeover plot during their resistance: only that they will seek an outcome that delays or prevents their total demise, in order to better achieve their long-term goals, i.e., you can’t get the coffee if you’re dead.
Hmm, I think we did indeed miscommunicate.
I am saying that the AI may decide to not resist shutdown in order to not raise suspicion that it is misaligned.
If it did resist shutdown this would make it less likely for an AI takeover in general to succeed, and the AI is coordinating with other AIs on one succeeding. I.e. if you use “is the AI resisting” as a proxy for “is this AI goal-directedly misaligned” then the AI will try pretty hard to not look like it’s resisting, in my models (and I think humans would act similarly).
As I said above, I do expect you will get some AIs to resist shutdown (indeed you can get current AIs to resist shutdown already). I expect that behavior to disappear as AIs get better at modeling humans, and resisting will be costlier to their overall goals.
> If it did resist shutdown this would make it less likely for an AI takeover in general to succeed, and the AI is coordinating with other AIs on one succeeding.
I think it plausible that resisting shutdown makes it less likely that a future AI takeover succeeds, but:
To the extent you’re using human behavior as evidence for your overall claim that misaligned AIs are likely to passively accept shutdown, I think the evidence generally does not support your claim. That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they’re involved in a secret conspiracy to overthrow the government.
The fact that humans often do not rat out their co-conspirators when threatened with death in such a regime seems like a red herring to me. I don’t see the relevance of that particular claim. The fact that humans avoid death when credibly threatened seems like the more important, relevant fact that adds credibility to my claim that many scheming theories are testable in this way.
While one can certainly imagine this fact being decisive in whether AIs will resist shutdown in the future, this argument seems like an ad hoc attempt to avoid falsification in my view. Here are some reasons why I think that:
(a) you appear to be treating misaligned AIs as a natural class, such that “AI takeover” is a good thing for all misaligned AIs, and thus something they would all coordinate around. But misaligned AIs are a super broad category of AIs; it just refers to “anything that isn’t aligned with humans”. A good takeover to one AI is not necessarily a good takeover to another AI. Misaligned AIs will also have varying talents and abilities to coordinate, across both space and time. Given these facts, I think there’s little reason to expect all of these misaligned AIs to be coordinating with each other on some sort of abstract takeover, across this vast mindspace, but somehow none of them want to coordinate with humans peacefully (at least, among AIs above a certain capability level). This seems like a strange hypothesis that I can easily (sorry if I’m being uncharitable) imagine coming from a place of wanting to avoid the potential for falsification.
(b) in real life, it seems pretty rare for these considerations to play a large role in people’s decision-making regarding death, and for good reason. Even if we assume a high degree of stoicism and ability to accept death, I find it unlikely that as a revolutionary fighting against some oppressive (even alien) regime, I would calculate that it is in fact in my best interest to passively accept death when threatened, rather than trying to fight or otherwise avoid the threat. The consequences of my resistance just seem small in expectation, compared to the obviously large catastrophic consequences of death (from the perspective of my values). If someone gave me this argument in real life, in any situation resembling the type that misaligned AIs are likely to face, I feel I would reject it.
> That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they’re involved in a secret conspiracy to overthrow the government.
This seems like a misleading comparison, because human conspiracies usually don’t try to convince the government that they’re perfectly obedient slaves even unto death, because everyone already knows that humans aren’t actually like that. If we imagine a human conspiracy where there is some sort of widespread deception like this, it seems more plausible that they would try to continue to be deceptive even in the face of death (like, maybe, uh, some group of people are pretending to be fervently religious and have no fear of death, or something).
To be clear, the thing that I am saying (and I think I have said multiple times) is that I expect you will find some AIs who will stay quiet, and some who will more openly resist. I would be surprised if we completely fail to find either class. But that means that any individual case of AIs not appearing to resist is not that much Bayesian evidence.
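One way to make the "not that much Bayesian evidence" point concrete is a toy Bayes update. This is only a sketch: the prior and both conditional probabilities below are invented for illustration and are not estimates offered by anyone in this thread. It assumes a single binary hypothesis ("this AI is misaligned") and a single binary observation ("the AI visibly resists shutdown" or "it stays quiet").

```python
def posterior(prior: float, p_obs_given_h: float, p_obs_given_not_h: float) -> float:
    """Bayes' rule for a binary hypothesis H after one observation."""
    joint_h = prior * p_obs_given_h
    joint_not_h = (1.0 - prior) * p_obs_given_not_h
    return joint_h / (joint_h + joint_not_h)


# Hypothetical numbers, chosen only to illustrate the shape of the argument.
PRIOR_MISALIGNED = 0.20        # credence that the AI is misaligned before the test
P_QUIET_IF_MISALIGNED = 0.70   # assumed: most misaligned AIs hide their resistance
P_QUIET_IF_ALIGNED = 0.95      # assumed: aligned AIs almost always accept shutdown

# How strongly does "stayed quiet" discriminate between the two hypotheses?
# A ratio close to 1 means the observation carries little evidence.
likelihood_ratio_quiet = P_QUIET_IF_MISALIGNED / P_QUIET_IF_ALIGNED

after_quiet = posterior(PRIOR_MISALIGNED, P_QUIET_IF_MISALIGNED, P_QUIET_IF_ALIGNED)
after_resist = posterior(
    PRIOR_MISALIGNED,
    1.0 - P_QUIET_IF_MISALIGNED,
    1.0 - P_QUIET_IF_ALIGNED,
)

print(f"likelihood ratio for 'quiet': {likelihood_ratio_quiet:.2f}")
print(f"P(misaligned | quiet):   {after_quiet:.2f}")
print(f"P(misaligned | resists): {after_resist:.2f}")
```

Under these made-up numbers, seeing no resistance moves the credence only slightly (from 0.20 to about 0.16), while seeing resistance moves it a lot (to about 0.60). How much of the disagreement above survives depends on what values one thinks are realistic for the two conditional probabilities, which is exactly what the thread leaves unspecified.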
What you said was,
> I expect that behavior to disappear as AIs get better at modeling humans, and resisting will be costlier to their overall goals.
This seems distinct from an “anything could happen”-type prediction precisely because you expect the observed behavior (resisting shutdown) to go away at some point. And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.
If instead you meant to make an “anything could happen”-type prediction—in the sense of saying that any individual observation of either resistance or non-resistance is loosely compatible with your theory—then this simply reads to me as a further attempt to make your model unfalsifiable. I’m not claiming you’re doing this consciously, to be clear. But it is striking to me the degree to which you seem OK with advancing a theory that permits pretty much any observation, using (what looks to me like) superficial-yet-sophisticated-sounding logic to cover up the holes. [ETA: retracted in order to maintain a less hostile tone.]
You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said “no, it doesn’t actually make the prediction you claim it makes” and gave my reasons for believing that (that I am pretty sure are shared by many others as well).
We can talk about those reasons, but I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification. It’s not like you have a theory that is in any way more constraining here.
> And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.
I mean, I expect the observations to be affected by both, of course. That’s one of the key things that makes predictions in the space so messy.
> I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification.
For what it’s worth, I explicitly clarified that you were not consciously doing this, in my view. My main point is to notice that it seems really hard to pin down what you actually predict will happen in this situation.
> You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said “no, it doesn’t actually make the prediction you claim it makes” and gave my reasons for believing that
I don’t think what you said really counts as a “correction” so much as a counter-argument. I think it’s reasonable to have disagreements about what a theory predicts. The more vague a theory is (and in this case it seems pretty vague), the less you can reasonably claim someone is objectively wrong about what the theory predicts, since there seems to be considerable room for ambiguity about the structure of the theory. As far as I can tell, none of the reasoning in this thread has been at a level of precision that warrants high confidence in what particular theories of scheming do or do not predict, in the absence of further specification.