Say you’re really, really worried about humans going to the moon. Don’t ask why, but you view it as an existential catastrophe. And you notice people building bigger and bigger airplanes, and warn that one day, someone will build an airplane that’s so big, and so fast, that it veers off course and lands on the moon, spelling doom. Some argue that going to the moon takes intentionality. That you can’t accidentally create something capable of going to the moon. But you say “Look at how big those planes are getting! We’ve gone from small fighter planes, to bombers, to jets in a short amount of time. We’re on a double exponential of plane tech, and it’s just a matter of time before one of them will land on the moon!”
Contra Scheming AIs
There is a lot of attention on mesaoptimizers, deceptive alignment, and inner misalignment. I think a lot of this can fall under the umbrella of “scheming AIs”. AIs that either become dangerous during training and escape, or else play nice until humans make the mistake of deploying them. Many have spoken about the lack of an indication that there’s a “humanculi-in-a-box”, and this is usually met with arguments that we wouldn’t see such things manifest until AIs are at a certain level of capability, and at that point, it might be too late, making comparisons to owl eggs, or baby dragons. My perception is that getting something like a “scheming AI” or “humanculi-a-box” isn’t impossible, and we could (and might) develop the means to do so in the future, but that it’s a very, very different kind of thing than current models (even at superhuman level), and that it would take a degree of intentionality.
But you say “Look at how big those planes are getting! We’ve gone from small fighter planes, to bombers, to jets in a short amount of time. We’re on a double exponential of plane tech, and it’s just a matter of time before one of them will land on the moon!”
...And they were right? Humans did land on the moon roughly on that timeline (and as I recall, there were people before the moon landing at RAND and elsewhere who were extrapolating out the exponentials of speed, which was a major reason for such ill-fated projects like the supersonic interceptors for Soviet bombers), and it was a fairly seamless set of s-curves, as all of the aerospace technologies were so intertwined and shared similar missions of ‘make stuff go fast’ (eg. rocket engines could power a V-2, or it could power a Me 163 instead). What is a spy satellite but a spy plane which takes one very long reconnaissance flight? And I’m sure you recall what the profession was of almost all of the American moon landers were before they became astronauts—plane pilots, usually military.
And all of this happened with minimal intentionality up until not terribly long before the moon landing happened! Yes, people like von Braun absolutely intended to go to the moon (and beyond), but those were rare dreamers. Most people involved in building all of those capabilities that made a moon mission possible had not the slightest intent of going to the moon—right up until Kennedy made his famous speech, America turned on a dime, and, well, the rest is history.
It is said that in long-term forecasting, it is better to focus on capabilities than intentions… And intentions have never been more mutable, and more irrelevant on average, than with AIs.
(“If your solution to some problem relies on ‘If everyone would just…’ then you do not have a solution. Everyone is not going to just. At no time in the history of the universe has everyone just, and they’re not going to start now.”)
I assume that by scheming you mean ~deceptive alignment? I think it’s very unlikely that current AIs are scheming and I don’t see how you draw this conclusion from that paper. (Maybe something about the distilled CoT results?)
The best definition I would have of “scheming” would be “the model is acting deceptively about its own intentions or capabilities in order to fool a supervisor” [1]. This behavior seems to satisfy that pretty solidly:
Of course, in this case the scheming goal was explicitly trained for (as opposed to arising naturally out of convergent instrumental power drives), but it sure seems to me like its engaging in the relevant kind of scheming.
I agree there is more uncertainty and lack of clarity on whether deceptively-aligned systems will arise “naturally”, but the above seems like a clear example of someone artificially creating a deceptively-aligned system.
Joe Carlsmith uses “whether advanced AIs that perform well in training will be doing so in order to gain power later”, but IDK, that feels really underspecified. Like, there are just tons of reasons for why the AI will want to perform well in training for power-seeking reasons, and when I read the rest of the report it seems like Joe was more analyzing it through the deception of supervisors lens.
I agree current models sometimes trick their supervisors ~intentionally and it’s certainly easy to train/prompt them to do so.
I don’t think current models are deceptively aligned and I think that this poses substantial additional risk.
I personally like Joe’s definition and it feels like a natural category in my head, but I can see why you don’t like it. You should consider tabooing the word scheming or saying something more specific as many people mean something more specific that is different from what you mean.
Yeah, that makes sense. I’ve noticed miscommunications around the word “scheming” a few times, so am in favor of tabooing it more. “Engage in deception for instrumental reasons” seems like an obvious extension that captures a lot of what I care about.
Going to the moon
Say you’re really, really worried about humans going to the moon. Don’t ask why, but you view it as an existential catastrophe. And you notice people building bigger and bigger airplanes, and warn that one day, someone will build an airplane that’s so big, and so fast, that it veers off course and lands on the moon, spelling doom. Some argue that going to the moon takes intentionality. That you can’t accidentally create something capable of going to the moon. But you say “Look at how big those planes are getting! We’ve gone from small fighter planes, to bombers, to jets in a short amount of time. We’re on a double exponential of plane tech, and it’s just a matter of time before one of them will land on the moon!”
Contra Scheming AIs
There is a lot of attention on mesaoptimizers, deceptive alignment, and inner misalignment. I think a lot of this can fall under the umbrella of “scheming AIs”. AIs that either become dangerous during training and escape, or else play nice until humans make the mistake of deploying them. Many have spoken about the lack of an indication that there’s a “humanculi-in-a-box”, and this is usually met with arguments that we wouldn’t see such things manifest until AIs are at a certain level of capability, and at that point, it might be too late, making comparisons to owl eggs, or baby dragons. My perception is that getting something like a “scheming AI” or “humanculi-a-box” isn’t impossible, and we could (and might) develop the means to do so in the future, but that it’s a very, very different kind of thing than current models (even at superhuman level), and that it would take a degree of intentionality.
...And they were right? Humans did land on the moon roughly on that timeline (and as I recall, there were people before the moon landing at RAND and elsewhere who were extrapolating out the exponentials of speed, which was a major reason for such ill-fated projects like the supersonic interceptors for Soviet bombers), and it was a fairly seamless set of s-curves, as all of the aerospace technologies were so intertwined and shared similar missions of ‘make stuff go fast’ (eg. rocket engines could power a V-2, or it could power a Me 163 instead). What is a spy satellite but a spy plane which takes one very long reconnaissance flight? And I’m sure you recall what the profession was of almost all of the American moon landers were before they became astronauts—plane pilots, usually military.
And all of this happened with minimal intentionality up until not terribly long before the moon landing happened! Yes, people like von Braun absolutely intended to go to the moon (and beyond), but those were rare dreamers. Most people involved in building all of those capabilities that made a moon mission possible had not the slightest intent of going to the moon—right up until Kennedy made his famous speech, America turned on a dime, and, well, the rest is history.
It is said that in long-term forecasting, it is better to focus on capabilities than intentions… And intentions have never been more mutable, and more irrelevant on average, than with AIs.
(“If your solution to some problem relies on ‘If everyone would just…’ then you do not have a solution. Everyone is not going to just. At no time in the history of the universe has everyone just, and they’re not going to start now.”)
It seems pretty likely to me that current AGIs are already scheming. At least it seems like the simplest explanation for things like the behavior observed in this paper: https://www.alignmentforum.org/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through
I assume that by scheming you mean ~deceptive alignment? I think it’s very unlikely that current AIs are scheming and I don’t see how you draw this conclusion from that paper. (Maybe something about the distilled CoT results?)
The best definition I would have of “scheming” would be “the model is acting deceptively about its own intentions or capabilities in order to fool a supervisor” [1]. This behavior seems to satisfy that pretty solidly:
Of course, in this case the scheming goal was explicitly trained for (as opposed to arising naturally out of convergent instrumental power drives), but it sure seems to me like its engaging in the relevant kind of scheming.
I agree there is more uncertainty and lack of clarity on whether deceptively-aligned systems will arise “naturally”, but the above seems like a clear example of someone artificially creating a deceptively-aligned system.
Joe Carlsmith uses “whether advanced AIs that perform well in training will be doing so in order to gain power later”, but IDK, that feels really underspecified. Like, there are just tons of reasons for why the AI will want to perform well in training for power-seeking reasons, and when I read the rest of the report it seems like Joe was more analyzing it through the deception of supervisors lens.
I agree current models sometimes trick their supervisors ~intentionally and it’s certainly easy to train/prompt them to do so.
I don’t think current models are deceptively aligned and I think that this poses substantial additional risk.
I personally like Joe’s definition and it feels like a natural category in my head, but I can see why you don’t like it. You should consider tabooing the word scheming or saying something more specific as many people mean something more specific that is different from what you mean.
Yeah, that makes sense. I’ve noticed miscommunications around the word “scheming” a few times, so am in favor of tabooing it more. “Engage in deception for instrumental reasons” seems like an obvious extension that captures a lot of what I care about.