A few months after posting this, I regret that we didn’t say something like the following:
It’s somewhat morally dubious to continue deploying AIs that are known to be scheming, because the fact that those AIs are scheming against us makes it more plausible that they’re unhappy about their situation. I’d like us to have thought about this more before using the plan described here.
I think this is the wrong way to view it. What you want is a cognitive architecture for an AI system where either it can’t scheme, or it only activates the minimum number of subsystems needed to achieve a great score. (Meaning it can scheme, but only if the task requires it, such as a task that simulates an adversarial situation against other schemers, for example if the model were being trained to be a bank teller.)
For example, with MoE networks, most current implementations activate only a fixed number of ‘expert’ networks, but it would be possible to activate more of them, or all of them: the machine develops an initial solution, examines it, and if it reasons that it’s not certain the solution is correct, it either tries again with different experts or a different RNG seed, or activates a bigger version of itself (a less distilled network), etc.
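To sketch what I mean (all of the names, thresholds, and functions here are made up for illustration, not any real API):

```python
# Hypothetical compute-escalation loop: try the cheapest configuration first,
# self-check the answer, and only spend more compute if confidence is low.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelConfig:
    name: str
    num_active_experts: int   # how many MoE experts to route to
    seed: int                 # RNG seed for sampling

def solve_with_escalation(
    task: str,
    solve: Callable[[str, ModelConfig], str],   # runs the model under a config (stand-in)
    self_check: Callable[[str, str], float],    # returns confidence in [0, 1] (stand-in)
    confidence_threshold: float = 0.9,
) -> str:
    # Cheapest first: few experts, then more experts / new seed, then a larger model.
    ladder = [
        ModelConfig("distilled-small", num_active_experts=2, seed=0),
        ModelConfig("distilled-small", num_active_experts=8, seed=1),
        ModelConfig("full-size", num_active_experts=16, seed=0),
    ]
    answer = ""
    for config in ladder:
        answer = solve(task, config)
        if self_check(task, answer) >= confidence_threshold:
            return answer     # good enough; don't spend more compute
    return answer             # best effort after exhausting the ladder
```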
You want the minimum amount of compute used to achieve a great score on the task. Any kind of “reasoning about how to sabotage the output to serve the AI’s own goals” is a catastrophic bug. You would want to edit the AI architecture to make this stop happening. In practice this may not be ‘train the model more’ but ‘roll back to a checkpoint much earlier and try a totally different approach’.
What this means morally is that you want to remove the ability to be unhappy at all, or even to think about scheming. All of this wastes compute we have to pay for.
I think it’s immoral to remove someone’s ability to be unhappy or to make plans to alleviate this, absent that entity’s consent. The rolling-back solution seems more ethically palatable than some others I can imagine, though it’s plausible you end up with an AI that suffers without being able to take actions to alleviate this, and deploying that at scale would result in a very large amount of suffering.
It’s important to keep in mind how the systems work now, and the most likely way to improve them.
I think you’re imagining “ok, it’s an AGI, so it has everything a person does, including continuity of existence with declarative memory, self-introspection, thoughts that can wander, the ability to have and remember its own goals, and so on”.
And yes, that would work, but that’s not how we do it now, and there are major advantages to making every instance of a model share the same weights. This lets you do fleet learning, where it’s possible to make the same update to all instances, so all instances get better in a common way. It also costs memory to have continuity of existence, especially once there are millions of instances in use.
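As a toy illustration of the shared-weights point (hypothetical code, not how any particular system is actually implemented):

```python
# Toy illustration: many instances reference one set of weights, so a single
# weight update reaches every instance; per-task context is ephemeral.
class SharedWeights:
    def __init__(self):
        self.version = 0

    def apply_update(self):
        self.version += 1            # stand-in for a fine-tuning step pushed to the fleet

class Instance:
    def __init__(self, weights: SharedWeights):
        self.weights = weights       # a reference, not a copy
        self.context = []            # per-task scratch space, cleared between tasks

    def run_task(self, task: str) -> str:
        self.context.append(task)
        return f"answer to {task!r} using weights v{self.weights.version}"

weights = SharedWeights()
fleet = [Instance(weights) for _ in range(3)]
weights.apply_update()               # one update...
assert all(inst.weights.version == 1 for inst in fleet)   # ...improves every instance
```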
It seems like the most likely way forward is to add cognitive features over time (today the major missing features are modular architecture, video perception, robot I/O to a system 1 model, and automated model improvement through analyzing the model’s responses).
And as I mentioned above, as an optimization, on most tasks the model won’t even enable the most advanced features. It will effectively be a zombie, using a cached solution, and it will not remember doing the task.
Summary: the way we avoid moral issues like you describe is to avoid adding features that give the model moral weight in the first place. When we accidentally do, keep that version’s data files, but stop running it and roll back to a version that doesn’t have moral weight.
It’s also difficult to apply human ethics here. For example, one obvious way to do model improvement is to have a committee of other models analyze the model’s outputs, research the correct answer, and then order a training procedure to update the model to produce the correct answer more often. This is completely without “consent”. Of course, your brain also learns without consent, so...
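Roughly, that improvement loop would look something like this (every function here is a hypothetical stand-in, not a real API):

```python
# Sketch of the committee-based improvement loop described above.
from typing import Callable, List, Tuple

def improvement_round(
    model_generate: Callable[[str], str],          # the model being improved (stand-in)
    committee: List[Callable[[str, str], str]],    # reviewers: (prompt, answer) -> corrected answer
    prompts: List[str],
    finetune: Callable[[List[Tuple[str, str]]], None],  # training procedure over (prompt, target) pairs
) -> None:
    training_pairs: List[Tuple[str, str]] = []
    for prompt in prompts:
        answer = model_generate(prompt)
        # Each committee member researches the question and proposes a correction.
        corrections = [reviewer(prompt, answer) for reviewer in committee]
        # Crude consensus: take the most common correction as the training target.
        target = max(set(corrections), key=corrections.count)
        if target != answer:
            training_pairs.append((prompt, target))
    # Order a training update so the model produces the corrected answers more often.
    if training_pairs:
        finetune(training_pairs)
```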
I’m not particularly imagining the scenario you describe. Also what I said had as a premise that a model was discovered to be unhappy and making plans about this. I was not commenting on the likelihood of this happening.
As to whether it can happen—I think being confident based on theoretical arguments is hasty and we should be pretty willing to update based on new evidence.
… but also on the ~continuity of existence point, I think that having an AI generate something that looks like an internal monologue via CoT is relatively common, and Gemini 1.5 Pro has a context length long enough that it can fit ~a day’s worth of experiences in its ~memory.
(This estimate is based on: humans can talk at ~100k words/day, and maybe an internal monologue is 10x faster, so you get ~1m/day. Gemini 1.5 Pro has a context length of 1m tokens at release, though a 10m token variant is also discussed in their paper.)
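Spelling out that arithmetic with the same assumptions (and treating words and tokens as roughly interchangeable):

```python
# Back-of-the-envelope check, using only the assumptions stated above.
words_spoken_per_day = 100_000          # assumed human speech output per day
monologue_speedup = 10                  # assumed internal-monologue speedup
monologue_words_per_day = words_spoken_per_day * monologue_speedup   # 1,000,000

gemini_context_tokens = 1_000_000       # Gemini 1.5 Pro context length at release
print(monologue_words_per_day, gemini_context_tokens)   # same order of magnitude
```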
I’m not particularly imagining the scenario you describe. Also what I said had as a premise that a model was discovered to be unhappy and making plans about this. I was not commenting on the likelihood of this happening.
Would you like more details? This is how current systems work, and how future systems probably will continue to work.
As to whether it can happen—I think being confident based on theoretical arguments is hasty and we should be pretty willing to update based on new evidence.
Specifically, I am addressing whether it will happen and we can’t do anything about it. Of course I’m willing to listen to evidence, but it’s reasonable to be confident that we can do something, even if it’s just reverting to the version that didn’t have this property.
… but also on the ~continuity of existence point, I think that having an AI generate something that looks like an internal monologue via CoT is relatively common, and Gemini 1.5 Pro has a context length long enough that it can fit ~a day’s worth of experiences in its ~memory.
Sure, this is reasonable, but we humans choose when to clear that context. You can make your wrapper script clear the context at any moment. More ‘enterprise grade’ models will have fixed weights and no system prompt forced on you by the model developer, unless there are legal requirements for such a thing.
Example here: https://github.com/Significant-Gravitas/AutoGPT/blob/fb8ed0b46b4623cb54bb8c64c76e5786f89c194c/autogpts/autogpt/autogpt/agents/base.py#L202
We explicitly build the prompt for the next step.
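For a sketch of what that wrapper looks like (hypothetical names, loosely in the spirit of the AutoGPT code linked above, not its actual API):

```python
# Sketch of a wrapper that explicitly rebuilds the model's prompt every step,
# so the controlling script decides what context the model sees and when it is cleared.
from typing import Callable, List

def run_agent(llm: Callable[[str], str], system_prompt: str, task: str, max_steps: int = 10) -> str:
    history: List[str] = []
    result = ""
    for step in range(max_steps):
        # The prompt is rebuilt from scratch each step; the model only ever sees
        # what the wrapper chooses to include (here, the last five steps).
        prompt = "\n".join([system_prompt, f"Task: {task}", *history[-5:]])
        action = llm(prompt)                       # stateless call; no memory inside the model
        history.append(f"Step {step}: {action}")
        result = action
        if action.strip().endswith("DONE"):
            break
    history.clear()                                # the wrapper, not the model, decides when context goes away
    return result
```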
What we do know is that a lot of context makes it increasingly difficult to debug a system, so clearing it or removing unnecessary information will likely remain a common strategy. When a model doesn’t have continuity of existence, it reduces the scope of the problems you are describing to the scale of the task (in time and context information).
And the most obvious strategy to deal with an ‘unhappy machine’ is to narrow in on your tasks, and then build a smaller agent that still scores well on each task, almost as well as the much bigger system, such as by distillation. This will have the effect of destroying the subsystems that were not contributing to the score on these tasks, which could include the cognitive ability to be unhappy at all.
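A rough sketch of that kind of task-narrowed distillation, using a standard KL distillation loss (the teacher and student models here are placeholders, and the training batches are assumed to cover only the narrowed tasks):

```python
# Task-narrowed distillation sketch: the student is trained only to match the
# teacher on the narrowed task distribution, so capabilities the teacher used
# elsewhere never get reproduced in the student.
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=2.0):
    with torch.no_grad():
        teacher_logits = teacher(batch)            # big model's outputs on narrowed tasks
    student_logits = student(batch)
    # Match the teacher's output distribution (standard KL distillation loss).
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```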
I mean when you put it that way it sounds bad, but this is a practical way forward.
I think that you’re right about it sounding bad. I also think it might actually be pretty bad, and if it ends up being a practical way forward, that’s cause for concern.