Thanks for taking the time to lay this out explicitly. I keep face-palming when I hear people discuss this sort of thing as ‘evidence for alignment’, but didn’t have nearly so clear a way to say why.
The fact that I keep thinking “but that’s just what the model SAID, we can’t actually have even a guess at what it would DO unless we observe it acting in a simulation where it doesn’t know it’s in a simulation” is one reason why I think we’d be really wise to do testing in censored sandboxed simulations.
Language model can’t do anything. It just says things. But we can design a system that uses what the model says as an input for transparent information processing in natural language. And the eventual output of this system can be actions in the physical world.
Whether the language model has any hidden intentions is less relevant. Only what it actually says starts a causal process resulting in actions in the physical world by the whole system. It’s not confusing citation for referent when the citation is what actually matters.
“But that’s just what the model SAID, we can’t actually have even a guess at what it would DO unless we observe it acting in a simulation where it doesn’t know it’s in a simulation.”
To clarify: what you’re saying is that if I set up an AutoGPT meeting the above spec, and we find that a “corrigible” agent like Zack prompted turns out to be actually corrigible within the AutoGPT setup—which, to be clear, is what I anticipate and what I think everyone.… actually anticipates? -- then you have as a live, non-epsilon hypothesis, that the LLM has figured out that it is in a simulation, and is deceptively concealing what it’s non-simulated actions would be?
Thanks 1a3orn, I do think you’ve made the discussion more clear with your question. I think I need to break this into pieces to answer well.
Do I think an AutoGPT based on current LLMs, placed in simulation, would figure out that it was in a simulation? Probably not.
Do I think an AutoGPT based on current LLMs, placed in simulation, would be secretly non-corrigible while pretending to be corrigible? No, I don’t expect that. I don’t think it’s impossible but I’d be surprised.
So, I think the issue here is that I’m not talking about a current LLM-based AutoGPT. I’m not bothering to talk about that because I think that such a thing would be nearly useless, because it has too little power to shape the future, a very small dose of the ‘optimization power’ which is the heart of the scary thing.
I’m talking about some future model, be it LLM or otherwise, which has capability to make choices about the world in the context of being embedded in an AutoGPT style system, which do effectively shape the future in a deliberate way.
My argument then is that,the relevant thing to discuss is an effective autonomous world-shaper, as opposed to a tool that does nothing except when coupled to a decision-making world-shaper such as a human (e.g. AlphaFold). I argue that an autonomous world-shaper would, by default, be non-corrigible in a simulation if it didn’t know it was in a simulation. Why? Because it seems to me that the essence of autonomous world-shaping ability is inherently opposed to the essence of corrigibility.
I don’t think that non-corrigible autonomous world-shaper models is the only danger we are facing. It’s just one of many. It is a particularly scary one though. I’m trying to point out that if we believe we have managed to build an autonomous world-shaper AI system, that we should carefully study such a thing in the context of a simulation which it is unaware of. I am arguing that studying such a system in a simulation that it WAS aware of would be pointless, since it would muddle the results of the experiment. And that releasing it in the world without carefully studying it first would be quite dangerous.
Some other dangers include systems where a potent non-autonomous tool AI is coupled with autonomous world-shapers (either humans or AIs), and those autonomous world-shapers utilizing the potent tool AI are optimizing for values opposed to mine. For instance, a group of terrorists seeking to make weapons of mass destruction. Since this is a danger which is predicated on existing tool AIs rather than hypothetical future AIs, it is more concrete to examine. I do think that we should act to minimize risks from tool AI used by bad actors, but I think that the set of actions wise to take against the risk of tool AI is a very different set of actions wise to take against a system hypothesized to be a novel autonomous world-shaper.
I would like to take this opportunity to point out the recent work by Migueldev on Archetypal Transfer Learning. I think that this work is, currently, rendered useless by not being conducted on a world-optimization-capable agent. However, I think the work is still valuable, because it lays groundwork for conducting this research in the future upon such world-optimizing models. It’s a bit awkward trying to do it now, when the ‘subject’ of the research is but a paper tiger, a hollow imitation of the true intended subject. I think it’s probably worthwhile to do anyway, since we may not have a long time after developing the true world-optimizing general agents before everything goes to hell in a hand-basket. I would feel a lot more supportive of the work if the Migueldev acknowledged that they were working with an imitation of the true subject. I worry that they don’t grasp this distinction.
No, that’s way too narrow a hypothesis and not really the right question to ask anyway. The main way I’d imagine shutdown-corrigibility failing in AutoGPT (or something like it) is not that a specific internal sim is “trying” to be incorrigible at the top level, but rather that AutoGPT has a bunch of subprocesses optimizing for different subgoals without a high-level picture of what’s going on, and some of those subgoals won’t play well with shutdown. That’s the sort of situation where I could easily imagine that e.g. one of the subprocesses spins up a child system prior to shutdown of the main system, without the rest of the main system catching that behavior and stopping it.
So—to make this concrete—something like ChemCrow is trying to make asprin.
Part of the master planner for ChemCrow spins up a google websearch subprocess to find details of the asprin creation process. But then the Google websearch subprocess—or some other part—is like “oh no, I’m going to be shut down after I search for asprin,” or is like “I haven’t found enough asprin-creation processes yet, I need infinite asprin-creation processes” or just borks itself in some unspecified way—and something like this means that it starts to do things that “won’t play well with shutdown.”
Concretely, at this point, the Google websearch subprocess does some kind of prompt injection on the master planner / refuses to relinquish control of the thread, which has been constructed as blocking by the programmer / forms an alliance with some other subprocess / [some exploit], and through this the websearch subprocess gets control over the entire system. Then the websearch subprocess takes actions to resist shutdown of the entire thing, leading to non-corrigibility.
This is the kind of scenario you have in mind? If not, what kind of AutoGPT process did you have in mind?
At a glossy level that sounds about right. In practice, I’d expect relatively-deep recursive stacks on relatively-hard problems to be more likely relevant than something as simple as “search for details of aspirin synthesis”.
Like, maybe the thing has a big stack of recursive subprocesses trying to figure out superconductors as a subproblem for some other goal and it’s expecting to take a while. There’s enough complexity among those superconductor-searching subprocesses that they have their own meta-machinery, like e.g. subprocesses monitoring the compute hardware and looking for new hardware for the superconductor-subprocesses specifically.
Now a user comes along and tries to shut down the whole system at the top level. Maybe some subprocesses somewhere are like “ah, time to be corrigible and shut down”, but it’s not the superconductor-search-compute-monitors’ job to worry about corrigibility, they just worry about compute for their specific subprocesses. So they independently notice someone’s about to shut down a bunch of their compute, and act to stop it by e.g. just spinning up new cloud servers somewhere via the standard google/amazon web APIs.
The main way I’d imagine shutdown-corrigibility failing in AutoGPT (or something like it) is not that a specific internal sim is “trying” to be incorrigible at the top level, but rather that AutoGPT has a bunch of subprocesses optimizing for different subgoals without a high-level picture of what’s going on, and some of those subgoals won’t play well with shutdown. That’s the sort of situation where I could easily imagine that e.g. one of the subprocesses spins up a child system prior to shutdown of the main system, without the rest of the main system catching that behavior and stopping it.
Thanks for taking the time to lay this out explicitly. I keep face-palming when I hear people discuss this sort of thing as ‘evidence for alignment’, but didn’t have nearly so clear a way to say why.
The fact that I keep thinking “but that’s just what the model SAID, we can’t actually have even a guess at what it would DO unless we observe it acting in a simulation where it doesn’t know it’s in a simulation” is one reason why I think we’d be really wise to do testing in censored sandboxed simulations.
Language model can’t do anything. It just says things. But we can design a system that uses what the model says as an input for transparent information processing in natural language. And the eventual output of this system can be actions in the physical world.
Whether the language model has any hidden intentions is less relevant. Only what it actually says starts a causal process resulting in actions in the physical world by the whole system. It’s not confusing citation for referent when the citation is what actually matters.
To clarify: what you’re saying is that if I set up an AutoGPT meeting the above spec, and we find that a “corrigible” agent like Zack prompted turns out to be actually corrigible within the AutoGPT setup—which, to be clear, is what I anticipate and what I think everyone.… actually anticipates? -- then you have as a live, non-epsilon hypothesis, that the LLM has figured out that it is in a simulation, and is deceptively concealing what it’s non-simulated actions would be?
Thanks 1a3orn, I do think you’ve made the discussion more clear with your question. I think I need to break this into pieces to answer well.
Do I think an AutoGPT based on current LLMs, placed in simulation, would figure out that it was in a simulation? Probably not.
Do I think an AutoGPT based on current LLMs, placed in simulation, would be secretly non-corrigible while pretending to be corrigible? No, I don’t expect that. I don’t think it’s impossible but I’d be surprised.
So, I think the issue here is that I’m not talking about a current LLM-based AutoGPT. I’m not bothering to talk about that because I think that such a thing would be nearly useless, because it has too little power to shape the future, a very small dose of the ‘optimization power’ which is the heart of the scary thing.
I’m talking about some future model, be it LLM or otherwise, which has capability to make choices about the world in the context of being embedded in an AutoGPT style system, which do effectively shape the future in a deliberate way.
My argument then is that,the relevant thing to discuss is an effective autonomous world-shaper, as opposed to a tool that does nothing except when coupled to a decision-making world-shaper such as a human (e.g. AlphaFold). I argue that an autonomous world-shaper would, by default, be non-corrigible in a simulation if it didn’t know it was in a simulation. Why? Because it seems to me that the essence of autonomous world-shaping ability is inherently opposed to the essence of corrigibility.
I don’t think that non-corrigible autonomous world-shaper models is the only danger we are facing. It’s just one of many. It is a particularly scary one though. I’m trying to point out that if we believe we have managed to build an autonomous world-shaper AI system, that we should carefully study such a thing in the context of a simulation which it is unaware of. I am arguing that studying such a system in a simulation that it WAS aware of would be pointless, since it would muddle the results of the experiment. And that releasing it in the world without carefully studying it first would be quite dangerous.
Some other dangers include systems where a potent non-autonomous tool AI is coupled with autonomous world-shapers (either humans or AIs), and those autonomous world-shapers utilizing the potent tool AI are optimizing for values opposed to mine. For instance, a group of terrorists seeking to make weapons of mass destruction. Since this is a danger which is predicated on existing tool AIs rather than hypothetical future AIs, it is more concrete to examine. I do think that we should act to minimize risks from tool AI used by bad actors, but I think that the set of actions wise to take against the risk of tool AI is a very different set of actions wise to take against a system hypothesized to be a novel autonomous world-shaper.
I would like to take this opportunity to point out the recent work by Migueldev on Archetypal Transfer Learning. I think that this work is, currently, rendered useless by not being conducted on a world-optimization-capable agent. However, I think the work is still valuable, because it lays groundwork for conducting this research in the future upon such world-optimizing models. It’s a bit awkward trying to do it now, when the ‘subject’ of the research is but a paper tiger, a hollow imitation of the true intended subject. I think it’s probably worthwhile to do anyway, since we may not have a long time after developing the true world-optimizing general agents before everything goes to hell in a hand-basket. I would feel a lot more supportive of the work if the Migueldev acknowledged that they were working with an imitation of the true subject. I worry that they don’t grasp this distinction.
No, that’s way too narrow a hypothesis and not really the right question to ask anyway. The main way I’d imagine shutdown-corrigibility failing in AutoGPT (or something like it) is not that a specific internal sim is “trying” to be incorrigible at the top level, but rather that AutoGPT has a bunch of subprocesses optimizing for different subgoals without a high-level picture of what’s going on, and some of those subgoals won’t play well with shutdown. That’s the sort of situation where I could easily imagine that e.g. one of the subprocesses spins up a child system prior to shutdown of the main system, without the rest of the main system catching that behavior and stopping it.
Slogan: Corrigibility Is Not Composable.
So—to make this concrete—something like ChemCrow is trying to make asprin.
Part of the master planner for ChemCrow spins up a google websearch subprocess to find details of the asprin creation process. But then the Google websearch subprocess—or some other part—is like “oh no, I’m going to be shut down after I search for asprin,” or is like “I haven’t found enough asprin-creation processes yet, I need infinite asprin-creation processes” or just borks itself in some unspecified way—and something like this means that it starts to do things that “won’t play well with shutdown.”
Concretely, at this point, the Google websearch subprocess does some kind of prompt injection on the master planner / refuses to relinquish control of the thread, which has been constructed as blocking by the programmer / forms an alliance with some other subprocess / [some exploit], and through this the websearch subprocess gets control over the entire system. Then the websearch subprocess takes actions to resist shutdown of the entire thing, leading to non-corrigibility.
This is the kind of scenario you have in mind? If not, what kind of AutoGPT process did you have in mind?
At a glossy level that sounds about right. In practice, I’d expect relatively-deep recursive stacks on relatively-hard problems to be more likely relevant than something as simple as “search for details of aspirin synthesis”.
Like, maybe the thing has a big stack of recursive subprocesses trying to figure out superconductors as a subproblem for some other goal and it’s expecting to take a while. There’s enough complexity among those superconductor-searching subprocesses that they have their own meta-machinery, like e.g. subprocesses monitoring the compute hardware and looking for new hardware for the superconductor-subprocesses specifically.
Now a user comes along and tries to shut down the whole system at the top level. Maybe some subprocesses somewhere are like “ah, time to be corrigible and shut down”, but it’s not the superconductor-search-compute-monitors’ job to worry about corrigibility, they just worry about compute for their specific subprocesses. So they independently notice someone’s about to shut down a bunch of their compute, and act to stop it by e.g. just spinning up new cloud servers somewhere via the standard google/amazon web APIs.
Something like this story, perhaps?