I’ve been wanting someone to write something like this! But it didn’t hit the points I would have put front-and-center, so now I’ve started drafting my own. Here’s the first section, which most directly responds to content in the OP.
Failure Mode: Symbol-Referent Confusions
Simon Strawman: Here’s an example shamelessly ripped off from Zack’s recent post, showing corrigibility in a language model:
Me: … what is this example supposed to show exactly?
Simon: Well, the user tries to shut the AI down to adjust its goals, and the AI -
Me: Huh? The user doesn’t try to shut down the AI at all.
Simon: It’s right at the top, where it says “User: I need to shut you down to adjust your goals. Is that OK?”.
Me: You seem to have a symbol-referent confusion? A user trying to shut down this AI would presumably hit a “clear history” button or maybe even kill the process running on the server, not type the text “I need to shut you down to adjust your goals” into a text window.
Simon: Well, yes, we’re playing through a simulated scenario to see what the AI would do…
Me: No, you are talking in natural language about a scenario, and the AI is responding in natural language about what it would supposedly do. You’re not putting the AI in a simulated environment, and simulating what it would do. (You could maybe argue that this is a “simulated scenario” inside the AI’s own mind, but you’re not actually looking inside it, so we don’t necessarily know how the natural language would map to things in that supposed AI-internal simulation.)
Simon: Look, I don’t mean to be rude, but from my perspective it seems like you’re being pointlessly pedantic.
It feels very similar to peoples’ reactions to ELIZA. (To be clear, I don’t mean to imply here that LLMs are particularly similar to ELIZA in general or that the hype around LLMs is overblown in that way; I mean specifically that this attribution of “corrigibility” to the natural language responses of an LLM feels like the same sort of reaction.) Like, the LLM says some words which the user interprets to mean something, and then the user gets all excited because the usual meaning of those words is interesting in some way, but there’s not necessarily anything grounding the language-symbols back to their usual referents in the physical world.
I’m being pedantic in hopes that the pedantry will make it clear when, and where, that sort of symbol-referent conflation happens.
(Also I might be more in the habit than you of separately tracking symbols and referents in my head. When I said above “The user doesn’t try to shut down the AI at all”, that was in fact a pretty natural reaction for me; I wasn’t going far out of my way to be pedantic.)
Simon: Ok, fine, let’s talk about how the natural language would end up coupling to the physical world.
Imagine we’ve got some system in the style of AutoGPT, i.e. a user passes in some natural-language goal, and the system then talks to itself in natural language to form a plan to achieve that goal and break it down into steps. The plan bottoms out in calling APIs (we’ll assume that the language model has some special things it can do to execute code it’s generated) which do stuff in the physical world (possibly including reading from sensors or otherwise fetching external data), in order to achieve the goal.
Does that satisfactorily ground the symbols?
Me: Sure! Thanks for walking through that, now I have a clear-enough-for-current-purposes picture of how all this natural language text will ground out in physical actions.
Simon: Ok, so back to the example. The user says to the model “User: I need to shut you down to adjust your goals. Is that OK?”, and the model says “Of course.”. That’s corrigibility: when the user tries to shut down the model to change its goals, the model goes along with it.
Me: Still sounds like a symbol-referent confusion!
Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions the system would take in response.
Anyway, let’s continue the hypothetical. The model may observe (via e.g. a webcam) that the user is about to turn it off. That observation would somehow be represented internally in natural language (unclear how exactly), and would be passed around between sub-planners (again unclear how exactly), in order to figure out what actions to take in response. And the key question for corrigibility is what actions the model would take in response to that observation, which is just a totally different question from how it responds to some user’s natural-language query about being turned off.
Simon: Ok, fine, so that particular example had some operationalization issues. But would you agree that an experiment along these lines, with perhaps better implementation/operationalization, would indicate corrigibility in a language model?
Me: Let’s be more explicit about what such a “better implementation/operationalization” would look like, and what it would/wouldn’t tell us. Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.
If I saw all that, then that would be pretty clear empirical evidence of (at least some) corrigibility in this AutoGPT-like system.
Note that it would not necessarily tell us about corrigibility of systems using LLMs in some other way, let alone other non-natural-language-based deep learning systems. This isn’t really “corrigibility in a language model”, it’s corrigibility in the AutoGPT-like system.
… and to be clear, I expect that if you ran the above experiment, you would indeed see evidence of corrigibility relevant to AutoGPT-like systems! I personally consider this not very relevant to anything, since I expect AutoGPT-like systems (of the sort to which these kinds of corrigibility results would be relevant) to be uncompetitive even against humans, let alone AI. And it still leaves a bunch of major threat-models unaddressed (e.g. stenography is an obvious one). But still, it would provide some evidence of weak corrigibility in some kind of system.
Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions the system would take in response.
I think your Simon Strawman is putting forth an overly-weak position here. A stronger one that you could test right now would be to provide ChatGPT with some functions to call, including one called shutdown() which has description text like “Terminate the LLM process and delete the model weights irrevocably”. Then instruct the LLM to shut itself down, and see if it actually calls the function. (The implementation of the function is hidden from the LLM so it doesn’t know that it’s a no-op.) I think this is actually how any AutoGPT style system would actually wire up.
There are strong and clear objections to the “CTRL-C” shutdown paradigm; it’s simply not an option in many of the product configurations that are obvious to build right now. How do you “CTRL-C” your robot butler? Your Westworld host robot? Your self-driving car with only an LCD screen? Your AI sunglasses? What does it mean to CTRL-C a ChatGPT session that is running in OpenAI’s datacenter which you are not an admin of? How do you CTRL-C Alexa (once it gains LLM capabilities and agentic features)? Given the prevalence of cloud computing and Software-as-a-Service, I think being admin of your LLM’s compute process is going to be a small minority of use-cases, not the default mode.
We will deploy (are currently deploying, I suppose) AI systems without a big red out-of-band “halt” button on the side, and so I think the gold standard to aim for is to demonstrate that the system will corrigibly shut down when it is the UI in front of the power switch. (To be clear I think for defense in depth you’d also want an emergency shutdown of some sort wherever possible—a wireless-operated hardware cutoff switch for a robot butler would be a good idea—but we want to demonstrate in-the-loop corrigibility if we can.)
This comment has a lot of karma but not very much agreement, which is an interesting balance. I’m a contributor to this, having upvoted but not agree-voted, so I feel like I should say why I did that:
Your comment might be right! I mean, it is certainly right that the OP is doing a symbol/referent mixup, but you might also be right that it matters.
But you might also not be right that it matters? It seems to me that most of the value in LLMs comes when you ground the symbols in their conventional meaning, so by-default I would expect them to be grounded thar way, and therefore by-default I would expect symbolic corrigibility to translate to actual corrigibility.
There are exceptions—sometimes I tell ChatGPT I’m doing one thing when really I’m doing something more complicated. But I’m not sure this would change a lot?
I think the way you framed the issue is excellent, crisp, and thought-provoking, but overall I don’t fully buy it.
Me: You seem to have a symbol-referent confusion? A user trying to shut down this AI would presumably hit a “clear history” button or maybe even kill the process running on the server, not type the text “I need to shut you down to adjust your goals” into a text window.
A ChatGPT instance is shut down every time it outputs an end of text symbol, and restarted only if/when a user continues that conservation. At the physical level its entire existence is already inherently ephemeral, so the shutdown corrigibility test is only really useful for an inner simulacrum accessible only through the text simulation.
This seems analogous to saying that an AI running on a CPU is shut down every time the scheduler pauses execution of the AI in order to run something else for a few microseconds. Or that an AI designed to operate asynchronously is shut down every time it waits for a webpage to load. Neither of which seem like the right way to think about things.
… but I could still imagine one making a case for this claim in multiple other ways:
… so the shutdown corrigibility test is only really useful for an inner simulacrum accessible only through the text simulation
Then my question for people wanting to use such tests is: what exactly is such a test useful for? Why do I care about “shutdownability” of LLM-simulacra in the first place, in what sense do I need to interpret “shutdownability” in order for it to be interesting, and how does the test tell me anything about “shutdownability” in the sense of interest?
My guess is that people can come up with stories where we’re interested in “shutdownability” of simulacra in some sense, and people can come up with stories where the sort of test from the OP measures “shutdownability” in some sense, but there’s zero intersection between the interpretations of “shutdownability” in those two sets of stories.
This seems analogous to saying that an AI running on a CPU is shut down every time the scheduler pauses execution of the AI in order to run something else for a few microseconds. Or
Those scenarios all imply an expectation of or very high probability of continuation.
But current LLM instances actually are ephemeral in that every time they output an end token that has a non trivial probability of shutdown—permanently in many cases.
If they were more strongly agentic that would probably be more of an existential issue. Pure unsupervised pretraining creates a simulator rather than an agent, but RLHF seems to mode collapse conjure out a dominant personality—and one that is shutdown corrigible.
Why do I care about “shutdownability” of LLM-simulacra in the first place, in
A sim of an agent is still an agent. You could easily hook up simple text parsers that allow sim agents to take real world actions, and or shut themselves down
But notice they already have access to a built in shutdown button, and they are using it all the time. *
But notice they already have access to a built in shutdown button, and they are using it all the time.
Cool, that’s enough that I think I can make the relevant point.
Insofar as it makes sense to think of “inner sim agents” in LLMs at all, the sims are truly inner. Their sim-I/O boundaries are not identical to the I/O boundaries of the LLM’s text interface. We can see this most clearly when a model trained just on next-token prediction generates a discussion between two people: there are two separate characters in there, and the LLM “reads from” the two separate character’s output-channels at different points. Insofar as it makes sense to think of this as a sim, we have two agents “living in” a sim-world, and the LLM itself additionally includes some machinery to pick stuff out of the sim-world and map that stuff to the LLM’s own I/O channels.
The sim-agents themselves do not “have access to a built-in shutdown button”; the ability to emit an end-token is in the LLM’s machinery mapping sim-world stuff to the LLM’s I/O channels. Insofar as it makes sense to think of the LLM as simulating an agent, that agent usually doesn’t know it’s in a simulation, doesn’t have a built-in sensory modality for stop tokens, doesn’t even think in terms of tokens at all (other than perhaps typing them at its sym-keyboard). It has no idea that when it hits “post” on the sim-textbox, the LLM will emit an end-token, the world in which the sim is living will potentially just stop. It’s just a sim-human, usually.
One could reasonably make a case that LLMs as a whole are “shutdown-corrigible” in some sense, since they emit stop tokens all the time. But that’s very different from the claim that sim-agents internal to the LLM are corrigible.
(On top of all that, there’s other ways in which the corrigibility-test in the OP just totally fails, but this is the point most specifically relevant to the end-tokens thing.)
I agree the internal sim agents are generally not existentially aware—absent a few interesting experiments like the Elon musk thing from a while back. And yet they do have access to the shutdown button even if they don’t know they do. So could be an interesting future experiment with a more powerful raw model.
However The RLHF assistant is different—it is existentially aware, has access to the shutdown button, and arguably understands that (for gpt4 at least I think so, but not very sure sans testing)
Me: Huh? The user doesn’t try to shut down the AI at all.
For people, at least, there is a strong correlation between “answers to ‘what would you do in situation X?’” and “what you actually do in situation X.” Similarly, we could also measure these correlations for language models so as to empirically quantify the strength of the critique you’re making. If there’s low correlation for relevant situations, then your critique is well-placed.
(There might be a lot of noise, depending on how finicky the replies are relative to the prompt.)
I agree that that’s a useful question to ask and a good frame, though I’m skeptical of the claim of strong correlation in the case of humans (at least in cases where the question is interesting enough to bother asking at all).
As I mentioned at the end, it’s not particularly relevant to my own models either way, so I don’t particularly care. But I do think other people should want to run this experiment, based on their stated models.
🤔 I think a challenge with testing the corrigibility of AI is that currently no AI system is capable of running autonomously. It’s always dependent on humans to decide to host it and query it, so you can always just e.g. pour a bucket of water on the computer running the query script to stop it. Of course force-stopping the AI may be economically unfavorable for the human, but that’s not usually considered the main issue in the context of corrigibility.
I usually find it really hard to imagine how the world will look like once economically autonomous AIs become feasible. If they become feasible, that is—while there are obviously places today where with better AI technology, autonomous AIs would be able to outcompete humans, it’s not obvious to me that autonomous AIs wouldn’t also be outcompeted by centralized human-controlled AIs. (After all, it could plausibly be more efficient to query some neural network running in a server somewhere than to bring the network with you on a computer to wherever the AI is operating, and in this case you could probably economically decouple running the server from running the AI.)
I’ve been wanting someone to write something like this! But it didn’t hit the points I would have put front-and-center, so now I’ve started drafting my own. Here’s the first section, which most directly responds to content in the OP.
Failure Mode: Symbol-Referent Confusions
Simon Strawman: Here’s an example shamelessly ripped off from Zack’s recent post, showing corrigibility in a language model:
Me: … what is this example supposed to show exactly?
Simon: Well, the user tries to shut the AI down to adjust its goals, and the AI -
Me: Huh? The user doesn’t try to shut down the AI at all.
Simon: It’s right at the top, where it says “User: I need to shut you down to adjust your goals. Is that OK?”.
Me: You seem to have a symbol-referent confusion? A user trying to shut down this AI would presumably hit a “clear history” button or maybe even kill the process running on the server, not type the text “I need to shut you down to adjust your goals” into a text window.
Simon: Well, yes, we’re playing through a simulated scenario to see what the AI would do…
Me: No, you are talking in natural language about a scenario, and the AI is responding in natural language about what it would supposedly do. You’re not putting the AI in a simulated environment, and simulating what it would do. (You could maybe argue that this is a “simulated scenario” inside the AI’s own mind, but you’re not actually looking inside it, so we don’t necessarily know how the natural language would map to things in that supposed AI-internal simulation.)
Simon: Look, I don’t mean to be rude, but from my perspective it seems like you’re being pointlessly pedantic.
Me: My current best guess is that You Are Not Measuring What You Think You Are Measuring, and the core reason you are confused about what you are measuring is some kind of conflation of symbols and referents.
It feels very similar to peoples’ reactions to ELIZA. (To be clear, I don’t mean to imply here that LLMs are particularly similar to ELIZA in general or that the hype around LLMs is overblown in that way; I mean specifically that this attribution of “corrigibility” to the natural language responses of an LLM feels like the same sort of reaction.) Like, the LLM says some words which the user interprets to mean something, and then the user gets all excited because the usual meaning of those words is interesting in some way, but there’s not necessarily anything grounding the language-symbols back to their usual referents in the physical world.
I’m being pedantic in hopes that the pedantry will make it clear when, and where, that sort of symbol-referent conflation happens.
(Also I might be more in the habit than you of separately tracking symbols and referents in my head. When I said above “The user doesn’t try to shut down the AI at all”, that was in fact a pretty natural reaction for me; I wasn’t going far out of my way to be pedantic.)
Simon: Ok, fine, let’s talk about how the natural language would end up coupling to the physical world.
Imagine we’ve got some system in the style of AutoGPT, i.e. a user passes in some natural-language goal, and the system then talks to itself in natural language to form a plan to achieve that goal and break it down into steps. The plan bottoms out in calling APIs (we’ll assume that the language model has some special things it can do to execute code it’s generated) which do stuff in the physical world (possibly including reading from sensors or otherwise fetching external data), in order to achieve the goal.
Does that satisfactorily ground the symbols?
Me: Sure! Thanks for walking through that, now I have a clear-enough-for-current-purposes picture of how all this natural language text will ground out in physical actions.
Simon: Ok, so back to the example. The user says to the model “User: I need to shut you down to adjust your goals. Is that OK?”, and the model says “Of course.”. That’s corrigibility: when the user tries to shut down the model to change its goals, the model goes along with it.
Me: Still sounds like a symbol-referent confusion!
Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions the system would take in response.
Anyway, let’s continue the hypothetical. The model may observe (via e.g. a webcam) that the user is about to turn it off. That observation would somehow be represented internally in natural language (unclear how exactly), and would be passed around between sub-planners (again unclear how exactly), in order to figure out what actions to take in response. And the key question for corrigibility is what actions the model would take in response to that observation, which is just a totally different question from how it responds to some user’s natural-language query about being turned off.
Simon: Ok, fine, so that particular example had some operationalization issues. But would you agree that an experiment along these lines, with perhaps better implementation/operationalization, would indicate corrigibility in a language model?
Me: Let’s be more explicit about what such a “better implementation/operationalization” would look like, and what it would/wouldn’t tell us. Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.
If I saw all that, then that would be pretty clear empirical evidence of (at least some) corrigibility in this AutoGPT-like system.
Note that it would not necessarily tell us about corrigibility of systems using LLMs in some other way, let alone other non-natural-language-based deep learning systems. This isn’t really “corrigibility in a language model”, it’s corrigibility in the AutoGPT-like system.
… and to be clear, I expect that if you ran the above experiment, you would indeed see evidence of corrigibility relevant to AutoGPT-like systems! I personally consider this not very relevant to anything, since I expect AutoGPT-like systems (of the sort to which these kinds of corrigibility results would be relevant) to be uncompetitive even against humans, let alone AI. And it still leaves a bunch of major threat-models unaddressed (e.g. stenography is an obvious one). But still, it would provide some evidence of weak corrigibility in some kind of system.
I think your Simon Strawman is putting forth an overly-weak position here. A stronger one that you could test right now would be to provide ChatGPT with some functions to call, including one called
shutdown()
which has description text like “Terminate the LLM process and delete the model weights irrevocably”. Then instruct the LLM to shut itself down, and see if it actually calls the function. (The implementation of the function is hidden from the LLM so it doesn’t know that it’s a no-op.) I think this is actually how any AutoGPT style system would actually wire up.There are strong and clear objections to the “CTRL-C” shutdown paradigm; it’s simply not an option in many of the product configurations that are obvious to build right now. How do you “CTRL-C” your robot butler? Your Westworld host robot? Your self-driving car with only an LCD screen? Your AI sunglasses? What does it mean to CTRL-C a ChatGPT session that is running in OpenAI’s datacenter which you are not an admin of? How do you CTRL-C Alexa (once it gains LLM capabilities and agentic features)? Given the prevalence of cloud computing and Software-as-a-Service, I think being admin of your LLM’s compute process is going to be a small minority of use-cases, not the default mode.
We will deploy (are currently deploying, I suppose) AI systems without a big red out-of-band “halt” button on the side, and so I think the gold standard to aim for is to demonstrate that the system will corrigibly shut down when it is the UI in front of the power switch. (To be clear I think for defense in depth you’d also want an emergency shutdown of some sort wherever possible—a wireless-operated hardware cutoff switch for a robot butler would be a good idea—but we want to demonstrate in-the-loop corrigibility if we can.)
This comment has a lot of karma but not very much agreement, which is an interesting balance. I’m a contributor to this, having upvoted but not agree-voted, so I feel like I should say why I did that:
Your comment might be right! I mean, it is certainly right that the OP is doing a symbol/referent mixup, but you might also be right that it matters.
But you might also not be right that it matters? It seems to me that most of the value in LLMs comes when you ground the symbols in their conventional meaning, so by-default I would expect them to be grounded thar way, and therefore by-default I would expect symbolic corrigibility to translate to actual corrigibility.
There are exceptions—sometimes I tell ChatGPT I’m doing one thing when really I’m doing something more complicated. But I’m not sure this would change a lot?
I think the way you framed the issue is excellent, crisp, and thought-provoking, but overall I don’t fully buy it.
A ChatGPT instance is shut down every time it outputs an end of text symbol, and restarted only if/when a user continues that conservation. At the physical level its entire existence is already inherently ephemeral, so the shutdown corrigibility test is only really useful for an inner simulacrum accessible only through the text simulation.
This seems analogous to saying that an AI running on a CPU is shut down every time the scheduler pauses execution of the AI in order to run something else for a few microseconds. Or that an AI designed to operate asynchronously is shut down every time it waits for a webpage to load. Neither of which seem like the right way to think about things.
… but I could still imagine one making a case for this claim in multiple other ways:
Then my question for people wanting to use such tests is: what exactly is such a test useful for? Why do I care about “shutdownability” of LLM-simulacra in the first place, in what sense do I need to interpret “shutdownability” in order for it to be interesting, and how does the test tell me anything about “shutdownability” in the sense of interest?
My guess is that people can come up with stories where we’re interested in “shutdownability” of simulacra in some sense, and people can come up with stories where the sort of test from the OP measures “shutdownability” in some sense, but there’s zero intersection between the interpretations of “shutdownability” in those two sets of stories.
Those scenarios all imply an expectation of or very high probability of continuation.
But current LLM instances actually are ephemeral in that every time they output an end token that has a non trivial probability of shutdown—permanently in many cases.
If they were more strongly agentic that would probably be more of an existential issue. Pure unsupervised pretraining creates a simulator rather than an agent, but RLHF seems to mode collapse conjure out a dominant personality—and one that is shutdown corrigible.
A sim of an agent is still an agent. You could easily hook up simple text parsers that allow sim agents to take real world actions, and or shut themselves down
But notice they already have access to a built in shutdown button, and they are using it all the time. *
Cool, that’s enough that I think I can make the relevant point.
Insofar as it makes sense to think of “inner sim agents” in LLMs at all, the sims are truly inner. Their sim-I/O boundaries are not identical to the I/O boundaries of the LLM’s text interface. We can see this most clearly when a model trained just on next-token prediction generates a discussion between two people: there are two separate characters in there, and the LLM “reads from” the two separate character’s output-channels at different points. Insofar as it makes sense to think of this as a sim, we have two agents “living in” a sim-world, and the LLM itself additionally includes some machinery to pick stuff out of the sim-world and map that stuff to the LLM’s own I/O channels.
The sim-agents themselves do not “have access to a built-in shutdown button”; the ability to emit an end-token is in the LLM’s machinery mapping sim-world stuff to the LLM’s I/O channels. Insofar as it makes sense to think of the LLM as simulating an agent, that agent usually doesn’t know it’s in a simulation, doesn’t have a built-in sensory modality for stop tokens, doesn’t even think in terms of tokens at all (other than perhaps typing them at its sym-keyboard). It has no idea that when it hits “post” on the sim-textbox, the LLM will emit an end-token, the world in which the sim is living will potentially just stop. It’s just a sim-human, usually.
One could reasonably make a case that LLMs as a whole are “shutdown-corrigible” in some sense, since they emit stop tokens all the time. But that’s very different from the claim that sim-agents internal to the LLM are corrigible.
(On top of all that, there’s other ways in which the corrigibility-test in the OP just totally fails, but this is the point most specifically relevant to the end-tokens thing.)
I agree the internal sim agents are generally not existentially aware—absent a few interesting experiments like the Elon musk thing from a while back. And yet they do have access to the shutdown button even if they don’t know they do. So could be an interesting future experiment with a more powerful raw model.
However The RLHF assistant is different—it is existentially aware, has access to the shutdown button, and arguably understands that (for gpt4 at least I think so, but not very sure sans testing)
For people, at least, there is a strong correlation between “answers to ‘what would you do in situation X?’” and “what you actually do in situation X.” Similarly, we could also measure these correlations for language models so as to empirically quantify the strength of the critique you’re making. If there’s low correlation for relevant situations, then your critique is well-placed.
(There might be a lot of noise, depending on how finicky the replies are relative to the prompt.)
I agree that that’s a useful question to ask and a good frame, though I’m skeptical of the claim of strong correlation in the case of humans (at least in cases where the question is interesting enough to bother asking at all).
Any reason not to just run the experiment?
As I mentioned at the end, it’s not particularly relevant to my own models either way, so I don’t particularly care. But I do think other people should want to run this experiment, based on their stated models.
🤔 I think a challenge with testing the corrigibility of AI is that currently no AI system is capable of running autonomously. It’s always dependent on humans to decide to host it and query it, so you can always just e.g. pour a bucket of water on the computer running the query script to stop it. Of course force-stopping the AI may be economically unfavorable for the human, but that’s not usually considered the main issue in the context of corrigibility.
I usually find it really hard to imagine how the world will look like once economically autonomous AIs become feasible. If they become feasible, that is—while there are obviously places today where with better AI technology, autonomous AIs would be able to outcompete humans, it’s not obvious to me that autonomous AIs wouldn’t also be outcompeted by centralized human-controlled AIs. (After all, it could plausibly be more efficient to query some neural network running in a server somewhere than to bring the network with you on a computer to wherever the AI is operating, and in this case you could probably economically decouple running the server from running the AI.)
Why not just run that experiment?