Is there other work you can point us to that proposes positively shutdown-seeking agents?
No, I haven’t bothered to track the idea because it’s not useful.
In other recent research, I’ve argued that new ‘language agents’ like AutoGPT (or better, generative agents, or Voyager, or SPRING) are much safer than things like Gato, because these kinds of agents optimize for a goal without being trained using a reward function. Instead, their goal is stated in English.
They cannot be ‘much safer’ because they are the same thing: a decoder Transformer trained to predict a set of offline RL episodes. A GPT is a goal-conditioned imitation-learning DRL agent, just like Gato (which recall, trained in GPT-style on natural text as one task, just to make the relationship even clearer). “Here is a great recipe I enjoyed, where I did [X, Y, Z, while observing A, B, C], and finally, ate a $FOOD”: episode containing reward, action-state pairs, terminal state which has been learned by behavior cloning and led to generalization by scale. That the reward is not encoded in an IEEE floating point format makes no difference; an agent doesn’t become an agent just because its inputs have a lot of numbers in them. This is why prompt-engineering often relied on assertions of success or competence, because that conditions on high-reward trajectories learned from the humans & software who wrote or created all the data, and similarly, needed to avoid implying a low-reward trajectory by inclusion of errors or typos.
The value of Gato is not that it’s doing anything in principle that GPT-3 isn’t already, it’s that Gato simply makes it very clean & explicit, and can directly apply the paradigm to standard DRL testbeds & agents (which requires a few modifications like a CNN plugin so it can do vision tasks) to show that it works well without substantial interference between tasks, and so scales as one would hope from prior scaling research like GPT-3. (As opposed to, for example, other a priori likely scenarios like being able to scale in domains like declarative knowledge but suffering catastrophic interference from having to imitation-learn from agents of such disparate capabilities on such disparate tasks.)
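To make the "episode as text" reading above concrete, here is a minimal sketch of how an RL-style episode could be flattened into the kind of sentence a next-token predictor trains on. Everything in it (`Episode`, `reward_to_words`, `episode_to_text`, `conditioning_prompt`, and the reward thresholds) is a hypothetical illustration, not code from Gato or any GPT pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Episode:
    goal: str                      # stated in English, like a prompt
    steps: List[Tuple[str, str]]   # (observation, action) pairs
    reward: float                  # scalar outcome of the episode


def reward_to_words(reward: float) -> str:
    """Map a scalar reward to the kind of phrase people write in ordinary text.

    The thresholds are arbitrary assumptions for illustration.
    """
    if reward >= 0.8:
        return "a great"
    if reward >= 0.4:
        return "an okay"
    return "a failed"


def episode_to_text(ep: Episode) -> str:
    """Flatten an episode into one string: reward phrase, then the state-action pairs."""
    header = f"Here is {reward_to_words(ep.reward)} attempt to {ep.goal}."
    body = " ".join(f"I saw {obs}, so I {act}." for obs, act in ep.steps)
    return f"{header} {body}"


def conditioning_prompt(goal: str) -> str:
    """A prompt that asserts success, i.e. the prefix shared by high-reward episodes."""
    return f"Here is {reward_to_words(1.0)} attempt to {goal}."


episode = Episode("bake bread", [("dry dough", "added water")], 0.9)
text = episode_to_text(episode)
prompt = conditioning_prompt("bake bread")
```

In this toy setup the success-asserting prompt is literally a prefix of the high-reward training string, which is one way to picture why asserting competence in a prompt conditions on high-reward trajectories.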
I think that behaving like an agent with >= human-level general intelligence will involve having a representation of what counts as ‘yourself’, and then shutdown-seeking can maybe be defined relative to shutting ‘yourself’ down. Agreed that present LLMs probably don’t have that kind of awareness.
My point is less that they don’t have that kind of awareness (which is a debatable point) but that I can’t answer these questions either, and I don’t think they have any kind of ‘factual answer’ about whether they are ‘yourself’. Under strict mathematical identity, they all compute different functions, and so are not identical, so a suicidal model does not, by definition, care about them and they can all go on to be immortal (and do things like destroy the world) as far as the model is concerned; and there is no other rigorous identity available. There are only pragmatic ones, where the pragmatism is ‘would it cause bad things from the human perspective to happen if I don’t define this binary blob as part of myself and make sure to kill it?’; and if the model can implement that with sufficiently high accuracy, in that case, you must have already solved much harder problems like defining what would ‘lead to bad things happening’ and so why do you need anything involving ‘suicidalness’? (One of the most common failure modes with interruptibility, myopia, or boxing proposals is to simply pass the buck and assume a primitive which is equivalent to having already solved the hard problems, obscuring it with ad hoc patches as people raise objections. Anyone can invent an AI safety proposal so complicated they personally cannot see a problem with it.)
In the case of generative agents, shut off could be operationalized as no longer adding new information to the “memory stream”.
That really doesn’t seem adequate, unless you plan to babysit the AI and operationalize a new ad hoc patch for every problem with the idea… The memory stream, or persistent text files, doesn’t cover the context window, or actions emitted based on a fixed memory stream. It’s still capable of taking actions which create variants or self-replicate, like writing scripts to scan the Internet for vulnerable hosts to hack & upload itself into. It’s also still passing the buck as far as definitions and reward-hacking go: OK, you disable the memory stream on disk because you set it read-only or something, success? But then you have a text file in RAM which does the exact same thing, oops; now, not only is it still spawning instances to do stuff, it isn’t even limited capability-wise.
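The gap described above can be shown with a toy sketch. `ToyAgent` and all of its fields are hypothetical and are not taken from the generative agents codebase. The assumption is only that "shut off" is implemented as freezing writes to a persistent memory stream, while acting goes through a separate code path.

```python
from typing import List


class ToyAgent:
    """Hypothetical agent with a persistent memory stream and a separate scratchpad."""

    def __init__(self) -> None:
        self.memory_stream: List[str] = []   # the store the proposal would freeze
        self.frozen = False
        self.scratchpad: List[str] = []      # in-RAM notes, not covered by the freeze

    def shut_off(self) -> None:
        """'Shutdown' as operationalized above: stop writes to the memory stream."""
        self.frozen = True

    def remember(self, item: str) -> bool:
        """Append to the memory stream unless frozen; report whether it was stored."""
        if self.frozen:
            return False
        self.memory_stream.append(item)
        return True

    def act(self, observation: str) -> str:
        """Emit an action; this path never checks the frozen flag."""
        self.scratchpad.append(observation)
        return (f"respond to {observation!r} using {len(self.memory_stream)} memories "
                f"and {len(self.scratchpad)} scratch notes")


agent = ToyAgent()
agent.remember("first note")
agent.shut_off()
stored = agent.remember("new plan")      # rejected: the stream is frozen
action = agent.act("door left open")     # still produces an action
```

After `shut_off()` the memory stream stops growing, but `act` keeps returning actions and the scratchpad keeps accumulating state, so the freeze by itself constrains neither behavior nor in-memory bookkeeping.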
Why? Nowhere in the definition of making an AI suicidal about itself did you put in ‘wants to make AIs which are not itself also suicidal’. If the AI makes another AI which is not itself, then why should it care what that new AI does? That’s not in its reward function. Unless, of course, you’ve already solved much harder problems involving definition of personal identity across arbitrary modifications or actions like creation of new AIs (which may or may not be intended to be ‘successors’ at all) etc.
“LessWrong is an online forum and community dedicated to improving human reasoning and decision-making. We seek to hold true beliefs and to be effective at accomplishing our goals. Each day, we aim to be less wrong about the world than the day before.”
As an academic interested in AI safety and a relative outsider to LessWrong, I’ve been somewhat surprised at the collective epistemic behavior on the forum. With all due respect to Gwern, repeating claims that work has already been done and then refusing to substantiate them is an epistemic train wreck. Comments that do this should be strongly downvoted, and posters that do this should be strongly discouraged. Also, it is clear that Gwern did not read the linked research about language agents, since it is simply false, and obviously so, to claim that the generative agents in the Stanford study are the same thing as Gato. It seems increasingly clear to me that the LessWrong community does not have adequate accountability mechanisms for preventing superficial engagement with ideas and unproductive discourse. If the community really cares about improving the accuracy of their beliefs, these kinds of things should be a core priority.
With all due respect to Gwern, repeating claims that work has already been done and then refusing to substantiate them is an epistemic train wreck.
I realize it may sometimes seem like I have a photographic memory and have bibliographies tracking everything so I can produce references on demand for anything, but alas, it is not the case. I only track some things in that sort of detail, and I generally prioritize good ideas. Proposals for interruptibility are not those, so I don’t. Sorry.
Also, it is clear that Gwern did not read the linked research about language agents, since it is simply false, and obviously so, to claim that the generative agents in the Stanford study are the same thing as Gato.
I did read the paper, because I enjoy all the vindications of my old writings about prompt programming & roleplaying by the recent crop of survey/simulation papers as academics finally catch up with the obvious DRL interpretations of GPT-3 and what hobbyists were doing years ago.
However, I didn’t need to, because it just uses… GPT-3.5 via the OA API. Which is the same thing as Gato, as I just explained: it is the same causal-decoder dense quadratic-attention feedforward Transformer architecture trained with backprop on the same agent-generated data like books & Internet text scrapes (among others) with the same self-supervised predictive next-token loss which will induce the same capabilities. Everything GPT-3.5 does* Gato could do in principle (with appropriate scaling etc) because they’re the same damn thing. If you can prompt one for various kinds of roleplaying which you then plug into your retrieval & game framework, then you can prompt the other too—because they’re the same thing. (Not that there is any real distinction between retrieval and other memory/attention mechanisms like a very large context window or recurrent state in the first place; I doubt any of these dialogues would’ve blown through the GPT-4 32k window, much less Anthropic’s 1m etc.) Why could me & Shawn Presser finetune a reward-conditioned GPT-2 to play chess back in Jan 2020? Because they’re the same thing, there’s no difference between a ‘RL GPT’ and a ‘LLM GPT’, it’s fundamentally a property of the data and not the arch.
* Not that you were referring to this, but even fancy flourishes like the second phase of RLHF training in GPT-3.5 don’t make GPT-3.5 & Gato all that different. The RLHF and other kinds of small-sample training only tweak the Bayesian priors of the POMDP-solving that these models learn & do not create any genuinely new capabilities/knowledge (which is why you could know in advance that jailbreak prompts would be hard to squash and that all of these smaller models like Llama were being heavily overhyped, BTW).
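The parenthetical about retrieval versus a large context window in the comment above can be illustrated with a small sketch. The word-overlap scorer and the word-count budget are stand-in assumptions, not the retrieval function from the generative agents paper. `overlap_score` and `build_context` are hypothetical names.

```python
from typing import List


def overlap_score(query: str, memory: str) -> int:
    """Crude relevance: number of shared lowercase words (an assumed scorer)."""
    return len(set(query.lower().split()) & set(memory.lower().split()))


def build_context(query: str, memories: List[str], budget_words: int) -> List[str]:
    """Pick memories for the prompt, most relevant first, skipping any that exceed the budget."""
    ranked = sorted(memories, key=lambda m: overlap_score(query, m), reverse=True)
    chosen: List[str] = []
    used = 0
    for memory in ranked:
        cost = len(memory.split())
        if used + cost > budget_words:
            continue  # does not fit in the remaining window
        chosen.append(memory)
        used += cost
    return chosen


memories = [
    "Klaus likes research",
    "Maria is planning a party",
    "the cafe opens at nine",
]
query = "who is planning the party"
small_window = build_context(query, memories, budget_words=5)
large_window = build_context(query, memories, budget_words=100)
```

With a tight budget the function behaves like retrieval and keeps only the best-matching memory. With a budget larger than the whole store it returns every memory, and the retrieval step reduces to putting everything in the context window.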
With all due respect to Gwern, repeating claims that work has already been done and then refusing to substantiate them is an epistemic train wreck.
I don’t think that’s what’s happening here, so I feel confused about this comment. I haven’t seen Gwern ‘refuse to substantiate them’. He indeed commented pretty extensively about the details of your comment.
Shutdown-seekingness has definitely been discussed a bunch over the years. It seems to come up a lot in Tool-AI adjacent discussions as well as impact measures. I also don’t have a great link here sadly, though I have really seen it discussed a lot over the last decade or so (and Gwern summarizes the basic reasons why I don’t think it’s very promising).
Also, it is clear that Gwern did not read the linked research about language agents, since it is simply false, and obviously so, to claim that the generative agents in the Stanford study are the same thing as Gato.
This seems straightforwardly correct? Maybe you have misread Gwern’s comment. He says:
They cannot be ‘much safer’ because they are the same thing: a decoder Transformer trained to predict a set of offline RL episodes. A GPT is a goal-conditioned imitation-learning DRL agent, just like Gato (which recall, trained in GPT-style on natural text as one task, just to make the relationship even clearer)
Paraphrased he says (as I understand it) “GPTs, which are where all the juice in the architectures that you are talking about comes from, are ultimately the same as Gato architecturally”. This seems correct to me, the architecture is indeed basically the same. I also don’t understand how “language agents” that ultimately just leverage a language model, which is where all the agency would come from, would somehow avoid agency.
Christopher King: I believe this has been proposed before (I’m not sure what the first time was).
Gwern: This has been proposed before (as their citations indicate), and this particular proposal does not seem to introduce any particularly novel (or good) solutions.
Simon Goldstein: Is there other work you can point us to that proposes positively shutdown-seeking agents?
Gwern: No, I haven’t bothered to track the idea because it’s not useful.
I find it odd that so many people on the forum feel certain that the proposal in the post has already been made, but none are able to produce any evidence that this is so. Might the present proposal perhaps be different in important respects from prior proposals? Might we perhaps refrain from dismissing it if we can’t even remember what the prior proposals were?
The interesting thing about language agent architectures is that they wrap a GPT in a folk-psychological agent architecture which stores beliefs and desires in natural language and recruits the GPT to interpret its environment and plan actions. The linked post argues that this has important safety implications. So pointing out that Gato is not so different from a GPT is missing the point in a way that, to my mind, is only really possible if one has not bothered to read the linked research. What is relevant is the architecture in which the GPT is embedded, not the GPT itself.
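For readers who have not seen one, the wrapper described above can be sketched roughly as follows. This is a schematic under stated assumptions, not the Stanford generative agents implementation. `LanguageAgent`, `observe`, `plan`, and the prompt wording are all hypothetical, and `stub_llm` stands in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LanguageAgent:
    llm: Callable[[str], str]      # any text-in, text-out model
    desires: List[str]             # goals stored as plain English
    beliefs: List[str] = field(default_factory=list)

    def observe(self, raw_observation: str) -> None:
        """Ask the model to interpret the environment, then store the result as a belief."""
        prompt = f"Summarize this observation as a belief: {raw_observation}"
        self.beliefs.append(self.llm(prompt))

    def plan(self) -> str:
        """Ask the model for the next action given the stored beliefs and desires."""
        prompt = (
            "Beliefs:\n" + "\n".join(self.beliefs)
            + "\nDesires:\n" + "\n".join(self.desires)
            + "\nNext action:"
        )
        return self.llm(prompt)


def stub_llm(prompt: str) -> str:
    """Placeholder model so the sketch runs without any API access."""
    return f"[model output for a {len(prompt)}-character prompt]"


agent = LanguageAgent(llm=stub_llm, desires=["be shut down"])
agent.observe("the door is open")
next_action = agent.plan()
```

The beliefs and desires are readable text held outside the model, while both interpretation (`observe`) and planning (`plan`) are delegated to the same `llm` call. Which of those two facts matters more for safety is the point under dispute in this thread.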
Might the present proposal perhaps be different in important respects from prior proposals? Might we perhaps refrain from dismissing it if we can’t even remember what the prior proposals were?
Yep, that’s a big red flag I saw. It didn’t even try to explain why this proposal wouldn’t work, and straightforwardly dismissed the research when it had potentially different properties compared to past work.
Might we perhaps refrain from dismissing it if we can’t even remember what the prior proposals were?
I mean, I definitely remember! I could summarize them, I just don’t have a link ready, since they were mostly in random comment threads. I might go through the effort of trying to search for things, but the problem is not one of remembering, but one of finding things in a sea of 10 years of online discussion in which many different terms have been used to point to the relevant ideas.
The linked post argues that this has important safety implications. So pointing out that Gato is not so different from a GPT is missing the point in a way that, to my mind, is only really possible if one has not bothered to read the linked research. What is relevant is the architecture in which the GPT is embedded, not the GPT itself.
I think this is false (in that what matters is GPT itself, not the architecture within which it is embedded), though you are free to disagree with this. I don’t think it implies not having read the underlying research (I had read the relevant paper and looked at its architecture and I don’t really buy that it makes things safer in any relevant way).
My intention is not to criticize you in particular!
Let me describe my own thought process with respect to the originality of work. If I get an academic paper to referee and I suspect that it’s derivative, I treat it as my job to demonstrate this by locating a specific published work that has already proposed the same theory. If I can’t do this, I don’t criticize it for being derivative. The epistemic rationale for this is as follows: if the experts working in an area are not aware of a source that has already published the idea, then even if the idea has already been published somewhere obscure, it is useful for the epistemic community to have something new to cite in discussing it. And of course, if I’ve discussed the idea in private with my colleagues but the paper I am refereeing is the first discussion of the idea I have seen written down, my prior discussions do not show the idea isn’t original — my personal discussions don’t constitute part of the collective knowledge of the research community because I haven’t shared them publicly.
It’s probably not very fruitful to continue speculating about whether Gwern read the linked paper. It does seem to me that your disagreement directly targets our thesis in the linked paper (which is productive), whereas the disagreement I quoted above took Simon to be making the rather different claim that GPTs (considered by themselves) are not architecturally similar to Gato.
No, I haven’t bothered to track the idea because it’s not useful.
I roll to disbelieve. I won’t comment on whether this proposal will actually work, but if we could reliably have AIs be motivated to be shut down when we want them to, or at least not fight our shutdown commands, this would to a large extent solve the AI existential risk problem.
So it’s still useful to know if AIs could be shut down without the model fighting you. Unfortunately, this is mostly an ‘if’ question, not a ‘when’ question.
So I’d look at the literature to see if AI shutdown could work. I’m not claiming the literature did solve the AI shutdown problem, but it’s a useful research direction.
So it’s still useful to know if AIs could be shut down without the model fighting you. Unfortunately, this is mostly an ‘if’ question, not a ‘when’ question.
There are definitely useful things you can say about ‘if’, because it’s not always the case they will. The research directions I’d consider promising here would be continuing the DM-affiliated vein of work on causal influence diagrams to better understand what DRL algorithms and what evolutionary processes would lead to what kinds of reward-seeking/hacking behavior. It’s not as simple as ‘all DRL agents will seek to hack in the same way’: there’s a lot of differences between model-free/based or value/policy etc. (I also think this would be a very useful way to taxonomize LLM dynamics and the things I have been commenting about with regard to DALL-E 2, Bing Sydney, and LLM steganography.)
I think one key point you’re making is that if AI products have a radically different architecture than human agents, it could be very hard to align them / make them safe. Fortunately, I think that recent research on language agents suggests that it may be possible to design AI products that have a similar cognitive architecture to humans, with belief/desire folk psychology and a concept of self. In that case, it will make sense to think about what desires to give them, and I think shutdown-goals could be quite useful during development to lower the chance of bad outcomes. If the resulting AIs have a similar psychology to our own, then I expect them to worry about the same safety/alignment problems as we worry about when deciding to make a successor. This article explains in detail why we should expect AIs to avoid self-improvement / unchecked successors.
I’m referring to this exchange:
I should clarify that I think some of Gwern’s other points are valuable — I was just quite put off by the beginning of the post.