With all due respect to Gwern, repeating claims that work has already been done and then refusing to substantiate them is an epistemic train wreck.
I don’t think that’s what’s happening here, so I feel confused about this comment. I haven’t seen Gwern ‘refuse to substantiate them’. He indeed commented pretty extensively about the details of your comment.
Shutdown-seekingness has definitely been discussed a bunch over the years. It seems to come up a lot in Tool-AI adjacent discussions as well as impact measures. I also don’t have a great link here sadly, though I have really seen it discussed a lot over the last decade or so (and Gwern summarizes the basic reasons why I don’t think it’s very promising).
Also, it is clear that Gwern did not read the linked research about language agents, since it is simply false, and obviously so, to claim that the generative agents in the Stanford study are the same thing as Gato.
This seems straightforwardly correct? Maybe you have misread Gwern’s comment. He says:
They cannot be ‘much safer’ because they are the same thing: a decoder Transformer trained to predict a set of offline RL episodes. A GPT is a goal-conditioned imitation-learning DRL agent, just like Gato (which recall, trained in GPT-style on natural text as one task, just to make the relationship even clearer)
Paraphrased, he says (as I understand it) “GPTs, which are where all the juice in the architectures that you are talking about comes from, are ultimately the same as Gato architecturally”. This seems correct to me; the architecture is indeed basically the same. I also don’t understand how “language agents” that ultimately just leverage a language model, which is where all the agency would come from, would somehow avoid agency.
Christopher King: I believe this has been proposed before (I’m not sure what the first time was).
Gwern: This has been proposed before (as their citations indicate), and this particular proposal does not seem to introduce any particularly novel (or good) solutions.
Simon Goldstein: Is there other work you can point us to that proposes positively shutdown-seeking agents?
Gwern: No, I haven’t bothered to track the idea because it’s not useful.
I find it odd that so many people on the forum feel certain that the proposal in the post has already been made, but none are able to produce any evidence that this is so. Might the present proposal perhaps be different in important respects from prior proposals? Might we perhaps refrain from dismissing it if we can’t even remember what the prior proposals were?
The interesting thing about language agent architectures is that they wrap a GPT in a folk-psychological agent architecture which stores beliefs and desires in natural language and recruits the GPT to interpret its environment and plan actions. The linked post argues that this has important safety implications. So pointing out that Gato is not so different from a GPT is missing the point in a way that, to my mind, is only really possible if one has not bothered to read the linked research. What is relevant is the architecture in which the GPT is embedded, not the GPT itself.
Might the present proposal perhaps be different in important respects from prior proposals? Might we perhaps refrain from dismissing it if we can’t even remember what the prior proposals were?
Yep, that’s a big red flag I saw. It didn’t even try to explain why this proposal wouldn’t work, and straightforwardly dismissed the research when it had potentially different properties compared to past work.
Might we perhaps refrain from dismissing it if we can’t even remember what the prior proposals were?
I mean, I definitely remember! I could summarize them, I just don’t have a link ready, since they were mostly in random comment threads. I might go through the effort of trying to search for things, but the problem is not one of remembering, but one of finding things in a sea of 10 years of online discussion in which many different terms have been used to point to the relevant ideas.
The linked post argues that this has important safety implications. So pointing out that Gato is not so different from a GPT is missing the point in a way that, to my mind, is only really possible if one has not bothered to read the linked research. What is relevant is the architecture in which the GPT is embedded, not the GPT itself.
I think this is false (in that what matters is GPT itself, not the architecture within which it is embedded), though you are free to disagree with this. I don’t think it implies not having read the underlying research (I had read the relevant paper and looked at its architecture and I don’t really buy that it makes things safer in any relevant way).
My intention is not to criticize you in particular!
Let me describe my own thought process with respect to the originality of work. If I get an academic paper to referee and I suspect that it’s derivative, I treat it as my job to demonstrate this by locating a specific published work that has already proposed the same theory. If I can’t do this, I don’t criticize it for being derivative. The epistemic rationale for this is as follows: if the experts working in an area are not aware of a source that has already published the idea, then even if the idea has already been published somewhere obscure, it is useful for the epistemic community to have something new to cite in discussing it. And of course, if I’ve discussed the idea in private with my colleagues but the paper I am refereeing is the first discussion of the idea I have seen written down, my prior discussions do not show the idea isn’t original — my personal discussions don’t constitute part of the collective knowledge of the research community because I haven’t shared them publicly.
It’s probably not very fruitful to continue speculating about whether Gwern read the linked paper. It does seem to me that your disagreement directly targets our thesis in the linked paper (which is productive), whereas the disagreement I quoted above took Simon to be making the rather different claim that GPTs (considered by themselves) are not architecturally similar to Gato.
I’m referring to the exchange quoted above between Christopher King, Gwern, and Simon Goldstein.
I should clarify that I think some of Gwern’s other points are valuable — I was just quite put off by the beginning of the post.