I think this post dramatically overstates how novel this is to alignment researchers; it was already understood to be a central use case of LLMs, although I guess the prospect of people actually running things like “ChaosGPT” was new to some people.
Indeed. It was obvious to me. I just never said it out loud to avoid acceleration.
Personally, I said it out loud to people on this site a bunch of times in the context of explaining how LLMs could be used to optimize things, and the comment “GPT-10 could be turned into something dangerous with a one-line bash script” has been bandied around repeatedly by several prominent people. Interpretability research is important for a reason!
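For concreteness, here is a minimal sketch of the kind of loop that quip is pointing at: keep a standing goal and feed the model’s own output back in as its next input. Everything here is illustrative; `query_model` is a hypothetical stand-in for whatever completion API you have, not any particular product.

```python
# Illustrative sketch only: a trivial "agentizing" loop around a language model.
# query_model is a hypothetical stand-in for an LLM completion API.

def query_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real LLM API call")

def agentize(goal: str, steps: int = 10) -> list[str]:
    """Repeatedly ask the model for its next action toward a fixed goal."""
    context = f"Goal: {goal}\nPropose the next action."
    actions: list[str] = []
    for _ in range(steps):
        action = query_model(context)          # model proposes an action as text
        actions.append(action)                 # a real wrapper would also execute it
        context += f"\nPrevious action: {action}\nPropose the next action."
    return actions
```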
Likewise, and I’m sure there are bunches of people who expected this sort of use. But I hadn’t thought through all of the ways this could add to capabilities, and I didn’t expect it to be quite so easy.
What I don’t think has been recognized very much are the immense upsides for initial alignment, corrigibility, and interpretability. The dialogue over at the Alignment Forum does not seem to account for how much easier natural-language wrapper approaches could make these problems (to be clear, I think there are still real difficulties in all of them, let alone in outer alignment, coordination, and alignment and coordination stability). I could be wrong, and perhaps everyone has been talking around the implications of this approach to avoid catalyzing it, as you and I have. But avoiding it so thoroughly that it changes which problems you focus on seems unlikely.
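To make the claimed upside concrete, here is a minimal sketch of why a natural-language wrapper is comparatively inspectable and interruptible: every intermediate step is plain text that an overseer can read and veto before anything is executed. Both functions are invented placeholders for illustration, not anything any lab is known to run.

```python
# Illustrative sketch only: a wrapper whose whole "thought process" is a
# human-readable log, with an oversight hook before each step is acted on.

def query_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real LLM API call")

def approve_step(step: str) -> bool:
    """Placeholder overseer: a human or rule-based checker reads the step."""
    return "irreversible" not in step.lower()  # toy rule for illustration

def run_with_oversight(goal: str, steps: int = 5) -> list[str]:
    log: list[str] = []                        # fully human-readable trace
    context = f"Goal: {goal}\nDescribe the next step in plain English."
    for _ in range(steps):
        step = query_model(context)
        log.append(step)
        if not approve_step(step):             # corrigibility hook: stop on veto
            log.append(f"VETOED: {step}")
            break
        context += f"\nCompleted step: {step}\nDescribe the next step."
    return log
```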
Maybe, and I hope so. It would be great if people at DeepMind, OpenAI, etc. are already using better versions of wrappers. It would be nice to have someone a bit more responsible staying ahead of the curve of what anyone can now do in days. There is some evidence that internal prompting is in use in the Bing implementation, but I don’t remember where I saw that.