Indeed. It was obvious to me. I just never said it out loud to avoid acceleration.
Personally, I said it out loud to people on this site a bunch of times in the context of explaining how LLMs could be used to optimize things, and the comment "GPT-10 could be turned into something dangerous with a one-line bash script" has been bandied around repeatedly by at least several prominent people. Interpretability research is important for a reason!
Likewise, and I’m sure there are bunches of people who expected this sort of use. But I hadn’t thought through all of the ways this could add to capabilities, and I didn’t expect it to be quite so easy.
What I don't think has been recognized very much are the immense upsides for initial alignment, corrigibility, and interpretability. The problems being discussed over at the Alignment Forum do not appear to be much more difficult than natural language-based wrapper approaches would make them (to be clear, I think there are still real difficulties in all of these, let alone for outer alignment, coordination, and alignment and coordination stability). I could be wrong, and perhaps everyone has been talking around the implications of this approach to avoid catalyzing it, as you and I have. But avoiding it so much as to change which problems you're focusing on seems unlikely.