That makes a lot of sense, thanks for the link. It is not as dangerous a situation as a true agent AGI, since this failure mode involves a (relatively stupid) user error. I trust researchers not to make that mistake, but it seems like there is no way to safely make those systems available to the public.
After reading this, I thought of a way to make the scenario more plausible: accidentally making the model think it’s hostile. Perhaps you make a joking remark about paperclip maximizers, or maybe it just so happens that the chat history is similar to the premise of a story about a hostile AGI in its dataset, and it thinks you’re making a reference. Suddenly, it’s trying to model an unaligned AGI. The system can then generate outputs which deceive you into doing something stupid, such as running the shell script described in the linked post, or creating a seemingly aligned AGI agent based on its suggestions.
Yeah, exactly. That said, I don’t think the event in the story is a “stupid” user error. It’s quite a reasonable one. Suppose nobody considered this problem and this language model was installed in a next-gen smart home assistant, and someone asked it to order them the best possible pizza… In general, I think it’s dangerous to assume anyone is “smart enough” to avoid anything, because if common sense were common, the world would make more sense.