No, I haven’t bothered to track the idea because it’s not useful.
I roll to disbelieve. I won’t comment on whether this proposal will actually work, but if we could reliably make AIs motivated to be shut down when we want them to be, or at least not fight our shutdown commands, that would to a large extent solve the AI existential-risk problem.
So it’s still useful to know whether AIs could be shut down without the model fighting you. Unfortunately, this is mostly an ‘if’, not a ‘when’, question.
So I’d look at the literature to see if AI shutdown could work. I’m not claiming the literature has solved the AI shutdown problem, but it’s a useful research direction.
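The incentive at stake can be sketched with a toy expected-reward calculation (my illustration, not anything from the thread; the payoff numbers and the `expected_reward` function are made up for exposition). Under the usual assumptions, a reward maximizer strictly prefers disabling its off-switch, and the ‘utility indifference’ family of proposals tries to remove that preference by equalizing the payoffs:

```python
# Toy illustration: why a reward-maximizing agent resists shutdown,
# and how equalizing payoffs ("utility indifference") removes the incentive.
# All numbers are arbitrary stand-ins for expository purposes.

def expected_reward(disable_switch: bool, shutdown_ordered: bool,
                    task_reward: float = 10.0,
                    shutdown_reward: float = 0.0) -> float:
    """Reward the agent expects in a simple two-step episode."""
    if shutdown_ordered and not disable_switch:
        return shutdown_reward   # agent complies, is shut down, forfeits task reward
    return task_reward           # agent survives (or no order given) and finishes the task

# Default payoffs: when a shutdown is ordered, disabling the switch
# strictly dominates complying (10.0 vs 0.0).
assert expected_reward(True, True) > expected_reward(False, True)

# Indifference-style fix: pay out the same reward for being shut down,
# so fighting the shutdown command no longer gains the agent anything.
assert expected_reward(True, True, shutdown_reward=10.0) == \
       expected_reward(False, True, shutdown_reward=10.0)
```

This is of course the trivial one-shot case; the literature is about whether anything like this survives learning, multi-step planning, and the agent modeling the overseer.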
There are definitely useful things you can say about the ‘if’, because it’s not always the case that they will. The research direction I’d consider promising here is continuing the DeepMind-affiliated vein of work on causal influence diagrams, to better understand which DRL algorithms and which evolutionary processes lead to which kinds of reward-seeking/hacking behavior. It’s not as simple as ‘all DRL agents will seek to hack in the same way’: there are a lot of differences between model-free/model-based or value/policy approaches, etc. (I also think this would be a very useful way to taxonomize LLM dynamics and the things I have been commenting on with regard to DALL-E 2, Bing Sydney, and LLM steganography.)