So it’s still useful to know if AIs could be shut down without the model fighting you. Unfortunately, this is mostly a if, not a when question.
There’s definitely useful things you can say about ‘if’, because it’s not always the case they will. The research directions I’d consider promising here would be continuing the DM-affiliated vein of work on causal influence diagrams to better understood what DRL algorithms and what evolutionary processes would lead to what kinds of reward-seeking/hacking behavior. It’s not as simple as ‘all DRL agents will seek to hack in the same way’: there’s a lot of differences between model-free/based or value/policy etc. (I also think this would be a very useful way to taxonomize LLM dynamics and the things I have been commenting about with regard to DALL-E 2, Bing Sydney, and LLM steganography.)
There’s definitely useful things you can say about ‘if’, because it’s not always the case they will. The research directions I’d consider promising here would be continuing the DM-affiliated vein of work on causal influence diagrams to better understood what DRL algorithms and what evolutionary processes would lead to what kinds of reward-seeking/hacking behavior. It’s not as simple as ‘all DRL agents will seek to hack in the same way’: there’s a lot of differences between model-free/based or value/policy etc. (I also think this would be a very useful way to taxonomize LLM dynamics and the things I have been commenting about with regard to DALL-E 2, Bing Sydney, and LLM steganography.)