I skimmed the paper when they announced it on Twitter. It seemed like it fundamentally ignores every possibility vaguely like mesa-optimization or imitation learning, and can’t deal with things like, say, GPT-3 meta-learning agency to better predict data derived from agents (i.e. humans). They leave themselves an out by handwaving away all such inconveniences as ‘iron ore agents’, but then the concept is thoroughly useless and circular: “What’s an iron ore agent?” “It’s one which has dangerous outcomes due to hidden agency.” “OK, which agents are those? How can you tell AlphaZero from GPT-3 from AGI?” “Well, try them and see!”