This seems to assume that we solve various Goodhart’s law and deception problems
I think it’s working on one part of the problem, while other parts remain. If I were to be equally uncharitable, I’d say you seem to assume that if you can’t solve everything all at once, you shouldn’t say anything.
I don’t actually think you assume that.
What I do think is that Instruction-following AGI is easier and more likely than value aligned AGI, and that it offers a route to solving Goodharting and deception. It's complex and unfinished, like every other proposed approach to avoiding death by AGI. If you'd like more meticulous detail, see Max Harms' admirably detailed Corrigibility as Singular Target (CAST) sequence, which covers a very similar alignment target and approach to solving Goodharting and deception.
That is completely fair, and I was being uncharitable (which is evidently what happens when I post before I have my coffee; apologies).
I do worry that we're not being clear enough that we don't have solutions for this worryingly near-term problem, and I think there's far too little public recognition that it may be hard or even unsolvable.