Yes, that’s a hard part. But specifying the goal accurately is often regarded as a potential failure point. So, if I’m right that this is a simpler, easier-to-specify alignment goal, that’s progress. It also has the advantage of incorporating corrigibility as a by-product, so it’s resistant to partial failure—if you can tell that something went wrong in time, the AGI can be asked to shut down.
WRT the difficulty of using the AGI’s understanding as its terminal goal: I think it’s not trivial, but quite doable, at least in some of the AGI architectures we can anticipate. See my two short posts Goals selected from learned knowledge: an alternative to RL alignment and The (partial) fallacy of dumb superintelligence.
Thanks, I’ll check those out.