Yeah, I don’t think it necessarily counts as misalignment. In fact, corrigibility probably looks behaviorally a lot like this: gathering the ability to affect the world without making irreversible decisions, and waiting for the overseer to direct how to cash that out into ultimate effects. But the hidability means that “ultimate intents” or “deep intents” are conceptually murky, and therefore not obvious how to read off an agent: if you can’t discern them through behavior, what can you discern them through?
Only if we know the entire learning trajectory of the AI (including the training data) and have high-resolution interpretability mapping along the way. If we don’t have this, or if the AI learns online and is not inspected with mech-interp tools during that process, we have no way of knowing about any “deep beliefs” the AI may hold, if it doesn’t reveal them in its behavior or “thoughts” (explicit representations during inference).
Thanks, that’s a great example!