In popular perception, this is how China acts in international relations: it mostly pursues instrumentally convergent goals, like increasing its influence and accumulating resources, while holding its strategic goals “in the background”, without explicitly “rushing” to achieve them right now.
This reminds me of Sun Tzu’s saying, “If you wait by the river long enough, the bodies of your enemies will float by.”
Does such a strategy count as misalignment? If the beliefs held by the agent are compatible[1] with the overseer’s beliefs, I don’t think so. The agent’s understanding of the world could be deeper, and therefore its cognitive horizon and its “light cone” of agency and concern (see Levin, 2022; Witkowski et al., 2023) could extend farther or deeper than those of the overseer.
Then, the superintelligent agent could evolve its beliefs to the point where they are no longer compatible with the overseer’s “original” beliefs, or even with the overseer’s existence. In the latter case, or if the agent fails to convince the overseers of (a simplified version of) its new beliefs, that would constitute misalignment, of course. But that goes beyond the scope of what is considered in the post and in my comment above.
By “compatible”, I mean either coinciding or reducible with minimal problems, as general relativity reduces to Newtonian mechanics in certain regimes with negligible numerical divergence.
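As a rough illustration of the kind of reduction meant here (standard weak-field, slow-motion limit; the numbers are approximate): the geodesic equation of general relativity collapses to Newton’s law of gravity, and the relative size of the relativistic correction is set by the dimensionless potential $GM/(c^2 r)$,

$$
\frac{d^2 x^i}{dt^2} \approx -\partial_i \Phi, \qquad \frac{G M_\oplus}{c^2 R_\oplus} \approx \frac{(6.67\times 10^{-11})(5.97\times 10^{24})}{(3\times 10^8)^2\,(6.37\times 10^6)} \approx 7\times 10^{-10},
$$

so near Earth’s surface the two theories’ predictions coincide to within roughly one part in a billion.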
Thanks, that’s a great example!
Yeah, I don’t think it necessarily counts as misalignment. In fact, corrigibility probably looks behaviorally a lot like this: gathering the ability to affect the world, without making irreversible decisions, and waiting for the overseer to direct how to cash that out into ultimate effects. But the hidability means that “ultimate intents” or “deep intents” are conceptually murky, and therefore it’s not obvious how to read them off an agent: if you can’t discern them through behavior, what can you discern them through?
Only if we know the AI’s entire learning trajectory (including the training data) and have a high-resolution interpretability mapping along the way. If we don’t have this, or if the AI learns online and is not inspected with mechanistic interpretability tools during that process, we have no way of knowing about any “deep beliefs” the AI may hold, if it doesn’t reveal them in its behavior or “thoughts” (explicit representations during inference).