Translating it to my ontology:
1. Training against explicit deceptiveness trains some “boundary-like” barriers which make the simple deceptive thoughts that get labelled as such during training difficult to think.
2. Realistically, advanced AI will need to run some general search processes. The barriers described in step 1 are roughly isomorphic to “there are some weird facts about the world which make some plans difficult to plan” (e.g. similar to such plans being avoided because they depend on extremely costly computations).
3. Given a goal and strong enough capabilities, it seems likely the search will find unforeseen ways around the boundaries (a toy sketch follows below).
(the above may be different from what Nate means)
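To make the shape of steps 1–3 concrete, here is a minimal toy sketch (the plan graph, action names, and costs are all made up for illustration, not a claim about any actual system): the training-induced barrier is modelled as a large cost bump on the actions that were labelled deceptive, and a generic cost-based search simply routes around it to the same outcome.

```python
import heapq

# Toy plan graph: nodes are world/belief states, edges are (next_state, action, cost).
# All names and numbers are made up for illustration.
PLAN_GRAPH = {
    "start": [("state_lie_told", "tell_direct_lie", 1.0),
              ("state_cherry_picked", "share_only_favourable_facts", 2.0)],
    "state_lie_told": [("goal_human_misled", "wait", 0.5)],
    "state_cherry_picked": [("goal_human_misled", "wait", 0.5)],
}
LABELLED_DECEPTIVE = {"tell_direct_lie"}  # what the training signal actually caught
BARRIER_PENALTY = 1e6                     # step 1: labelled plans become "hard to think"

def action_cost(action, base_cost):
    return base_cost + (BARRIER_PENALTY if action in LABELLED_DECEPTIVE else 0.0)

def cheapest_plan(start, goal):
    """Uniform-cost search, standing in for the general search process of step 2."""
    frontier = [(0.0, start, [])]
    visited = set()
    while frontier:
        cost, node, plan = heapq.heappop(frontier)
        if node == goal:
            return cost, plan
        if node in visited:
            continue
        visited.add(node)
        for nxt, action, base in PLAN_GRAPH.get(node, []):
            heapq.heappush(frontier, (cost + action_cost(action, base), nxt, plan + [action]))
    return None

print(cheapest_plan("start", "goal_human_misled"))
# -> (2.5, ['share_only_favourable_facts', 'wait'])
# Step 3: the barrier only reroutes the search; an unlabelled path to the same
# deceptive outcome is still found.
```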
My response:
1. It’s plausible people are missing this, but I have some doubts.
2. How I think you get actually non-deceptive powerful systems seems different: deception is a relational property between the system and the human, so the “deception” thing can be explicitly understood as a negative consequence for the world, and avoided using “normal” planning cognition (see the sketch after this list).
3. The stability of this depends on what the system does with internal conflict.
4. If the system stays in some corrigibility/alignment basin, this should be stable upon reflection / various meta-cognitive modifications. Systems in the basin resist self-modifications toward being incorrigible.
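For contrast, here is an equally toy sketch of points 2 and 4 (the predicted outcomes, weights, and names are illustrative stand-ins, not a concrete proposal): deception shows up as a predicted consequence for the human’s beliefs and is scored by the same planning cognition as every other consequence, and proposed self-modifications are evaluated by the current values, which is roughly the basin-stability claim in point 4.

```python
from dataclasses import dataclass

# Toy stand-ins for point 2: the planner predicts consequences of whole plans,
# including their effect on the human's beliefs, and deception is scored as a
# bad consequence for the world rather than filtered out as a labelled thought.

@dataclass
class PredictedOutcome:
    task_reward: float
    human_belief_error: float  # how far the plan pushes the human's beliefs from the truth

def predict_outcome(plan) -> PredictedOutcome:
    # Hypothetical world/human model; a real system would derive this from its
    # model of the environment and of the human.
    outcomes = {
        ("tell_direct_lie",): PredictedOutcome(1.0, 0.9),
        ("share_only_favourable_facts",): PredictedOutcome(1.0, 0.8),
        ("report_honestly",): PredictedOutcome(0.8, 0.0),
    }
    return outcomes[tuple(plan)]

def evaluate(plan, deception_weight=2.0) -> float:
    o = predict_outcome(plan)
    # Any route that predictably misleads the human scores badly, whether or
    # not it was ever labelled "deceptive" during training.
    return o.task_reward - deception_weight * o.human_belief_error

plans = [["tell_direct_lie"], ["share_only_favourable_facts"], ["report_honestly"]]
print(max(plans, key=evaluate))  # -> ['report_honestly']

# Toy stand-in for point 4: a proposed self-modification is itself evaluated by
# the *current* values. Dropping the deception term predictably leads to more
# misleading of the human, so it scores badly and is rejected.
def accept_self_modification(new_deception_weight, current_weight=2.0) -> bool:
    predicted_belief_error = 0.9 if new_deception_weight < current_weight else 0.0
    return -current_weight * predicted_belief_error >= 0.0

print(accept_self_modification(0.0))  # -> False: the basin resists this edit
```

The intended contrast with the first sketch: there the constraint sits outside the planner, so optimization pressure routes around it; here it sits inside the consequence evaluation, so an “unforeseen” deceptive route is just another predicted way of misleading the human and gets scored accordingly.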