It seems like corrigibility can’t be usefully described as acting according to some terminal goal. But AIs are not by default expected utility maximizers in the ontology of the real world, so it could be possible to get them to do the desired thing despite our lacking a sensible formal picture of it.
I’m guessing some aspects of corrigibility might be about acting according to a whole space of goals (at the same time), which is easier to usefully describe: some quantilizer-like thing, selected according to more natural desiderata, that acts in a particular way in accordance with a collection of goals, with the space of goals not necessarily thought of as uncertainty about an unknown goal.
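To make the quantilizer-like picture slightly more concrete, here is a minimal Python sketch, purely illustrative and not a proposal: instead of maximizing a single goal, the selector only considers actions that fall in the top-q fraction of a base distribution for every goal in a collection, then samples from the base distribution restricted to that intersection. All names and the exact construction are my own assumptions.

```python
import random

def quantilize_over_goals(actions, base_weights, goals, q=0.1, rng=random):
    """Sketch of acting 'in accordance with a collection of goals':
    keep only actions in the top-q base-probability mass for EVERY goal,
    then sample from the base distribution restricted to that set.

    actions: list of candidate actions
    base_weights: base-distribution weights, same order as actions
    goals: list of utility functions, each mapping an action to a number
    q: fraction of base probability mass to keep per goal
    """
    total = sum(base_weights)

    def top_q_set(goal):
        # Rank actions by this goal's utility (best first) and keep them
        # until q of the base probability mass is covered.
        order = sorted(range(len(actions)),
                       key=lambda i: goal(actions[i]), reverse=True)
        kept, mass = set(), 0.0
        for i in order:
            kept.add(i)
            mass += base_weights[i]
            if mass >= q * total:
                break
        return kept

    # Intersection: actions acceptable under every goal at once.
    acceptable = set(range(len(actions)))
    for goal in goals:
        acceptable &= top_q_set(goal)
    if not acceptable:
        return None  # no action clears the bar for all goals simultaneously

    idx = list(acceptable)
    weights = [base_weights[i] for i in idx]
    return actions[rng.choices(idx, weights=weights, k=1)[0]]

# Toy usage: integer "actions", two goals that pull in different directions;
# only actions high-quantile under both are ever selected.
acts = list(range(100))
picked = quantilize_over_goals(acts, [1.0] * len(acts),
                               goals=[lambda a: a, lambda a: -abs(a - 90)],
                               q=0.2)
```

The point of the sketch is only that "a whole space of goals" need not be averaged over as uncertainty about one true goal; the action just has to be simultaneously unremarkable-to-good under each member of the collection.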
plausibly will work with dumb agents
This is not about being dumb, it’s about not actually engaging in planning. Failing at this does require some level of non-dumbness, but not conversely, unless spontaneous mesa-optimizers appear all over the place (a kind of cognitive cancer), which probably takes capability many orders of magnitude above merely not being dumb. So for a start, train the models, not the agent.