It seems like corrigibility can’t be usefully described as acting according to some terminal goal. But AIs are not by default expected utility maximizers in the ontology of the real world, so it could be possible to get them to do the desired thing despite our lacking a sensible formal picture of it.
I’m guessing some aspects of corrigibility might be about acting according to a whole space of goals (at the same time), which is easier to usefully describe: some quantilizer-like thing, selected according to more natural desiderata, that acts in a particular way in accordance with a collection of goals, with the space of goals not necessarily thought of as uncertainty about an unknown goal.
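To make the quantilizer-like picture slightly more concrete, here is a minimal Python sketch, purely illustrative and not a proposal: instead of maximizing a single goal, the selector only considers actions that fall in the top-q fraction of a base distribution for every goal in a collection, then samples from the base distribution restricted to that intersection. All names and the exact construction are my own assumptions.

```python
import random

def quantilize_over_goals(actions, base_weights, goals, q=0.1, rng=random):
    """Sketch of acting 'in accordance with a collection of goals':
    keep only actions in the top-q base-probability mass for EVERY goal,
    then sample from the base distribution restricted to that set.

    actions: list of candidate actions
    base_weights: base-distribution weights, same order as actions
    goals: list of utility functions, each mapping an action to a number
    q: fraction of base probability mass to keep per goal
    """
    total = sum(base_weights)

    def top_q_set(goal):
        # Rank actions by this goal's utility (best first) and keep them
        # until q of the base probability mass is covered.
        order = sorted(range(len(actions)),
                       key=lambda i: goal(actions[i]), reverse=True)
        kept, mass = set(), 0.0
        for i in order:
            kept.add(i)
            mass += base_weights[i]
            if mass >= q * total:
                break
        return kept

    # Intersection: actions acceptable under every goal at once.
    acceptable = set(range(len(actions)))
    for goal in goals:
        acceptable &= top_q_set(goal)
    if not acceptable:
        return None  # no action clears the bar for all goals simultaneously

    idx = list(acceptable)
    weights = [base_weights[i] for i in idx]
    return actions[rng.choices(idx, weights=weights, k=1)[0]]

# Toy usage: integer "actions", two goals that pull in different directions;
# only actions high-quantile under both are ever selected.
acts = list(range(100))
picked = quantilize_over_goals(acts, [1.0] * len(acts),
                               goals=[lambda a: a, lambda a: -abs(a - 90)],
                               q=0.2)
```

The point of the sketch is only that "a whole space of goals" need not be averaged over as uncertainty about one true goal; the action just has to be simultaneously unremarkable-to-good under each member of the collection.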
plausibly will work with dumb agents
This is not about being dumb, it’s about not actually engaging in planning. Failing at this does require some level of non-dumbness, but not conversely, unless spontaneous mesa-optimizers appear all over the place (a kind of cognitive cancer), which probably takes capability many orders of magnitude above merely not being dumb. So for a start, train the models, not the agent.