I agree that something like this is plausible. To be more specific:
My favorite terms for it, which I mostly use in private thoughts and haven’t really written up elsewhere, are covariant concepts vs contravariant concepts. If you are familiar with type systems or category theory, you can think of it in roughly that sense. Basically, a contravariant concept is defined by something like a utility function, a boolean predicate, or a set of constraints, while a covariant concept is defined by something like a distribution/measure, a convex set, or a schematic. Covariant = output, contravariant = input, essentially. Or in AI terms, generative models vs classifiers.
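A toy sketch of the distinction in Haskell (my own illustration, using only the standard `Data.Functor.Contravariant` module): a classifier-like concept puts its subject in input position and composes by `contramap`, while a generator-like concept puts its subject in output position and composes by `fmap`.

```haskell
import Data.Functor.Contravariant (Contravariant (..), Predicate (..))

-- Contravariant concept: a classifier. The subject sits in input position,
-- so to reuse it on richer things you map *into* the original type.
goodOutcome :: Predicate Double
goodOutcome = Predicate (> 0.8)

goodPlan :: Predicate (String, Double)  -- reuse the classifier on (label, score) pairs
goodPlan = contramap snd goodOutcome

-- Covariant concept: a generator (a trivial weighted list standing in for a
-- distribution). The subject sits in output position, so you map the
-- *outputs* forward.
newtype Gen a = Gen [(a, Double)]

instance Functor Gen where
  fmap f (Gen xs) = Gen [(f x, w) | (x, w) <- xs]

plans :: Gen Double
plans = Gen [(0.3, 0.5), (0.9, 0.5)]

describedPlans :: Gen String
describedPlans = fmap (\s -> "plan scoring " ++ show s) plans

main :: IO ()
main = do
  print (getPredicate goodPlan ("cure aging", 0.9))  -- True
  let Gen xs = describedPlans
  mapM_ print xs
```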
It does seem like some concepts are very anti-natural to define in a contravariant way, and very natural to define in a covariant way. In fact I would dare to go further and say that contravariant concepts usually cannot really be defined without a library of covariant concepts to define them in terms of, because you need something to “grab onto” reality, and contravariant concepts can’t really do that. This makes it seem quite plausible that covariance will play a critical role in aligned AI.
Some further things:
A simple way to handle covariant concepts is to take them as primitive. This is what e.g. generative latent variable models tend to do. The trouble is that in the real world, they are not primitive, but instead reduce into smaller-scale components. Sometimes we want the AI to respect the sanctity of those components (e.g. don’t wirehead people) while other times we want the AI to intervene in the components (e.g. do cure aging). (And sometimes it’s controversial what we want it to do, e.g. with em uploads.) Taking them as primitive on one level of analysis does not really support novel interventions on the lower levels of analysis, nor does it really support respecting their sanctity on a higher level of analysis.
Covariance and contravariance are not totally absolute. The point of search, for instance, is basically to convert contravariant things into covariant ones.
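As a minimal sketch of that conversion (again my own toy, not anything deeper): brute-force search takes a contravariant spec (a predicate) plus a covariant supply of candidates, and hands back a covariant artifact, a concrete witness.

```haskell
import Data.List (find)

-- Search as a converter: contravariant spec in, covariant witness out.
search :: (a -> Bool) -> [a] -> Maybe a
search = find

main :: IO ()
main = print (search (\n -> n * n > (50 :: Int)) [1 ..])  -- Just 8
```

Note that even here the search needs a covariant candidate supply to chew on, which is part of why I think contravariant concepts don’t stand on their own.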
A program that can act in the world to achieve good things would be a covariant representation of goodness. However, this program needs to be created in some way, and the world seems much too complex for engineers to handle each part. Instead it seems that we need some sort of learning process to capture it. But I have trouble thinking of learning processes that could plausibly learn novel, useful things without involving contravariant reasoning somewhere along the line. So I don’t think the trouble with contravariance can be avoided for alignment.
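To gesture at what I mean by “contravariant reasoning somewhere along the line” (another toy of my own, not a claim about how real systems are built): the loss is a contravariant spec on parameters, and the training loop converts it into a covariant artifact, a concrete fitted parameter you can then run.

```haskell
-- Contravariant ingredient: a loss, parameters in, score out.
loss :: Double -> Double
loss w = (w * 3 - 2) ^ (2 :: Int)

-- Numerical gradient, purely for illustration.
numGrad :: (Double -> Double) -> Double -> Double
numGrad f w = (f (w + h) - f (w - h)) / (2 * h) where h = 1e-6

-- Covariant product: a concrete parameter setting.
fit :: Int -> Double -> Double
fit 0 w = w
fit n w = fit (n - 1) (w - 0.05 * numGrad loss w)

main :: IO ()
main = print (fit 200 0)  -- roughly 2/3, the minimizer of the loss
```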
Not sure if this aligns 100% with your way of thinking about things, but the framing might be helpful?