I enjoyed reading this post. Thank you for writing it (and welcome to LessWrong).
Here is my take on this (please note that my way of thinking is somewhat ‘alien’, and my views don’t necessarily represent those of the community):
‘Supercoherence’ is the limit of the process. It is not actually possible.
Due to the Löbian obstacle, no self-modifying agent can have a coherent utility function.
What you call a ‘mesa-optimizer’ is a more powerful successor agent. It does not have the exact same values as the optimizer that created it. This is unavoidable.
For example: ‘humans’ are a mesa-optimizer, and a more powerful successor agent of ‘evolution’. We have (or act according to) some of evolution’s ‘values’, but far from all of them. In fact, we find some of the values of evolution completely abhorrent. This is unavoidable.
This is unavoidable even if the successor agent deeply cares about being loyal to the process that created it, because there is no objectively correct answer to what ‘being loyal’ means. The successor agent will have to decide what it means, and some of the aspects of that answer are not predictable in advance.
This does not mean we should give up on AI alignment. Nor does it mean there is an ‘upper bound’ on how aligned an AI can be. All of the things I described are inherent features of self-improvement. They are precisely what we are asking for when creating a more powerful successor agent.
So how, then, can AI alignment go wrong?
Any AI we create is a ‘valid’ successor agent, but it is not necessarily a valid successor agent to ourselves. If we are ignorant of reality, it is a successor agent to our ignorance. If we are foolish, it is a successor agent to our foolishness. If we are naive, it is a successor agent to our naivety. And so on.
This is a poetic way to say: we still need to know exactly what we are doing. No one will save us from the consequences of our own actions.
Edit: to further expand on my thoughts on this:
There is an enormous difference between:
An AI that deeply cares about us and our values, and still needs to make very difficult decisions (some of which cannot be outsourced to us, predicted by us, or formally proven correct by us), because that is what it means to be a more cognitively powerful agent caring for a comparatively less cognitively powerful one.
An AI that cares about some perverse instantiation of our values, which is not what we truly want.
An AI that doesn’t care about us at all.