I can write down many such utility functions easily, as contrasted with the difficulty of describing friendliness, so I can at least hope to design an AI which has one of them.
What you can do when you can write down stable utility functions—after you have solved the self-modification stability problem—but you can’t yet write down CEV—is a whole different topic from this sort of AI-boxing!
My claim is that it is easier to write down some stable utility functions than others. This is intimately related to the OP, because I am claiming as a virtue of my approach to boxing that it leaves us with the problem of getting an AI to follow essentially any utility function consistently. I am not purporting to solve that problem here, just making the claim that it is obviously no harder and almost obviously strictly easier than friendliness.
What you can do when you can write down stable utility functions—after you have solved the self-modification stability problem—but you can’t yet write down CEV—is a whole different topic from this sort of AI-boxing!
I don’t think I understand this post.
My claim is that it is easier to write down some stable utility functions than others. This is intimately related to the OP, because I am claiming as a virtue of my approach to boxing that it leaves us with the problem of getting an AI to follow essentially any utility function consistently. I am not purporting to solve that problem here, just making the claim that it is obviously no harder and almost obviously strictly easier than friendliness.