You mean, the question is how exactly the Utility Function for humanity’s preferences is calculated? That’s part of the problem. We cannot easily fit the entirety of our preferences into a simple Utility Function (which doesn’t mean that no Utility Function captures them perfectly, only that formalizing such a function is not achievable at the present moment). As Robert Miles once said, if we encode the 10 things we value most into a SuperAI’s Utility Function, the 11th thing is as good as gone.
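To make the “11th thing” point concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made up for illustration: the three value names, their weights, the square-root utility shape, and the helper `optimal_allocation`. It only shows that an optimizer that maximizes the encoded function exactly assigns nothing to a value that was left out of that function.

```python
def optimal_allocation(budget, weights, all_values):
    """Maximize sum(w_i * sqrt(x_i)) subject to sum(x_i) == budget.

    Closed form via Lagrange multipliers: x_i is proportional to w_i ** 2.
    A value missing from `weights` has weight 0 and therefore gets nothing.
    """
    squared = {name: weights.get(name, 0.0) ** 2 for name in all_values}
    total = sum(squared.values())
    return {name: budget * sq / total for name, sq in squared.items()}


# Hypothetical "true" preferences: three things matter, to different degrees.
true_weights = {"health": 3.0, "freedom": 2.0, "art": 1.0}

# The encoded Utility Function forgot one of them.
encoded_weights = {"health": 3.0, "freedom": 2.0}

intended = optimal_allocation(130.0, true_weights, true_weights)
actual = optimal_allocation(130.0, encoded_weights, true_weights)

print(intended["art"] > 0)  # the forgotten value would get a share if encoded
print(actual)               # the optimizer gives "art" exactly nothing
```

Note that the omitted value is not merely under-served: because every unit of budget has a better use according to the encoded function, it ends up at exactly zero.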
Can you describe what you think of when you say “humanity’s preferences”? The preferences of humans or human groups can and do conflict with each other, hence it is not just a question of complexity, right?
I’m sure there are multiple approaches to formalizing what we mean by “humanity’s preferences”, but an intuitive understanding is enough for the sake of this discussion. In my (speculative) opinion, the problem is precisely the complexity of a Utility Function that captures this intuitive understanding.
For simplicity, let’s say there’s a single human being with transitive preferences, and we want to perfectly align an AI with the preferences of this human. The cyclomatic complexity of such a perfect Utility Function can easily be higher than that of the human brain: it needs to perfectly predict the utility of every single thing “X”, including all the imaginable consequences of having “X”, and including Xes that humanity does not know about yet.
What should these values in the distant future be? That’s my question here.
That was in one of the links: whatever’s decided after thinking carefully for a very long time, less evilly, by a living civilization and not an individual person. But my point is that this does not answer the question of what values one should directly align an AGI with, since this is not a tractable optimization target. And any other optimization target, or any approximation of it that is tractable, is even worse if given to hard optimization. So the role of the values that an AGI should be aligned with is played by the things people want, the current approximations to that target, optimized for softly, in a way that avoids Goodhart’s curse but keeps an eye on that eventual target.
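As a rough illustration of hard versus soft optimization of a proxy, here is a toy Python sketch. The comment above does not name a specific method; quantilization (picking uniformly from the top fraction of actions by proxy score instead of taking the single best one) is swapped in here as one proposed form of soft optimization. The action space, the `proxy` and `true_value` functions, and the numbers 98 and 300 are all made up for the example: the proxy is assumed to track the true value on ordinary actions and to fail badly on the two most extreme ones.

```python
def proxy(action):
    """Tractable stand-in for what people want (assumed for this toy)."""
    return float(action)


def true_value(action):
    """Assumed ground truth: the proxy is right except at the extremes."""
    penalty = 300.0 if action >= 98 else 0.0
    return float(action) - penalty


def hard_optimize(actions):
    """Hard optimization: take the single action with the best proxy score."""
    return max(actions, key=proxy)


def quantilize(actions, q):
    """Soft optimization: keep the top q fraction of actions by proxy score.

    A quantilizer would then pick uniformly at random from this pool.
    """
    ranked = sorted(actions, key=proxy, reverse=True)
    keep = max(1, round(len(ranked) * q))
    return ranked[:keep]


def expected_true_value(candidates):
    """Average true value of a uniform random pick from `candidates`."""
    return sum(true_value(a) for a in candidates) / len(candidates)


actions = list(range(100))
hard_choice = hard_optimize(actions)
soft_pool = quantilize(actions, q=0.2)

print(true_value(hard_choice))         # -201.0
print(expected_true_value(soft_pool))  # 59.5
print(expected_true_value(actions))    # 43.5 (no optimization at all)
```

In this toy the hard optimizer lands exactly where the proxy is most wrong, while the soft version still does better than picking at random. How well that carries over depends entirely on how the proxy's errors are distributed, which is the part this sketch simply assumes.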
Got it, thanks.