When I’m deciding whether to run an AI, I should be maximizing the expectation of my utility function w.r.t. my belief state. This is just what it means to act rationally. You can then ask, how is this compatible with trusting another agent smarter than myself?
One potentially useful model is: I’m good at evaluating and bad at searching (after all, P≠NP). I can therefore delegate searching to another agent. But, as you point out, this doesn’t account for situations in which I seem to be bad at evaluating. Moreover, if the AI prior takes an intentional stance towards the user (in order to help learning their preferences), then the user must be regarded as good at searching.
A better model is: I’m good at both evaluating and searching, but the AI can access actions and observations that I cannot. For example, having additional information can allow it to evaluate better. An important special case is: the AI is connected to an external computer (Turing RL) which we can think of as an “oracle”. This allows the AI to have additional information which is purely “logical”. We need infra-Bayesianism to formalize this: the user has Knightian uncertainty over the oracle’s outputs entangled with other beliefs about the universe.
For instance, in the chess example, if I know that a move was produced by exhaustive game-tree search then I know it’s a good move, even without having the skill to understand why the move is good in any more detail.
Now let’s examine short-term quantilization for chess. On each cycle, the AI finds a short-term strategy leading to a position that the user evaluates as good, but that the user would require luck to manage on their own. This is repeated again and again throughout the game, leading to overall play substantially superior to the user’s. On the other hand, this play is not as good as the AI would achieve if it just optimized for winning at chess without any constrains. So, our AI might not be competitive with an unconstrained unaligned AI. But, this might be good enough.
I’m not sure what you’re saying in the “turning off the stars example”. If the probability for the user to autonomously decide to turn off the stars is much lower than the quantilization fraction, then the probability that quantilization will decide to turn off the stars is low. And, the quantilization fraction is automatically selected like this.
Agree with the first section, though I would like to register my sentiment that although “good at selecting but missing logical facts” is a better model, it’s still not one I’d want an AI to use when inferring my values.
I’m not sure what you’re saying in the “turning off the stars example”. If the probability for the user to autonomously decide to turn off the stars is much lower than the quantilization fraction, then the probability that quantilization will decide to turn off the stars is low. And, the quantilization fraction is automatically selected like this.
I think my point is if “turn off the stars” is not a primitive action, but is a set of states of the world that the AI would overwhelming like to go to, then the actual primitive actions will get evaluated based on how well they end up going to that goal state. And since the AI is better at evaluating than us, we’re probably going there.
Another way of looking at this claim is that I’m telling a story about why the safety bound on quantilizers gets worse when quantilization is iterated. Iterated quantilization has much worse bounds than quantilizing over the iterated game, which makes sense if we think of games where the AI evaluates many actions better than the human.
I think you misunderstood how the iterated quantilization works. It does not work by the AI setting a long-term goal and then charting a path towards that goal s.t. it doesn’t deviate too much from the baseline over every short interval. Instead, every short-term quantilization is optimizing for the user’s evaluation in the end of this short-term interval.
Ah. I indeed misunderstood, thanks :) I’d read “short-term quantilization” as quantilizing over short-term policies evaluated according to their expected utility. My story doesn’t make sense if the AI is only trying to push up the reported value estimates (though that puts a lot of weight on these estimates).
When I’m deciding whether to run an AI, I should be maximizing the expectation of my utility function w.r.t. my belief state. This is just what it means to act rationally. You can then ask, how is this compatible with trusting another agent smarter than myself?
One potentially useful model is: I’m good at evaluating and bad at searching (after all, P≠NP). I can therefore delegate searching to another agent. But, as you point out, this doesn’t account for situations in which I seem to be bad at evaluating. Moreover, if the AI prior takes an intentional stance towards the user (in order to help learning their preferences), then the user must be regarded as good at searching.
A better model is: I’m good at both evaluating and searching, but the AI can access actions and observations that I cannot. For example, having additional information can allow it to evaluate better. An important special case is: the AI is connected to an external computer (Turing RL) which we can think of as an “oracle”. This allows the AI to have additional information which is purely “logical”. We need infra-Bayesianism to formalize this: the user has Knightian uncertainty over the oracle’s outputs entangled with other beliefs about the universe.
For instance, in the chess example, if I know that a move was produced by exhaustive game-tree search then I know it’s a good move, even without having the skill to understand why the move is good in any more detail.
Now let’s examine short-term quantilization for chess. On each cycle, the AI finds a short-term strategy leading to a position that the user evaluates as good, but that the user would require luck to manage on their own. This is repeated again and again throughout the game, leading to overall play substantially superior to the user’s. On the other hand, this play is not as good as the AI would achieve if it just optimized for winning at chess without any constrains. So, our AI might not be competitive with an unconstrained unaligned AI. But, this might be good enough.
I’m not sure what you’re saying in the “turning off the stars example”. If the probability for the user to autonomously decide to turn off the stars is much lower than the quantilization fraction, then the probability that quantilization will decide to turn off the stars is low. And, the quantilization fraction is automatically selected like this.
Agree with the first section, though I would like to register my sentiment that although “good at selecting but missing logical facts” is a better model, it’s still not one I’d want an AI to use when inferring my values.
I think my point is if “turn off the stars” is not a primitive action, but is a set of states of the world that the AI would overwhelming like to go to, then the actual primitive actions will get evaluated based on how well they end up going to that goal state. And since the AI is better at evaluating than us, we’re probably going there.
Another way of looking at this claim is that I’m telling a story about why the safety bound on quantilizers gets worse when quantilization is iterated. Iterated quantilization has much worse bounds than quantilizing over the iterated game, which makes sense if we think of games where the AI evaluates many actions better than the human.
I think you misunderstood how the iterated quantilization works. It does not work by the AI setting a long-term goal and then charting a path towards that goal s.t. it doesn’t deviate too much from the baseline over every short interval. Instead, every short-term quantilization is optimizing for the user’s evaluation in the end of this short-term interval.
Ah. I indeed misunderstood, thanks :) I’d read “short-term quantilization” as quantilizing over short-term policies evaluated according to their expected utility. My story doesn’t make sense if the AI is only trying to push up the reported value estimates (though that puts a lot of weight on these estimates).