Analogously, we might have a value-learning agent take the outside view. If it’s about to disable the off-switch, it might realize that, in situations like this one, disabling the off-switch is usually a terrible idea. That is, when you simulate your algorithm trying to learn the values of a wide range of different agents, the algorithm usually wrongly believes it should disable the off-switch.
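To make that concrete, here’s a minimal toy sketch of the kind of check I have in mind (all the names and numbers are made up; a real value learner would look nothing like this):

```python
import random

def simulate_value_learning(true_values, noise=1.5):
    """Toy 'learner': observes each of the agent's values with Gaussian noise."""
    return {k: v + random.gauss(0, noise) for k, v in true_values.items()}

def outside_view_check(n_trials=100_000):
    """Among simulated runs where the learner decides to disable the
    off-switch, how often was that decision a mistake?"""
    decided, mistakes = 0, 0
    for _ in range(n_trials):
        # Sample a hypothetical agent; most sampled agents value the off-switch.
        true_values = {"keep_off_switch": random.gauss(1.0, 1.0)}
        learned = simulate_value_learning(true_values)
        if learned["keep_off_switch"] < 0:           # learner wants to disable it
            decided += 1
            if true_values["keep_off_switch"] > 0:   # ...but the agent didn't want that
                mistakes += 1
    return mistakes / decided if decided else 0.0

# With these toy numbers this prints roughly 0.6-0.7: conditional on wanting
# to disable the off-switch, the simulated learner is usually wrong.
print(outside_view_check())
```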
Here’s the part that’s tricky:

Suppose we have an AI that extracts human preferences by modeling them as agents with a utility function over physical states of the universe (not world-histories). This is bad because then it will just try to put the world in a good state and keep it static, which isn’t what humans want.
The question is, will the OutsideView method reveal this mistake? Probably not, because the obvious way to generate the ground truth for your outside-view simulations is to sample different allowed parameters of the model you have of humans. And so the simulated humans will all have preferences over states of the universe.
In short, if your algorithm is something like RL based on a reward signal, and your OutsideView method models humans as agents, then the OutsideView can help you spot problems. But if your algorithm is modeling humans and learning their preferences, then the OutsideView can’t help, because it generates humans from your model of them. So this can’t be a source of a value-learning agent’s pessimism about its own righteousness.
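To make the blind spot concrete, here’s a minimal sketch with toy classes I just made up: the “ground truth” humans for the outside view are sampled from the learner’s own state-based model family, so the check reports no error, even though real humans also care about world-histories, which that family can’t represent at all.

```python
import random

class StateUtility:
    """The learner's model family: utility depends only on the final state."""
    def __init__(self, w):
        self.w = w

    def utility(self, history):
        return self.w * history[-1]  # the path the world took is ignored

def sample_simulated_human():
    # Ground truth for the outside-view check is drawn from the learner's
    # own model family, so every simulated human is state-based too.
    return StateUtility(w=random.uniform(-1, 1))

def outside_view_misfit_rate(n_trials=1000):
    misfits = 0
    for _ in range(n_trials):
        human = sample_simulated_human()
        # The best fit within the learner's family is trivially exact, because
        # the simulated human came from that family in the first place.
        fitted = StateUtility(w=human.w)
        history = [random.uniform(-1, 1) for _ in range(10)]
        if abs(fitted.utility(history) - human.utility(history)) > 1e-9:
            misfits += 1
    return misfits / n_trials

# Always 0.0: the check never surfaces the real mistake, because no simulated
# human has preferences over world-histories for the learner to get wrong.
print(outside_view_misfit_rate())
```

The 0.0 is doing no work; it just makes the circularity visible: you can’t sample your way out of a model class when the samples come from that same class.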
I agree. The implicit modeling assumptions make me pessimistic about simple concrete implementations.
In this post, I’m more gesturing towards a strong form of corrigibility which tends to employ this reasoning. For example, if I’m intent-aligned with you, I might ask myself: “What do I think I know, and why do I think I know it? I think I’m doing what you want, but how do I know that? What if my object-level reasoning is flawed?” One framing for this is taking the outside view on your algorithm’s flaws in similar situations. I don’t know exactly how that should best be done (even informally), so this post is exploratory.
Sure. Humans have a sort of pessimism about their own abilities that’s fairly self-contradictory.
“My reasoning process might not be right”, interpreted as you do in the post, includes a standard of rightness that one could figure out. It seems like you could just… do the best thing, especially if you’re a self-modifying AI. Even if you have unresolvable uncertainty about what is right, you can just average over that uncertainty and take the highest-expected-rightness action.
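Concretely, by “average over that uncertainty” I mean something like this minimal sketch (made-up credences and made-up rightness scores, nothing more):

```python
# "Average over your unresolved uncertainty about what's right, then act":
# a toy example with two candidate standards of rightness.

credences = {"standard_A": 0.6, "standard_B": 0.4}  # credences sum to 1

# How right each action is under each candidate standard (made-up values).
rightness = {
    "defer_to_human": {"standard_A": 0.9, "standard_B": 0.7},
    "act_unilaterally": {"standard_A": 1.0, "standard_B": 0.1},
}

def expected_rightness(action):
    return sum(p * rightness[action][std] for std, p in credences.items())

best = max(rightness, key=expected_rightness)
print(best, expected_rightness(best))  # -> defer_to_human, roughly 0.82
```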
Humans seem to remain pessimistic despite this by evaluating rightness using inconsistent heuristics, and not having enough processing power to cause too much trouble by smashing those heuristics together. I’m not convinced this is something we want to put into an AI. I guess I’m also more of an optimist about the chances to just do value learning well enough.
(Below is my response to my best understanding of your reply – let me know if you were trying to make a different point)
It can be simultaneously true that ideal intent-aligned reasoners could just execute the expected-best policy, that overcoming bias generally involves assessing the performance of your algorithm in a given situation, and that it’s profitable to think about that aspect explicitly with respect to corrigibility. So I think I agree with you, but I’m interested in the heuristics that corrigible reasoning might tend to use.