I agree. The implicit modeling assumptions make me pessimistic about simple concrete implementations.
In this post, I’m more gesturing towards a strong form of corrigibility which tends to employ this reasoning. For example, if I’m intent-aligned with you, I might ask myself “What do I think I know, and why do I think I know it? I think I’m doing what you want, but how do I know that? What if my object-level reasoning is flawed?”. One framing for this is taking the outside view on your algorithm’s flaws in similar situations. I don’t know exactly how that should best be done (even informally), so this post is exploratory.
Sure. Humans have a sort of pessimism about their own abilities that’s fairly self-contradictory.
“My reasoning process might not be right”, interpreted as you do in the post, includes a standard of rightness that one could figure out. It seems like you could just… do the best thing, especially if you’re a self-modifying AI. Even if you have unresolvable uncertainty about what is right, you can just average over that uncertainty and take the highest-expected-rightness action.
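To make the "average over that uncertainty" move concrete, here's a minimal sketch. The candidate standards of rightness, the credences, and the action scores are all invented for illustration; the point is just that unresolvable normative uncertainty still yields a well-defined expected-rightness maximizer.

```python
# Hypothetical sketch: acting under unresolvable uncertainty about "rightness".
# All names and numbers below are made up for illustration.

# Credence assigned to each candidate standard of rightness.
credences = {"standard_a": 0.6, "standard_b": 0.4}

# How each standard scores each available action.
scores = {
    "standard_a": {"act_1": 1.0, "act_2": 0.2},
    "standard_b": {"act_1": 0.1, "act_2": 0.9},
}

def expected_rightness(action):
    """Average an action's score over the credence distribution on standards."""
    return sum(p * scores[s][action] for s, p in credences.items())

# Take the action with the highest expected rightness.
best = max(scores["standard_a"], key=expected_rightness)
```

Here the agent picks `act_1` (expected rightness 0.64 vs. 0.48), even though neither standard is known to be correct, which is the sense in which an idealized reasoner can "just do the best thing" despite the uncertainty.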
Humans seem to remain pessimistic despite this by evaluating rightness using inconsistent heuristics, and not having enough processing power to cause too much trouble by smashing those heuristics together. I’m not convinced this is something we want to put into an AI. I guess I’m also more of an optimist about the chances to just do value learning well enough.
(Below is my response to my best understanding of your reply – let me know if you were trying to make a different point)
It can be simultaneously true that ideal intent-aligned reasoners could just execute the expected-best policy, that overcoming bias generally involves assessing the performance of your algorithm in a given situation, and also that it’s profitable to think about that aspect explicitly with respect to corrigibility. So I think I agree with you, but I’m interested in the heuristics that corrigible reasoning might tend to use.