I can definitely tap into the “This should work!” intuition, which says that there should be a way to avoid the problem without significantly changing the feedback loop—if only we could articulate to the system the mistake it is making. Yet, it seems like to address these sorts of failures you have to change the feedback loop.
What does it mean for an AI who knows a lot more about what the world is to do what a human wants?
Utility functions are likely the wrong concept (Stuart Armstrong has given a lot of reasons to think this). My suspicion is that the better concept is “what a human would want you to do in a situation”; IE, you try and extract a policy rather than a utility. That’s a little like approval-direction in flavor. A big problem: like my “human hypothesis evaluation” above, it would require the AI to construct human-understandable explanations of its potential cognitive states to the human. (“What action do I take if I’m thinking all these things?”)
What other concepts do we need to refactor? Maybe knowledge?
I can definitely tap into the “This should work!” intuition, which says that there should be a way to avoid the problem without significantly changing the feedback loop—if only we could articulate to the system the mistake it is making. Yet, it seems like to address these sorts of failures you have to change the feedback loop.
What does it mean for an AI who knows a lot more about what the world is to do what a human wants?
Utility functions are likely the wrong concept (Stuart Armstrong has given a lot of reasons to think this). My suspicion is that the better concept is “what a human would want you to do in a situation”; IE, you try and extract a policy rather than a utility. That’s a little like approval-direction in flavor. A big problem: like my “human hypothesis evaluation” above, it would require the AI to construct human-understandable explanations of its potential cognitive states to the human. (“What action do I take if I’m thinking all these things?”)
What other concepts do we need to refactor? Maybe knowledge?