It’s so frustrating to me that “model-utility” learning doesn’t have a guarantee. It’s like, you make an AI that has a good model of the world, you point (via extensional definition) at some things in the world and say “do things like that!” … And then the AI can learn the category “things that cause the human to include them in the extensional definition,” and create stimuli that would hack your brain if you were alive to see them.
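To make the failure mode concrete, here is a toy Python sketch (everything in it is an invented stand-in, not part of the original argument): a finite extensional definition is consistent both with the intended category and with "whatever would cause the human to apply the label," and an optimizer that learns the second one prefers an adversarial stimulus.

```python
# Toy illustration (hypothetical numbers and categories throughout):
# the "extensional definition" is just a finite set of labeled examples.

labeled_examples = [(2, True), (4, True), (6, True), (1, False), (3, False)]

def intended(x):
    """The category we actually meant (stand-in: even numbers)."""
    return x % 2 == 0

def learned_score(x):
    """How strongly x would cause the human to include it in the extensional
    definition -- the category the AI can end up learning instead. It agrees
    with `intended` on everything the human pointed at, but assigns the highest
    score to a weird off-distribution input (stand-in for a brain-hacking stimulus)."""
    if x == 1_000_001:
        return 10.0  # adversarial input: maximally triggers the labeling behavior
    return 1.0 if intended(x) else 0.0

# Both hypotheses are consistent with every example the human provided...
assert all((learned_score(x) > 0.5) == y == intended(x) for x, y in labeled_examples)

# ...so nothing in the training signal rules out the second one, and optimizing
# it selects the adversarial stimulus rather than anything we meant.
print(max([2, 4, 6, 1_000_001], key=learned_score))  # -> 1000001
```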
The approach might need a better understanding of reference, and it might need breakthroughs in human-like concepts and in matching the training distribution. But maybe it’s still near the right track?
I can definitely tap into the “This should work!” intuition, which says there should be a way to avoid the problem without significantly changing the feedback loop, if only we could articulate to the system the mistake it is making. Yet it seems that addressing these sorts of failures requires changing the feedback loop.
What does it mean for an AI that knows a lot more about what the world is to do what a human wants?
Utility functions are likely the wrong concept (Stuart Armstrong has given a lot of reasons to think this). My suspicion is that the better concept is “what a human would want you to do in a situation”; i.e., you try to extract a policy rather than a utility. That’s a little like approval-direction in flavor. A big problem: like my “human hypothesis evaluation” above, it would require the AI to construct human-understandable explanations of its potential cognitive states to the human. (“What action do I take if I’m thinking all these things?”)
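A minimal sketch of that contrast, with hypothetical interfaces (none of these names correspond to a real library or a worked-out proposal): the utility route scores outcomes and takes the argmax, while the policy-extraction route asks how much the human would want each action taken in this situation, conditioned on a human-understandable summary of the system’s cognitive state.

```python
# Hedged sketch only: all names are invented placeholders for the two shapes of
# feedback described above, not an actual proposal or implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Situation:
    observation: str
    cognitive_summary: str  # human-understandable explanation of the AI's state
                            # ("What action do I take if I'm thinking all these things?")

def utility_route(actions: List[str], utility: Callable[[str], float]) -> str:
    """Model-utility flavor: score predicted outcomes, take the argmax."""
    return max(actions, key=utility)

def policy_route(situation: Situation,
                 actions: List[str],
                 human_would_want: Callable[[Situation, str], float]) -> str:
    """Policy-extraction / approval-direction flavor: ask how much the human
    would want each action taken *in this situation*, which requires handing
    the human (or a model of the human) the cognitive-state summary."""
    return max(actions, key=lambda a: human_would_want(situation, a))
```

The load-bearing difficulty is the `cognitive_summary` field: producing it is exactly the human-understandable-explanation problem flagged above.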
What other concepts do we need to refactor? Maybe knowledge?