Hey, thanks for posting this!
And I apologise—I seem to have again failed to communicate what we’re doing here :-(
“Get the AI to ask for labels on ambiguous data”
Having the AI ask is a minor aspect of our current methods, one that I’ve repeatedly tried to de-emphasise (though it does turn out to have an unexpected connection with interpretability). What we’re trying to do is:
Get the AI to generate candidate extrapolations of its reward data that include human-survivable candidates.
Select among these candidates to get a human-survivable ultimate reward function.
Possible selection processes include: being conservative (see here for how that might work: https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy ); asking humans and then extrapolating what the process of human-answering should idealise to (some initial thoughts on this here: https://www.lesswrong.com/posts/BeeirdrMXCPYZwgfj/the-blue-minimising-robot-and-model-splintering); and removing some of the candidates on syntactic grounds (e.g. wireheading, which I’ve written quite a bit about defining syntactically). There are some other approaches we’ve been considering, but they’re currently under-developed.
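To make the shape of that pipeline concrete, here is a minimal sketch in Python (not our implementation: looks_like_wireheading is a hypothetical placeholder for a syntactic filter, candidate extrapolations are simplified to maps from a policy to its value, and the selection rule shown is just the simplest conservative one, maximising worst-case value over the surviving candidates):

```python
from typing import Callable, Iterable, List

# A candidate extrapolation, simplified here to a map from a policy to its value.
CandidateReward = Callable[[str], float]

def looks_like_wireheading(candidate: CandidateReward) -> bool:
    """Hypothetical syntactic filter: reject candidates that reward tampering
    with the reward channel itself. Placeholder logic only."""
    return getattr(candidate, "rewards_channel_tampering", False)

def select_conservatively(candidates: Iterable[CandidateReward],
                          policies: List[str]) -> str:
    """Keep the candidates that survive the syntactic filter, then pick the
    policy whose worst-case value across the survivors is highest."""
    survivors = [c for c in candidates if not looks_like_wireheading(c)]
    if not survivors:
        raise ValueError("no candidate reward extrapolations survived filtering")
    return max(policies, key=lambda pi: min(c(pi) for c in survivors))

# Toy usage: two extrapolations that roughly agree on familiar options but
# disagree wildly on an off-distribution one ("c").
candidates = [lambda pi: {"a": 1.0, "b": 0.9, "c": -5.0}[pi],
              lambda pi: {"a": 0.2, "b": 0.8, "c": 9.0}[pi]]
print(select_conservatively(candidates, ["a", "b", "c"]))  # -> "b"
```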
But all those methods will fail if the AI can’t generate human-survivable extrapolations of its reward training data. That is what we are currently most focused on. And, given our current results on toy models and a recent literature review, my impression is that there has been almost no decent applicable research done in this area to date. Our current results on HappyFaces are a bit simplistic, but, depressingly, they seem to be the best in the world at reward-function extrapolation (and not just for image classification) :-(
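For the generation step itself, one generic recipe (sketched below purely for illustration; it is not the HappyFaces setup, and every name and hyperparameter is a placeholder) is to fit several reward models that all match the labelled reward data but are pushed to disagree on unlabelled, off-distribution inputs, so that the candidate set has some chance of containing a survivable extrapolation:

```python
import torch
import torch.nn as nn

def fit_candidate_extrapolations(x_labelled, r_labelled, x_unlabelled,
                                 n_candidates=4, diversity_weight=0.1,
                                 steps=2000, lr=1e-3):
    """Fit several small reward models that agree on the labelled
    (input, reward) pairs but are encouraged to disagree on unlabelled
    inputs, yielding a set of candidate extrapolations of the reward data."""
    dim = x_labelled.shape[1]
    models = [nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
              for _ in range(n_candidates)]
    opt = torch.optim.Adam([p for m in models for p in m.parameters()], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # 1) every candidate must fit the labelled reward data
        fit_loss = sum(((m(x_labelled).squeeze(-1) - r_labelled) ** 2).mean()
                       for m in models)
        # 2) candidates are rewarded for spreading out off-distribution
        #    (in practice this term needs bounding so it can't dominate)
        preds = torch.stack([m(x_unlabelled).squeeze(-1) for m in models])
        diversity = preds.var(dim=0).mean()
        (fit_loss - diversity_weight * diversity).backward()
        opt.step()
    return models
```

A selection step like the one sketched earlier would then run over the candidate models this returns.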
Thanks for writing this, Stuart.
(For context, the email quote from me used in the dialogue above was written in a different context)