Hey, thanks for posting this!
And I apologise—I seem to have again failed to communicate what we’re doing here :-(
“Get the AI to ask for labels on ambiguous data”
Having the AI ask is a minor aspect of our current methods, one that I’ve repeatedly tried to de-emphasise (though it does turn out to have an unexpected connection with interpretability). What we’re trying to do is:
Get the AI to generate candidate extrapolations of its reward data that include human-survivable candidates.
Select among these candidates to get a human-survivable ultimate reward function.
Possible selection processes include: being conservative (see here for how that might work: https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy ); asking humans and then extrapolating what the process of human-answering should idealise to (some initial thoughts on this here: https://www.lesswrong.com/posts/BeeirdrMXCPYZwgfj/the-blue-minimising-robot-and-model-splintering); and removing some of the candidates on syntactic grounds (e.g. wireheading, which I’ve written quite a bit about defining syntactically). There are some other approaches we’ve been considering, but they’re currently under-developed.
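To make the shape of that pipeline concrete, here is a minimal sketch in Python (not our implementation: looks_like_wireheading is a hypothetical placeholder for a syntactic filter, candidate extrapolations are simplified to maps from a policy to its value, and the selection rule shown is just the simplest conservative one, maximising worst-case value over the surviving candidates):

```python
from typing import Callable, Iterable, List

# A candidate extrapolation, simplified here to a map from a policy to its value.
CandidateReward = Callable[[str], float]

def looks_like_wireheading(candidate: CandidateReward) -> bool:
    """Hypothetical syntactic filter: reject candidates that reward tampering
    with the reward channel itself. Placeholder logic only."""
    return getattr(candidate, "rewards_channel_tampering", False)

def select_conservatively(candidates: Iterable[CandidateReward],
                          policies: List[str]) -> str:
    """Keep the candidates that survive the syntactic filter, then pick the
    policy whose worst-case value across the survivors is highest."""
    survivors = [c for c in candidates if not looks_like_wireheading(c)]
    if not survivors:
        raise ValueError("no candidate reward extrapolations survived filtering")
    return max(policies, key=lambda pi: min(c(pi) for c in survivors))

# Toy usage: two extrapolations that roughly agree on familiar options but
# disagree wildly on an off-distribution one ("c").
candidates = [lambda pi: {"a": 1.0, "b": 0.9, "c": -5.0}[pi],
              lambda pi: {"a": 0.2, "b": 0.8, "c": 9.0}[pi]]
print(select_conservatively(candidates, ["a", "b", "c"]))  # -> "b"
```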
But all those methods will fail if the AI can’t generate human-survivable extrapolations of its reward training data. That is what we are currently most focused on. And, given our current results on toy models and a recent literature review, my impression is that there has been almost no decent applicable research done in this area to date. Our current results on HappyFaces are a bit simplistic, but, depressingly, they seem to be the best in the world at reward-function extrapolation (and not just for image classification) :-(
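For the generation step itself, one generic recipe (sketched below purely for illustration; it is not the HappyFaces setup, and every name and hyperparameter is a placeholder) is to fit several reward models that all match the labelled reward data but are pushed to disagree on unlabelled, off-distribution inputs, so that the candidate set has some chance of containing a survivable extrapolation:

```python
import torch
import torch.nn as nn

def fit_candidate_extrapolations(x_labelled, r_labelled, x_unlabelled,
                                 n_candidates=4, diversity_weight=0.1,
                                 steps=2000, lr=1e-3):
    """Fit several small reward models that agree on the labelled
    (input, reward) pairs but are encouraged to disagree on unlabelled
    inputs, yielding a set of candidate extrapolations of the reward data."""
    dim = x_labelled.shape[1]
    models = [nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
              for _ in range(n_candidates)]
    opt = torch.optim.Adam([p for m in models for p in m.parameters()], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # 1) every candidate must fit the labelled reward data
        fit_loss = sum(((m(x_labelled).squeeze(-1) - r_labelled) ** 2).mean()
                       for m in models)
        # 2) candidates are rewarded for spreading out off-distribution
        #    (in practice this term needs bounding so it can't dominate)
        preds = torch.stack([m(x_unlabelled).squeeze(-1) for m in models])
        diversity = preds.var(dim=0).mean()
        (fit_loss - diversity_weight * diversity).backward()
        opt.step()
    return models
```

A selection step like the one sketched earlier would then run over the candidate models this returns.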
Thanks for writing this, Stuart.
(For context, the email quote from me used in the dialogue above was written in a different context)