It seems like the main problem is making sure nobody’s getting systematically misled. To help humans make the right updates, the AI has to communicate not only accurate results, but well-calibrated uncertainties. It also has to interact with humans in a way that doesn’t send the wrong signals (more a problem to do with humans than to do with AI).
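For concreteness, "well-calibrated" here just means that when the AI reports 70% confidence, it should turn out to be right about 70% of the time. A minimal sketch of how one might check that from a log of (reported confidence, outcome) pairs; the binning scheme and the example numbers are mine, not anything from the systems being discussed:

```python
# Sketch: estimate calibration from a log of (reported_confidence, was_correct) pairs.
# Assumes binary outcomes; bin edges and the example data are illustrative only.

def calibration_report(predictions, num_bins=10):
    """predictions: list of (confidence in [0, 1], outcome in {0, 1}) pairs."""
    bins = [[] for _ in range(num_bins)]
    for confidence, outcome in predictions:
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, outcome))

    report = []
    for contents in bins:
        if not contents:
            continue
        mean_confidence = sum(c for c, _ in contents) / len(contents)
        accuracy = sum(o for _, o in contents) / len(contents)
        # A well-calibrated advisor has mean_confidence ~= accuracy in every bin.
        report.append((mean_confidence, accuracy, len(contents)))
    return report

# Example: an overconfident advisor reports 0.9 but is right only ~60% of the time.
log = [(0.9, 1), (0.9, 0), (0.9, 1), (0.9, 0), (0.9, 1)]
for mean_conf, acc, n in calibration_report(log):
    print(f"reported {mean_conf:.2f}, actually correct {acc:.2f} (n={n})")
```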
This problem is very much on the near-term side of the near/long-term AI safety work dichotomy. We don't need the AI to understand deception as a category, and why it's bad, so that it can make plans that don't involve deceiving us. We just need its training / search process (which we expect to more or less understand) to suppress incentives for deception to an acceptable level, on a limited domain of everyday problems.
(I'm probably a bigger believer in the significance of this dichotomy than most. I think looking at an AI's behavior and then tinkering with the training procedure to eliminate undesired behavior in the training domain is a perfectly good approach to handling near-term misalignment like overconfident advisor-chatbots, but eventually we want to switch over to a more scalable approach that will use few of the same tools.)
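To make "tinkering with the training procedure" slightly more concrete: for an overconfident advisor-chatbot, one near-term-style fix is just to add a penalty term for overconfidence to whatever loss the model is already trained on. This is a hypothetical sketch of that move, not a description of any actual system; the penalty shape and weight are invented:

```python
import torch
import torch.nn.functional as F

def loss_with_calibration_penalty(logits, labels, penalty_weight=0.1):
    """Hypothetical tweak: standard cross-entropy plus a penalty when the
    model's average stated confidence drifts above its average accuracy."""
    base_loss = F.cross_entropy(logits, labels)

    probs = F.softmax(logits, dim=-1)
    confidence, predictions = probs.max(dim=-1)
    accuracy = (predictions == labels).float()

    # Penalize only overconfidence: the positive gap between mean confidence
    # and mean accuracy on this batch.
    overconfidence = torch.clamp(confidence.mean() - accuracy.mean(), min=0.0)
    return base_loss + penalty_weight * overconfidence
```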
I agree well-calibrated uncertainties are quite valuable, but I'm not convinced they are essential for this sort of application. For example, suppose my assistant tells me a story about how my proposed FAI could fail. If my assistant is overconfident in its pessimism, then the worst case is that I spend a lot of time thinking about the failure mode without seeing how it could happen (not that bad). If my assistant is underconfident, and tells me a failure mode is 5% likely when it's really 95% likely, it still feels like my assistant is being overall helpful if the failure case is one I wasn't previously aware of. To put it another way, if my assistant isn't calibrated, it feels like I should just be able to ignore its probability estimates and still get good use out of it.
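To put rough (made-up) numbers on that intuition: suppose I ignore the assistant's stated probability entirely and simply investigate any failure mode it names. As long as the failure mode is one I didn't already know about, I still come out ahead:

```python
# Toy expected-cost comparison, with made-up numbers, illustrating why an
# uncalibrated assistant can still be useful if I ignore its estimates.

P_FAILURE = 0.95             # true probability the named failure mode bites
COST_OF_FAILURE = 1000.0     # cost if it bites and I never looked into it
COST_OF_INVESTIGATING = 5.0  # cost of spending time on the failure mode
P_FIX_IF_INVESTIGATED = 0.9  # chance that investigating lets me avert it

# Policy A: the failure mode was never brought to my attention, so I don't investigate.
expected_cost_ignore = P_FAILURE * COST_OF_FAILURE

# Policy B: the assistant mentioned it (even with a badly wrong "5%" estimate),
# and I investigate regardless of the stated probability.
expected_cost_investigate = (
    COST_OF_INVESTIGATING
    + P_FAILURE * (1 - P_FIX_IF_INVESTIGATED) * COST_OF_FAILURE
)

print(f"never told: expected cost {expected_cost_ignore:.1f}")
print(f"told and investigated: expected cost {expected_cost_investigate:.1f}")
# The gain comes entirely from learning that the failure mode exists,
# not from the accuracy of the probability attached to it.
```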
but eventually we want to switch over to a more scalable approach that will use few of the same tools.
I actually think the advisor approach might be scalable, if advisor_1 has been hand-verified, and advisor_1 verifies advisor_2, who verifies advisor_3, etc.
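A schematic sketch of that bootstrapping structure, just to pin down its shape; the Advisor class and run_verification_checks are invented names, and what verification would actually consist of is exactly the open question:

```python
# Sketch of the chained-verification idea: a hand-verified advisor_1 verifies
# advisor_2, who verifies advisor_3, and so on. Everything here is schematic;
# in particular, run_verification_checks stands in for whatever real verification means.

class Advisor:
    def __init__(self, name, trusted=False):
        self.name = name
        self.trusted = trusted  # advisor_1 starts trusted via hand-verification

    def verify(self, candidate):
        """Only an already-trusted advisor can extend trust to the next one."""
        if not self.trusted:
            raise RuntimeError(f"{self.name} is not trusted, so it cannot verify anyone")
        passes_checks = run_verification_checks(candidate)  # placeholder for the hard part
        candidate.trusted = passes_checks
        return passes_checks

def run_verification_checks(candidate):
    # Placeholder: the substantive question is what checks would be sufficient here.
    return True

# advisor_1 is hand-verified; trust then propagates down the chain.
advisor_1 = Advisor("advisor_1", trusted=True)
advisor_2 = Advisor("advisor_2")
advisor_3 = Advisor("advisor_3")

advisor_1.verify(advisor_2)
advisor_2.verify(advisor_3)
print([a.trusted for a in (advisor_1, advisor_2, advisor_3)])  # [True, True, True]
```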