It seems like the main problem is making sure nobody’s getting systematically misled. To help humans make the right updates, the AI has to communicate not only accurate results, but well-calibrated uncertainties. It also has to interact with humans in a way that doesn’t send the wrong signals (more a problem to do with humans than to do with AI).
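For concreteness, "well-calibrated" here just means that when the AI reports 70% confidence, it should turn out to be right about 70% of the time. A minimal sketch of how one might check that from a log of (reported confidence, outcome) pairs; the binning scheme and the example numbers are mine, not anything from the systems being discussed:

```python
# Sketch: estimate calibration from a log of (reported_confidence, was_correct) pairs.
# Assumes binary outcomes; bin edges and the example data are illustrative only.

def calibration_report(predictions, num_bins=10):
    """predictions: list of (confidence in [0, 1], outcome in {0, 1}) pairs."""
    bins = [[] for _ in range(num_bins)]
    for confidence, outcome in predictions:
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, outcome))

    report = []
    for contents in bins:
        if not contents:
            continue
        mean_confidence = sum(c for c, _ in contents) / len(contents)
        accuracy = sum(o for _, o in contents) / len(contents)
        # A well-calibrated advisor has mean_confidence ~= accuracy in every bin.
        report.append((mean_confidence, accuracy, len(contents)))
    return report

# Example: an overconfident advisor reports 0.9 but is right only ~60% of the time.
log = [(0.9, 1), (0.9, 0), (0.9, 1), (0.9, 0), (0.9, 1)]
for mean_conf, acc, n in calibration_report(log):
    print(f"reported {mean_conf:.2f}, actually correct {acc:.2f} (n={n})")
```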
This problem is very much on the near-term side of the near/long-term AI safety work dichotomy. We don't need the AI to understand deception as a category, and why it's bad, so that it can make plans that don't involve deceiving us. We just need its training / search process (which we expect to more or less understand) to suppress incentives for deception to an acceptable level, on a limited domain of everyday problems.
(I'm probably a bigger believer in the significance of this dichotomy than most. I think looking at an AI's behavior and then tinkering with the training procedure to eliminate undesired behavior in the training domain is a perfectly good approach to handling near-term misalignment like overconfident advisor-chatbots, but eventually we want to switch over to a more scalable approach that will use few of the same tools.)
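To make "tinkering with the training procedure" slightly more concrete: for an overconfident advisor-chatbot, one near-term-style fix is just to add a penalty term for overconfidence to whatever loss the model is already trained on. This is a hypothetical sketch of that move, not a description of any actual system; the penalty shape and weight are invented:

```python
import torch
import torch.nn.functional as F

def loss_with_calibration_penalty(logits, labels, penalty_weight=0.1):
    """Hypothetical tweak: standard cross-entropy plus a penalty when the
    model's average stated confidence drifts above its average accuracy."""
    base_loss = F.cross_entropy(logits, labels)

    probs = F.softmax(logits, dim=-1)
    confidence, predictions = probs.max(dim=-1)
    accuracy = (predictions == labels).float()

    # Penalize only overconfidence: the positive gap between mean confidence
    # and mean accuracy on this batch.
    overconfidence = torch.clamp(confidence.mean() - accuracy.mean(), min=0.0)
    return base_loss + penalty_weight * overconfidence
```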
I agree well-calibrated uncertainties are quite valuable, but I'm not convinced they are essential for this sort of application. For example, suppose my assistant tells me a story about how my proposed FAI could fail. If my assistant is overconfident in its pessimism, then the worst case is that I spend a lot of time thinking about the failure mode without seeing how it could happen (not that bad). If my assistant is underconfident, and tells me a failure mode is 5% likely when it's really 95% likely, it still feels like my assistant is being overall helpful if the failure case is one I wasn't previously aware of. To put it another way, if my assistant isn't calibrated, it feels like I should just be able to ignore its probability estimates and still get good use out of it.
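To put rough (made-up) numbers on that intuition: suppose I ignore the assistant's stated probability entirely and simply investigate any failure mode it names. As long as the failure mode is one I didn't already know about, I still come out ahead:

```python
# Toy expected-cost comparison, with made-up numbers, illustrating why an
# uncalibrated assistant can still be useful if I ignore its estimates.

P_FAILURE = 0.95             # true probability the named failure mode bites
COST_OF_FAILURE = 1000.0     # cost if it bites and I never looked into it
COST_OF_INVESTIGATING = 5.0  # cost of spending time on the failure mode
P_FIX_IF_INVESTIGATED = 0.9  # chance that investigating lets me avert it

# Policy A: the failure mode was never brought to my attention, so I don't investigate.
expected_cost_ignore = P_FAILURE * COST_OF_FAILURE

# Policy B: the assistant mentioned it (even with a badly wrong "5%" estimate),
# and I investigate regardless of the stated probability.
expected_cost_investigate = (
    COST_OF_INVESTIGATING
    + P_FAILURE * (1 - P_FIX_IF_INVESTIGATED) * COST_OF_FAILURE
)

print(f"never told: expected cost {expected_cost_ignore:.1f}")
print(f"told and investigated: expected cost {expected_cost_investigate:.1f}")
# The gain comes entirely from learning that the failure mode exists,
# not from the accuracy of the probability attached to it.
```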
but eventually we want to switch over to a more scalable approach that will use few of the same tools.
I actually think the advisor approach might be scalable, if advisor_1 has been hand-verified, and advisor_1 verifies advisor_2, who verifies advisor_3, etc.
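A schematic sketch of that bootstrapping structure, just to pin down its shape; the Advisor class and run_verification_checks are invented names, and what verification would actually consist of is exactly the open question:

```python
# Sketch of the chained-verification idea: a hand-verified advisor_1 verifies
# advisor_2, who verifies advisor_3, and so on. Everything here is schematic;
# in particular, run_verification_checks stands in for whatever real verification means.

class Advisor:
    def __init__(self, name, trusted=False):
        self.name = name
        self.trusted = trusted  # advisor_1 starts trusted via hand-verification

    def verify(self, candidate):
        """Only an already-trusted advisor can extend trust to the next one."""
        if not self.trusted:
            raise RuntimeError(f"{self.name} is not trusted, so it cannot verify anyone")
        passes_checks = run_verification_checks(candidate)  # placeholder for the hard part
        candidate.trusted = passes_checks
        return passes_checks

def run_verification_checks(candidate):
    # Placeholder: the substantive question is what checks would be sufficient here.
    return True

# advisor_1 is hand-verified; trust then propagates down the chain.
advisor_1 = Advisor("advisor_1", trusted=True)
advisor_2 = Advisor("advisor_2")
advisor_3 = Advisor("advisor_3")

advisor_1.verify(advisor_2)
advisor_2.verify(advisor_3)
print([a.trusted for a in (advisor_1, advisor_2, advisor_3)])  # [True, True, True]
```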