An Oracle is a design for potentially high-powered artificial intelligences (AIs), in which the AI is made safe by restricting it to only answering questions. Unfortunately, most designs motivate the Oracle to manipulate humans with the contents of its answers, and Oracles of potentially high intelligence might be very successful at this. Solving that problem, without compromising the accuracy of the answers, is tricky. This paper reduces the issue to a cryptographic-style problem: Alice must ensure that her Oracle answers her questions while not providing key information to an eavesdropping Eve. Two Oracle designs solve this problem, one counterfactual (the Oracle answers as if it expected its answer never to be read) and one on-policy, but limited in the quantity of information it can transmit.
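The counterfactual design mentioned above can be illustrated with a minimal Python sketch. This is only an informal toy model of the idea, not the paper's formal setup: the episode structure, the `erasure_prob` parameter, and the automated grader `score_answer` are all hypothetical names introduced here for illustration. The key feature is that the Oracle's answer is only rewarded in "erasure" episodes where no human ever reads it, so a reward-maximizing Oracle gains nothing from crafting manipulative answers.

```python
import random

def counterfactual_oracle_episode(oracle_answer, score_answer, erasure_prob=0.01):
    """One toy episode of a counterfactual Oracle protocol (hypothetical sketch).

    The answer earns reward only in 'erasure' episodes, where it is scored
    by an automated process and never shown to a human. In all other
    episodes the reward is a constant (0 here), so the Oracle's expected
    reward is independent of any effect its answer has on human readers.
    `score_answer` stands in for an automated grader that can verify the
    answer without human involvement.
    """
    erased = random.random() < erasure_prob
    if erased:
        # Answer is scored automatically; no human ever reads it.
        reward = score_answer(oracle_answer)
        human_sees_answer = False
    else:
        # Answer is read by the human, but earns a fixed reward,
        # removing any incentive to manipulate the reader.
        reward = 0.0
        human_sees_answer = True
    return reward, human_sees_answer
```

In this toy framing, the Oracle optimizes its answer purely for the automated grader in the counterfactual (erased) branch, which is why it "answers as if it expected its answer to never be read."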
Good and Safe use of AI Oracles: https://arxiv.org/abs/1711.05541
Very interesting work!
You might like my new post, which explains why I think only Oracles can resolve Causal Goodhart-like issues: https://www.lesserwrong.com/posts/iK2F9QDZvwWinsBYB/non-adversarial-goodhart-and-ai-risks
I’m unsure whether the problems addressed in your paper are sufficient to resolve the causal Goodhart concerns; I need to think much more about the way the reward function is defined, but it seems they might not be. This question is really important for the follow-on work on adversarial Goodhart, and I’m still trying to figure out how to characterize which metrics / reward functions are and are not susceptible to corruption in this way. Perhaps a cryptographic approach could solve parts of the problem.
Accepted!