The idea of an epistemic helper seems really interesting. The obvious problem is that now it’s incentivized to manipulate what predictions get made, and how those predictions turn out in the world.
Generally speaking, epistemic helpers / oracles seem dangerous unless they're not agents. By "agent" I mean a system that chooses its actions by planning ahead and picking them on the basis of their predicted consequences, or that is trained using an objective that's a function of the consequences of its actions. A "non-agent" in this sense just has to pick its actions based on something other than their predicted consequences, or be trained on an objective that's not a function of those consequences.
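To make the distinction concrete, here's a minimal toy sketch (purely my own illustration, with an invented `toy_world`; nothing here is meant as a real training setup). The first update's loss depends only on a fixed logged answer, so the gradient carries no information about how the output lands in the world. The second score is computed from what the world looks like after the prediction is published, so optimizing it rewards outputs that make the world easier to predict.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 1)  # toy "oracle": question features -> scalar prediction
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def nonagent_update(question, logged_answer):
    """'Non-agent' objective: loss is a function of a fixed (question, answer)
    pair from a logged dataset, not of any downstream consequence of the output."""
    pred = model(question)
    loss = (pred - logged_answer).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def toy_world(prediction):
    """Stand-in dynamics (assumption for illustration): the world drifts
    partway toward whatever was predicted, plus noise."""
    return 0.5 * prediction + torch.randn(1)

def agent_score(question):
    """'Agent' objective (schematic): the score depends on how the world turns
    out *after* the prediction is acted on, so it can be raised either by
    predicting better or by steering the world to be more predictable."""
    pred = model(question)
    outcome = toy_world(pred.detach())
    return -(outcome - pred).abs().item()

q, a = torch.randn(16), torch.randn(1)
print(nonagent_update(q, a), agent_score(q))
```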
I agree that if you score an oracle based on how accurate it is, then it is incentivized to steer the world towards states where easy questions get asked.
I think that in these considerations it matters how powerful we assume the agent to be. You made me realize that my post would have been more interesting if I had specified the scope and application area of the proposed approach more clearly. In many cases, making the world more predictable may be very difficult for the agent compared to helping the human predict the world better. In the short term, I think deploying an agentic oracle could be safe.