That’s not necessarily a deal-breaker; we do expect corrigible agents to be inefficient in at least some ways. But it is something we’d like to avoid if possible, and I don’t have any argument that that particular sort of inefficiency is necessary for corrigible behavior.
The patch which I would first try is to add another subagent which does not care at all about what actions the full agent takes, and is just trying to make money on the full agent’s internal betting markets, using the original non-counterfacted world model. So that subagent will make the full agent’s epistemic probabilities sane.
… but then the question is whether that subagent induces button-influencing-behavior. I don’t yet have a good argument in either direction on that question.
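To make the mechanism concrete, here is a toy sketch of the patch, using a logarithmic market scoring rule (LMSR) as a stand-in for the "internal betting market" (the LMSR choice and all names here are my illustration, not part of the proposal). The key LMSR property: an expected-profit-maximizing trader moves the market price exactly to its own probability, so a subagent trading on the original non-counterfacted world model drags the full agent's prices back to sane values, no matter how badly the counterfacted model distorted them.

```python
import math

def lmsr_price(q_yes: float, q_no: float, b: float = 10.0) -> float:
    """Current market probability of YES under LMSR with liquidity b.
    price = exp(q_yes/b) / (exp(q_yes/b) + exp(q_no/b))."""
    e_yes = math.exp(q_yes / b)
    return e_yes / (e_yes + math.exp(q_no / b))

def trade_to_belief(q_yes: float, q_no: float, belief: float, b: float = 10.0):
    """An expected-profit-maximizing trader with probability `belief` buys YES
    shares until the LMSR price equals its belief. Since
    price = sigmoid((q_yes - q_no)/b), it targets q_yes - q_no = b * logit(belief)."""
    target = b * math.log(belief / (1.0 - belief))
    delta = target - (q_yes - q_no)
    return q_yes + delta, q_no  # buy `delta` YES shares

# Hypothetical question: "will the shutdown button be pressed?"
q_yes, q_no = 0.0, 0.0

# The counterfacted world model sets an insane prior, say P(press) = 0.999.
q_yes, q_no = trade_to_belief(q_yes, q_no, 0.999)

# The profit-seeking subagent trades using the original model's P(press) = 0.3;
# its trade moves the market price all the way back to the sane probability.
q_yes, q_no = trade_to_belief(q_yes, q_no, 0.3)
print(round(lmsr_price(q_yes, q_no), 3))  # 0.3
```

This only illustrates the epistemic half of the patch (the full agent's prices end up tracking the non-counterfacted model); it deliberately says nothing about whether the trading subagent's profits give it an incentive to influence the button, which is exactly the open question above.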