I don’t have a better solution right now, but one problem to note is that this agent will bet strongly that the button’s state is independent of whether the human presses it. So it could lose money to a different agent which thinks the two are correlated, as they in fact are.
That’s not necessarily a deal-breaker; we do expect corrigible agents to be inefficient in at least some ways. But it is something we’d like to avoid if possible, and I don’t have any argument that this particular sort of inefficiency is necessary for corrigible behavior.
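To make the money-losing concern concrete, here is a minimal sketch of one bet. All of the numbers are hypothetical, and it assumes (for illustration only) that the counterfacted agent keeps the correct marginal probabilities but treats the button state as independent of the press. The contract pays 1 if the human presses the button and the button ends up on.

```python
# Hypothetical probabilities, chosen only for illustration.
P_PRESS = 0.3               # chance the human presses the button
P_ON_GIVEN_PRESS = 0.95     # true chance the button is on, given a press
P_ON_GIVEN_NO_PRESS = 0.01  # true chance the button is on, given no press


def true_joint():
    """P(press and on) under the original, correlated world model."""
    return P_PRESS * P_ON_GIVEN_PRESS


def independent_joint():
    """P(press and on) if the two events are (wrongly) treated as independent."""
    p_on = P_PRESS * P_ON_GIVEN_PRESS + (1 - P_PRESS) * P_ON_GIVEN_NO_PRESS
    return P_PRESS * p_on


def expected_loss_per_contract(price):
    """Seller's expected loss on a contract paying 1 if (press and on)."""
    return true_joint() - price


if __name__ == "__main__":
    price = independent_joint()  # the price the independence-believing agent calls fair
    print("price offered by independent agent:", price)
    print("true probability of payout:", true_joint())
    print("expected loss per contract sold:", expected_loss_per_contract(price))
```

Any agent using the correlated model would happily buy this contract at the offered price, and the seller loses in expectation on every contract sold, as long as the true conditional probability exceeds the marginal.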
The patch which I would first try is to add another subagent which does not care at all about what actions the full agent takes, and is just trying to make money on the full agent’s internal betting markets, using the original non-counterfacted world model. So that subagent will make the full agent’s epistemic probabilities sane.
… but then the question is whether that subagent induces button-influencing-behavior. I don’t yet have a good argument in either direction on that question.