Paul explicitly writes that the oracle sees both observations and actions: ‘This oracle can be applied to arbitrary sequences of observations and actions […].’
I know; I’m asking how the oracle would have to work in practice. Presumably at some point we will want to actually run the “learning with catastrophes algorithm”, and it will need an oracle, and I’d like to know what needs to be true of the oracle.
This is also covered.
Indeed, my point with that sentence was that it sounds like we are only trying to avoid catastrophes that could have been foreseen, as opposed to literally all catastrophes as the post suggests, which is why the next sentence is:
In the latter case, it sounds like we are trying to train “robust corrigibility” as opposed to “never letting a catastrophe happen”.
“Never letting a catastrophe happen” would incentivize the agent to spend a lot of resources on foreseeing catastrophes and building capacity to ward them off. This would distract from the agent’s main task. So we have to give the agent some slack. Is this what you’re getting at? The oracle needs to decide whether or not the agent can be held accountable for a catastrophe, but the article doesn’t say anything about how it would do this?
The oracle needs to decide whether or not the agent can be held accountable for a catastrophe, but the article doesn’t say anything about how it would do this?
Yes, basically. I’m not saying the article should specify how the oracle should do this, I’m saying that it should flag this as a necessary property of the oracle (or argue why it is not a necessary property).
I agree.