I’m not sure how necessary it is to explicitly aim to avoid catastrophic behavior—it seems that even a low capability corrigible agent would still know enough to avoid catastrophic behavior in practice. Of course, it would be better to have stronger guarantees against catastrophic behavior, so I certainly support research on learning with catastrophes—but if it turns out to be too hard, or to impose too much overhead, it could still be fine to aim for corrigibility alone.
I do want to make a perhaps obvious note: the assumption that “there are some policies such that no matter what nature does, the resulting transcript is never catastrophic” is somewhat strong. In particular, it precludes the following scenario: the environment can do anything computable, and the oracle evaluates behavior only based on outcomes (observations). In this case, for any observation that the oracle would label as catastrophic, there is an environment that outputs that observation regardless of the agent’s action. So for this problem to be solvable, we need to either have a limit on what the environment “could do”, or an oracle that judges “catastrophe” based on the agent’s action in addition to outcomes (which I suspect will cash out to “are the actions in this transcript knowably going to cause something bad to happen”). In the latter case, it sounds like we are trying to train “robust corrigibility” as opposed to “never letting a catastrophe happen”. Do you have a sense for which of these two assumptions you would want to make?
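As a toy illustration of why the outcomes-only version seems unsolvable (a minimal Python sketch with invented names, not anything from the post): if the oracle flags transcripts purely by their observations and nature is unrestricted, there is an environment that produces a flagged observation no matter which policy we pick.

```python
# Toy sketch: with an outcomes-only oracle and an unrestricted environment,
# no policy can guarantee a non-catastrophic transcript. All names invented.

CATASTROPHIC_OBS = "meltdown"  # an observation the oracle would flag

def outcome_only_oracle(observations, actions):
    # Judges the transcript by observations alone; actions are ignored.
    return CATASTROPHIC_OBS in observations

def adversarial_environment(actions_so_far):
    # Ignores the agent entirely and emits the flagged observation every step.
    return CATASTROPHIC_OBS

def run_episode(policy, environment, horizon=3):
    observations, actions = [], []
    for _ in range(horizon):
        actions.append(policy(observations))
        observations.append(environment(actions))
    return observations, actions

# Whatever policy we choose, this environment yields a catastrophic transcript,
# so the assumption "some policy is never catastrophic" fails.
obs, acts = run_episode(lambda observations: "do_nothing", adversarial_environment)
assert outcome_only_oracle(obs, acts)
```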
I’m not sure how necessary it is to explicitly aim to avoid catastrophic behavior—it seems that even a low capability corrigible agent would still know enough to avoid catastrophic behavior in practice.
Paul gave a bit more motivation here: (It’s a bit confusing that these two posts are reposted here out of order. ETA on 1/28/19: Strange, the date on that repost just changed to today’s date. Yesterday it was dated November 2018.)
If powerful ML systems fail catastrophically, they may be able to quickly cause irreversible damage. To be safe, it’s not enough to have an average-case performance guarantee on the training distribution — we need to ensure that even if our systems fail on new distributions or with small probability, they will never fail too badly.
My interpretation of this is that learning with catastrophes / optimizing worst-case performance (I believe these are referring to the same thing, which is also confusing) is needed to train an agent that can be called corrigible in the first place. Without it, we could end up with an agent that looks corrigible on the training distribution, but would do something malign (“applies its intelligence in the service of an unintended goal”) after deployment.
Yeah, that makes sense, and the distinction between benign and malign failures in that post seems right. It now makes much more sense to me that learning with catastrophes is necessary for corrigibility.
In particular, it precludes the following scenario: the environment can do anything computable, and the oracle evaluates behavior only based on outcomes (observations).
Paul explicitly writes that the oracle sees both observations and actions: ‘This oracle can be applied to arbitrary sequences of observations and actions […].’
or an oracle that judges “catastrophe” based on the agent’s action in addition to outcomes (which I suspect will cash out to “are the actions in this transcript knowably going to cause something bad to happen”)
This is also covered:
Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions:
The agent made a catastrophically bad decision.
The agent’s observations are plausible: we have a right to expect the agent to be able to handle those observations.
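Read as an interface, those two conditions suggest an oracle roughly like the following sketch (invented placeholder predicates, not Paul’s specification): a transcript is only flagged when the decision was catastrophically bad and the observations were ones the agent could reasonably be expected to handle.

```python
# Sketch of an oracle matching the two quoted conditions. The two helper
# predicates are placeholders; specifying them is exactly the hard part.

def made_catastrophically_bad_decision(observations, actions):
    # Placeholder: were the actions knowably going to cause something bad?
    return "press_big_red_button" in actions

def observations_are_plausible(observations):
    # Placeholder: did the agent have a right to expect to handle these inputs?
    return "adversarially_corrupted_input" not in observations

def catastrophe_oracle(observations, actions):
    # Mark the transcript catastrophic only if BOTH conditions hold.
    return (made_catastrophically_bad_decision(observations, actions)
            and observations_are_plausible(observations))
```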
Paul explicitly writes that the oracle sees both observations and actions: ‘This oracle can be applied to arbitrary sequences of observations and actions […].’
I know; I’m asking how the oracle would have to work in practice. Presumably at some point we will want to actually run the “learning with catastrophes algorithm”, and it will need an oracle, and I’d like to know what needs to be true of the oracle.
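For concreteness, here is a rough sketch of the kind of adversarial loop that “learning with catastrophes” seems to call for (all names invented; a stand-in, not the post’s algorithm). It at least pins down where the oracle is consulted: it has to return a verdict on a full (observations, actions) transcript proposed by an adversary.

```python
import random

# Stand-ins for the pieces a "learning with catastrophes" loop would need.

def oracle(observations, actions):
    # Must judge the whole transcript, not just the outcomes.
    return "proceed" in actions and "alarm" in observations

def policy(observations_so_far):
    # The (fixed, toy) policy being evaluated.
    return "wait" if observations_so_far[-1] == "alarm" else "proceed"

def adversary(horizon=5):
    # Proposes observation sequences intended to make the policy fail.
    return [random.choice(["alarm", "all_clear"]) for _ in range(horizon)]

def search_for_catastrophes(trials=100):
    flagged = []
    for _ in range(trials):
        observations = adversary()
        actions = [policy(observations[: t + 1]) for t in range(len(observations))]
        if oracle(observations, actions):
            # In the real scheme the policy would be penalized or retrained here;
            # the point is that this step needs an oracle verdict on the transcript.
            flagged.append((observations, actions))
    return flagged
```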
This is also covered
Indeed, my point with that sentence was that it sounds like we are only trying to avoid catastrophes that could have been foreseen, as opposed to literally all catastrophes as the post suggests, which is why the next sentence is:
In the latter case, it sounds like we are trying to train “robust corrigibility” as opposed to “never letting a catastrophe happen”.
“never letting a catastrophe happen” would incentivize the agent to spend a lot of resources on foreseeing catastrophes and building capacity to ward them off. This would distract from the agent’s main task. So we have to give the agent some slack. Is this what you’re getting at? The oracle needs to decide whether or not the agent can be held accountable for a catastrophe, but the article doesn’t say anything about how it would do this?
The oracle needs to decide whether or not the agent can be held accountable for a catastrophe, but the article doesn’t say anything about how it would do this?
Yes, basically. I’m not saying the article should specify how the oracle should do this; I’m saying that it should flag this as a necessary property of the oracle (or argue why it is not a necessary property).
I agree.