This feels like an odd standard, where you say “but maybe it randomly fails and then doesn’t work”, or “it can’t anticipate things it doesn’t know about”.
I want to point to the difference between behavioural cloning and reward methods for the problem of learning locomotion for robots. Behavioural cloning is where you learn what a human would do in any situation and act that way, while reward methods take a reward function (either learned or specified) that encourages locomotion and learn to maximise that reward function. An issue with behavioural cloning is that it’s unstable: if you get what the human would do slightly wrong, you move to a state the human is less likely to be in, so your model gets worse, so you’re more likely to act incorrectly (both in the sense of “higher probability of incorrect actions” and “higher probability of more extremely incorrect actions”), and so you move to even more unusual states, and so on. In contrast, reward methods promise to be more stable, since the Q-values generated by the reward function tend to be more valid even in unusual states. This is the story I’ve heard for why behavioural cloning techniques are less prominent[*] than reward methods. In general, it’s bad if your machine learning technique amplifies rather than mitigates errors, either during training or during execution.
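To make the instability concrete, here’s a minimal toy sketch in Python. Everything in it is invented for illustration: a one-dimensional ‘state’ that the expert keeps near zero, a cloned policy that is only accurate on the demonstrated states, and a reward-based policy that greedily maximises a one-step Q-value for the reward r(x) = -x². It is not a real locomotion learner, just the compounding-error dynamic:

```python
import numpy as np

rng = np.random.default_rng(0)

ACTIONS = np.linspace(-1.0, 1.0, 21)   # discretised action set


def cloned_policy(x):
    """Toy behavioural-cloning policy: accurate on the demonstrated states
    (|x| small), but extrapolating badly once the state drifts away, so that
    it pushes further from the demonstrations instead of back toward them."""
    if abs(x) < 0.3:
        return float(np.clip(-x + 0.2 * rng.standard_normal(), -1.0, 1.0))
    return float(np.clip(0.3 * np.sign(x) + 0.2 * rng.standard_normal(), -1.0, 1.0))


def reward_policy(x):
    """Toy reward-based policy: greedily maximise a one-step Q-value for the
    reward r(x) = -x**2, which still points back toward 0 in unusual states."""
    q_values = -(x + ACTIONS) ** 2
    return float(ACTIONS[np.argmax(q_values)])


def worst_state(policy, steps=100):
    """Roll out a policy under noisy dynamics and report the worst state reached."""
    x, worst = 0.0, 0.0
    for _ in range(steps):
        x = x + policy(x) + 0.05 * rng.standard_normal()   # noisy dynamics
        worst = max(worst, abs(x))
    return worst


print("worst |state|, behavioural cloning:", worst_state(cloned_policy))
print("worst |state|, reward-based policy:", worst_state(reward_policy))
```

Once noise pushes the cloned policy outside the demonstrated region, its own errors push it further out, so the worst state it visits ends up far larger than under the reward-based policy, which still ‘knows’ which direction is better in states it has never seen.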
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them. This is in contrast to methods that measure divergence from the starting state, or from what the world would have been like had the agent only performed no-ops after the starting state; those methods tend to mitigate these ‘errors’. At any rate, even if no other method mitigated these ‘errors’, I would still want them to be mitigated.
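To gesture at why the baseline matters, here’s a deliberately oversimplified sketch. It collapses everything a penalty might look at (attainable utilities, state features, etc.) into a single scalar ‘world feature’ whose ‘normal’ value is 0, and compares a penalty computed against what inaction right now would lead to with a penalty computed against the starting-state baseline. None of this is the actual AUP penalty; it’s only meant to show which way each baseline pushes after an external shock:

```python
def stepwise_penalty(f_if_act, f_if_noop_now):
    """Penalty relative to what doing nothing *from the current state* would
    lead to (roughly the flavour of baseline being discussed for AUP)."""
    return abs(f_if_act - f_if_noop_now)


def starting_state_penalty(f_if_act, f_at_start):
    """Penalty relative to the starting state (or to the no-ops-since-start
    counterfactual, which coincides with it in this static toy world)."""
    return abs(f_if_act - f_at_start)


# One scalar world feature: 0 is 'normality'; an external shock moved it to 3.
f_at_start = 0.0

# The agent can do nothing (the feature stays at 3) or partially undo the shock.
f_if_noop_now = 3.0
f_if_correct = 1.0

print("stepwise, noop:   ", stepwise_penalty(f_if_noop_now, f_if_noop_now))     # 0.0
print("stepwise, correct:", stepwise_penalty(f_if_correct, f_if_noop_now))      # 2.0 -> correcting is penalised
print("starting, noop:   ", starting_state_penalty(f_if_noop_now, f_at_start))  # 3.0
print("starting, correct:", starting_state_penalty(f_if_correct, f_at_start))   # 1.0 -> correcting reduces the penalty
```

Under the stepwise baseline, the corrective action is the thing that gets penalised, so the shock gets preserved; under the starting-state baseline, the corrective action reduces the penalty, so the shock tends to get mitigated.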
It depends what the scale is—I had “remote local disaster” in mind, while you maybe had x-risk.
I wasn’t necessarily imagining x-risk, but maybe something like an earthquake along the San Andreas fault, disrupting the San Franciscan engineers that would be supervising the agents.
We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny.
My impression is that most machine learning systems are extremely opaque to currently available analysis tools in the relevant fashion. I think that work to alleviate this opacity is extremely important, but not something that I would assume without mentioning it.
[*] Work is in fact done on behavioural cloning today, but with attempts to increase its stability.
Perhaps we could have it recalculate past impacts? It seems like that could maybe lead to it regaining the ability to act, which could also be negative.
Edit:
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them.
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess. If its model is probably still incorrect, even if it had a direction in which it thought it should mitigate, why should we expect this second attempt to be correct? I presently disagree that agent mitigation is the desirable behavior after model errors.
Perhaps we could have it recalculate past impacts?
Yeah, I have a sense that having the penalty compare the actual history plus the proposed action against the plan of no-ops since birth will resolve this issue.
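If I’m understanding the proposal correctly, a rough sketch of that penalty might look like the following. The names, the scalar stand-in for the world state, and the single value function (standing in for the attainable-utility terms) are all assumptions made for illustration, not the actual AUP machinery:

```python
from typing import Callable, Sequence

State = float   # stand-in for a full world state
Action = str    # "noop" or anything else


def penalty_vs_noops_since_birth(
    initial_state: State,
    actual_history: Sequence[Action],
    proposed_action: Action,
    step: Callable[[State, Action], State],
    value: Callable[[State], float],
) -> float:
    """Hypothetical penalty: compare the state reached via the *actual* history
    plus the proposed action against the state reached by doing nothing ever
    since the agent was switched on ('no-ops since birth')."""
    # Roll the model forward along the actual history, then the proposed action.
    s_actual = initial_state
    for a in actual_history:
        s_actual = step(s_actual, a)
    s_actual = step(s_actual, proposed_action)

    # Roll the model forward along an all-no-op plan of the same length.
    s_inaction = initial_state
    for _ in range(len(actual_history) + 1):
        s_inaction = step(s_inaction, "noop")

    return abs(value(s_actual) - value(s_inaction))


# Toy usage: the state is a scalar, "push" adds 1, "pull" subtracts 1.
step = lambda s, a: s + {"noop": 0.0, "push": 1.0, "pull": -1.0}[a]
value = lambda s: s

print(penalty_vs_noops_since_birth(0.0, ["push", "push"], "noop", step, value))  # 2.0
print(penalty_vs_noops_since_birth(0.0, ["push", "push"], "pull", step, value))  # 1.0 -> undoing past impact is cheaper
```

With this baseline, an action that moves the world back toward the no-ops-since-birth counterfactual lowers the penalty rather than incurring one, which is the mitigation behaviour being discussed above.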
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess.
I agree that if it infers that it did something bad because humans are now moving to shut it down, it should probably just do nothing and let us fix things up. However, it might be a while until the humans move to shut it down, if they don’t understand what’s happened. In that scenario, I think you would still see the preservation of ‘errors’, in the sense that the agent’s future under no-ops differs from ‘normality’.
If ‘errors’ happen due to a mismatch between the model and reality, I agree that the agent shouldn’t try to fix them with the bits of the model that are broken. However, I just don’t think that that describes many of the things that cause ‘errors’: those can be foreseen natural events (e.g. a San Andreas earthquake if you’re good at predicting earthquakes), unlikely but possible natural events (e.g. a San Andreas earthquake if you’re not good at predicting earthquakes), or unlikely consequences of actions. In these situations, agent mitigation still seems like the right approach to me.