“This” is “upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline”, and it’s what I mean by “ungracefully failing if the protocol stops being followed at any one point in time”.
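For concreteness, a rough sketch of the stepwise comparison I take this to describe (my notation; the full AUP definition adds scaling and other machinery omitted here): at each step, the agent is penalised according to how much the action it takes would shift its attainable utilities relative to doing nothing from the current state,

$$\text{Penalty}(s, a) \;=\; \sum_{u_i \in \mathcal{U}} \bigl|\, Q_{u_i}(s, a) - Q_{u_i}(s, \varnothing) \,\bigr|,$$

where $\varnothing$ is the no-op and the $u_i$ are the auxiliary utility functions. Because the baseline is inaction from whatever state the agent has actually reached, an impact that has already occurred is folded into the baseline rather than counted against later actions.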
Huh? So if the safety measure stops working for some reason, it’s no longer safe? But if it does make a mistake, it’s more inclined to allow us to shut it down. Compare this to an offsetting approach, where it can keep doing and undoing things to an arbitrarily large degree.
The AUP agent does a big thing and bites the penalty. If that big thing was bad, we shut it down. Why would you instead prefer that it keep doing things to make up for it, when its model wasn’t even good enough to predict we wouldn’t like it?
This feels like an odd standard, where you say “but maybe it randomly fails and then doesn’t work”, or “it can’t anticipate things it doesn’t know about”. While these are problems, they aren’t problems for low impact to resolve, though the approach happens to help anyway.
AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent’s ability to achieve a wide variety of goals.
This is true. It depends what the scale is—I had “remote local disaster” in mind, while you maybe had x-risk.
[Note that we could Bayes-update off of its canary in general, if we trust its model to an extent. This also deserves exploration as a binary “extinction?” oracle, with the sequential deployment of agents allowing mitigation of specific model flaws.]
Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which is not an assumption I took you to be making.
We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny (although this is possible; I have a rough intuition that the strength of approval will override the fairly high likelihood of getting away with it). Making yourself purposefully opaque seems convergent.
This feels like an odd standard, where you say “but maybe it randomly fails and then doesn’t work”, or “it can’t anticipate things it doesn’t know about”.
I want to point to the difference between behavioural cloning and reward methods for the problem of learning locomotion for robots. Behavioural cloning is where you learn what a human will do in any situation and act that way, while reward methods take a reward function (either learned or specified) that encourages locomotion and learn to maximise that reward function. An issue with behavioural cloning is that it’s unstable: if you get what the human would do slightly wrong, then you move to a state the human is less likely to be in, so your model gets worse, so you’re more likely to act incorrectly (both in the sense of a higher probability of incorrect actions and of more extremely incorrect actions), and so you go to more unusual states, etc. In contrast, reward methods promise to be more stable, since the Q-values generated by the reward function tend to be more valid even in unusual states. This is the story that I’ve heard for why behavioural cloning techniques are less prominent[*] than reward methods. In general, it’s bad if your machine learning technique amplifies rather than mitigates errors, either during training or during execution.
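To put a rough number on the instability (this is just the usual back-of-the-envelope argument, not anything specific to this thread): suppose the cloned policy makes a mistake with probability at most $\varepsilon$ per step in states the demonstrator visits, and that after a mistake it may end up somewhere its model is bad and fail to recover. A first mistake can happen at any of the $T$ steps and can then cost something on every remaining step, so the expected cost of the cloned policy over a horizon of $T$ steps scales like

$$O(\varepsilon T^2),$$

whereas a policy trained against a reward function that stays meaningful off-distribution can recover after a mistake, giving a cost more like $O(\varepsilon T)$. This quadratic-versus-linear gap is the “amplifies rather than mitigates” point.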
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them. This is in contrast to methods that measure divergence from the starting state, or from what the world would be like had the agent only performed no-ops after the starting state, which tend to mitigate these ‘errors’. At any rate, even if no other method mitigated these ‘errors’, I would still want them mitigated.
It depends what the scale is—I had “remote local disaster” in mind, while you maybe had x-risk.
I wasn’t necessarily imagining x-risk, but maybe something like an earthquake along the San Andreas fault, disrupting the San Franciscan engineers who would be supervising the agents.
We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny.
My impression is that most machine learning systems are extremely opaque to currently available analysis tools in the relevant fashion. I think that work to alleviate this opacity is extremely important, but not something that I would assume without mentioning it.
[*] Work is in fact done on behavioural cloning today, but with attempts to increase its stability.
Perhaps we could have it recalculate past impacts? It seems like that could maybe lead to it regaining the ability to act, which could also be negative.
Edit:
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them.
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess. If its model is probably still incorrect, then even if it had a direction in which it thought it should mitigate, why should we expect this second attempt to be correct? I presently disagree that agent mitigation is the desirable behavior after model errors.
Perhaps we could have it recalculate past impacts?
Yeah, I have a sense that having the penalty compare the actual history and action against the plan of no-ops since birth will resolve this issue.
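Roughly, and in my own notation rather than anything from the post: instead of comparing each action against a no-op from the current state, this would compare the state the agent has actually reached against the state that would have obtained had it only performed no-ops since deployment,

$$\text{Penalty}_t \;=\; \sum_{u_i \in \mathcal{U}} \bigl|\, V_{u_i}(s_t^{\text{actual}}) - V_{u_i}(s_t^{\text{inaction}}) \,\bigr|,$$

where $s_t^{\text{inaction}}$ is the counterfactual state under no-ops since birth. Under that baseline, divergence caused by the agent’s own past actions keeps showing up in the penalty instead of being absorbed into a moving baseline, which is why I’d expect it to help with the preservation point above.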
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess.
I agree that if it infers that it did something bad because humans are now moving to shut it down, it should probably just do nothing and let us fix things up. However, it might be a while until the humans move to shut it down, if they don’t understand what’s happened. In this scenario, I think you would still see the preservation of ‘errors’, in the sense of the agent’s future under no-ops differing from ‘normality’.
If ‘errors’ happen due to a mismatch between the model and reality, I agree that the agent shouldn’t try to fix them with the bits of the model that are broken. However, I just don’t think that that describes many of the things that cause ‘errors’: those can be foreseen natural events (e.g. a San Andreas earthquake if you’re good at predicting earthquakes), unlikely but possible natural events (e.g. a San Andreas earthquake if you’re not good at predicting earthquakes), or unlikely consequences of actions. In these situations, agent mitigation still seems like the right approach to me.