My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact)
I think it should be quite possible for us to de-sketchify the impact measure in the ways you pointed out. Up to now, I focused more on ensuring that there aren’t errors of the other type: where high impact plans sneak through as low impact. I’m currently not aware of any, although that isn’t to say they don’t exist.
Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic. I don’t think it unlikely that there exist better, cleaner formulations of what I provided. Perhaps they somehow don’t have the bothersome false positives you’ve pointed out. After all, compared to many folks in the community, I’m fairly mathematically inexperienced, and have only been working on this for a relatively short amount of time.
This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways
What is “this” here (for a)?
I’m inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time
But AUP’s plans are shutdown-safe? I think I misunderstand.
then it’s allowed to warn humans of a natural disaster iff it’s allowed to cause a natural disaster
I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters (unless it also wanted to save us from these, but it seems like this would only happen for higher impact levels and would be discouraged by approval incentives).
In general, I expect AUP to also work for disaster prevention, as long as its own survival isn’t affected. One complication is that we would have to allow it to remain on, even if it didn’t save us from disasters, but shut it off if it caused any. I think that’s pretty reasonable, as we expect our low impact agents to not do anything sometimes.
Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic.
To be frank, although I do like the fact that there’s a nice concrete candidate definition of impact, I am no more than twice as excited by it as by other candidate impact definitions, and would not say that it encapsulates what I think impact is.
… (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways
What is “this” here (for a)?
“This” is “upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline”, and it’s what I mean by “ungracefully failing if the protocol stops being followed at any one point in time”.
then it’s allowed to warn humans of a natural disaster iff it’s allowed to cause a natural disaster
I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters
Regarding whether AUP agents would prevent natural disasters: AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent’s ability to achieve a wide variety of goals.
Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn’t taking as an assumption that you were making.
Regarding the lack of incentive to cause disasters: in my head, the point of impact regularisation techniques is to stop agents from doing something crazy in cases where doing something crazy is an unforeseen convenient way for the agent to achieve its objective. As such, I consider it fair game to consider cases where there is an unforeseen incentive to do crazy things, if the argument generalises over a wide variety of craziness, which I think this one does sort of OK.
“This” is “upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline”, and it’s what I mean by “ungracefully failing if the protocol stops being followed at any one point in time”.
Huh? So if the safety measure stops working for some reason, it’s no longer safe? But if it does make a mistake, it’s more inclined to allow us to shut it down. Compare this to an offsetting approach, where it can keep doing and undoing things to an arbitrarily large degree.
Say an AUP agent does a big thing and bites the penalty. If that big thing was bad, we shut it down. Why would you instead prefer that it keep doing things to make up for it, when its model wasn’t even good enough to predict we wouldn’t like it?
This feels like an odd standard, where you say “but maybe it randomly fails and then doesn’t work”, or “it can’t anticipate things it doesn’t know about”. These are real problems, but they aren’t low impact’s job to resolve, and the approach happens to help with them anyway.
AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent’s ability to achieve a wide variety of goals.
This is true. It depends what the scale is—I had “remote local disaster” in mind, while you maybe had x-risk.
[Note that we could Bayes-update off of its canary in general, if we trust its model to an extent. This also deserves exploration as a binary “extinction?” oracle, with the sequential deployment of agents allowing mitigation of specific model flaws.]
Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn’t taking as an assumption that you were making.
We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny (although this is possible; I have a rough intuition that the strength of approval will override the fairly high likelihood of getting away with it). Making yourself purposefully opaque seems convergent.
This feels like an odd standard, where you say “but maybe it randomly fails and then doesn’t work”, or “it can’t anticipate things it doesn’t know about”.
I want to point to the difference between behavioural cloning and reward methods for the problem of learning locomotion for robots. Behavioural cloning is where you learn what a human will do in any situation and act that way, while reward methods take a reward function (either learned or specified) that encourages locomotion and learn to maximise that reward function. An issue with behavioural cloning is that it’s unstable: if you get what the human would do slightly wrong, then you move to a state the human is less likely to be in, so your model gets worse, so you’re more likely to act incorrectly (both in the sense of “higher probability of incorrect actions” and “more probability of more extremely incorrect answers”), and so you go to more unusual states, etc. In contrast, reward methods promise to be more stable, since the Q-values generated by the reward function tend to be more valid even in unusual states. This is the story that I’ve heard for why behavioural cloning techniques are less prominent[*] than reward methods. In general, it’s bad if your machine learning technique amplifies rather than mitigates errors, either during training or during execution.
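To make the instability story concrete, here is a toy sketch. It is not a real locomotion setup: the function `rollout_deviation` and its parameters `base_error`, `growth` and `correction` are hypothetical names I made up, and the scalar "deviation from the demonstrator's states" is a stand-in for how unusual the current state is. The only assumption is the one in the story above: the per-step error grows the further you are from familiar states, and a cloned policy has nothing pulling it back, whereas a reward-based policy gets some corrective pull.

```python
def rollout_deviation(steps, base_error, growth, correction):
    """Track how far a policy drifts from the demonstrator's states.

    base_error: error made per step even in familiar states (assumed constant)
    growth:     extra error per unit of deviation (the model worsens off-distribution)
    correction: fraction of the deviation removed per step (0.0 means no feedback)
    """
    deviation = 0.0
    history = []
    for _ in range(steps):
        # Errors get larger the more unusual the current state is.
        step_error = base_error + growth * deviation
        # Corrective feedback shrinks the existing deviation before the new error lands.
        deviation = (1.0 - correction) * deviation + step_error
        history.append(deviation)
    return history


# Hypothetical numbers, chosen only to show the qualitative difference.
cloning = rollout_deviation(steps=50, base_error=0.01, growth=0.1, correction=0.0)
reward_based = rollout_deviation(steps=50, base_error=0.01, growth=0.1, correction=0.3)

print(f"cloning drift after 50 steps:      {cloning[-1]:.3f}")
print(f"reward-based drift after 50 steps: {reward_based[-1]:.3f}")
```

With no correction the deviation feeds back into the error and grows geometrically. With even a modest correction it settles near a small fixed point. The numbers mean nothing in themselves; the point is only the amplify-versus-mitigate distinction.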
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them. This is in contrast to methods that measure divergence to the starting state, or what the world would be like given that the agent had only performed no-ops after the starting state, resulting in a tendency to mitigate these ‘errors’. At any rate, even if no other method mitigated these ‘errors’, I would still want them to.
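Here is a minimal sketch of the preserve-versus-mitigate contrast, under heavy simplifying assumptions. The world is a single integer, the agent has no task reward and simply picks the least-penalised action, and the two penalty functions are hypothetical scalar stand-ins I wrote for illustration. In particular, `stepwise_penalty` is not the actual AUP penalty (which is defined over attainable utilities); it only copies the feature under discussion, namely comparing against a no-op from the current state.

```python
ACTIONS = (-1, 0, 1)  # 0 is the no-op


def stepwise_penalty(state, action, start):
    # Baseline: what the world would look like after a no-op from the *current* state.
    return abs((state + action) - (state + 0))


def start_state_penalty(state, action, start):
    # Baseline: the state the agent started in.
    return abs((state + action) - start)


def run(penalty, steps=10, shock_at=3, shock=5):
    """Return the final state when the agent greedily minimises the given penalty."""
    start = 0
    state = start
    for t in range(steps):
        if t == shock_at:
            state += shock  # the 'error': the world departs from normality
        action = min(ACTIONS, key=lambda a: penalty(state, a, start))
        state += action
    return state


print("final state, stepwise baseline:   ", run(stepwise_penalty))
print("final state, start-state baseline:", run(start_state_penalty))
```

Under the stepwise baseline the no-op is always the cheapest action, so the shock stays in the world. Under the starting-state baseline the agent walks the state back towards where it began. Whether that walking back is desirable is exactly what is in dispute here; the sketch only shows that the two baselines come apart.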
It depends what the scale is—I had “remote local disaster” in mind, while you maybe had x-risk.
I wasn’t necessarily imagining x-risk, but maybe something like an earthquake along the San Andreas fault, disrupting the San Franciscan engineers that would be supervising the agents.
We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny.
My impression is that most machine learning systems are extremely opaque to currently available analysis tools in the relevant fashion. I think that work to alleviate this opacity is extremely important, but not something that I would assume without mentioning it.
[*] Work is in fact done on behavioural cloning today, but with attempts to increase its stability.
Perhaps we could have it recalculate past impacts? It seems like that could maybe lead to it regaining ability to act, which could also be negative.
Edit:
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them.
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess. If its model is probably still incorrect, even if we had a direction in which it thought it should mitigate, why should we expect this second attempt to be correct? I disagree presently that agent mitigation is the desirable behavior after model errors.
Perhaps we could have it recalculate past impacts?
Yeah, I have a sense that having the penalty be over the actual history and action versus the plan of no-ops since birth will resolve this issue.
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess.
I agree that if it infers that it did something bad because humans are now moving to shut it down, it should probably just do nothing and let us fix things up. However, it might be a while until the humans move to shut it down, if they don’t understand what’s happened. In this scenario, I think you should see the preservation of ‘errors’ in the sense of the agent’s future under no-ops differing from ‘normality’.
If ‘errors’ happen due to a mismatch between the model and reality, I agree that the agent shouldn’t try to fix them with the bits of the model that are broken. However, I just don’t think that describes many of the things that cause ‘errors’: those can be foreseen natural events (e.g. a San Andreas earthquake if you’re good at predicting earthquakes), unlikely but possible natural events (e.g. a San Andreas earthquake if you’re not good at predicting earthquakes), or unlikely consequences of actions. In these situations, agent mitigation still seems like the right approach to me.