But perhaps I’m being unreasonable, and there are some hypothetical worlds and goals for which this argument doesn’t work. Here’s why I think the method is generally sufficient: suppose that the objective cannot be completed at all without doing some high-impact plan. Then by N-incrementing, the first plan that reaches the goal will be the minimal plan that has this necessary impact, without the extra baggage of unnecessary, undesirable effects.
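As a minimal sketch of the N-incrementing loop I have in mind (where `candidate_plans`, `achieves_goal`, and `impact_of` are hypothetical stand-ins for the agent’s plan search, the goal check, and the impact measure, not anything from the formalism itself):

```python
def first_plan_reaching_goal(candidate_plans, impact_of, achieves_goal):
    """Sketch of N-incrementing: raise the allowed impact budget one unit at a
    time and return the first plan that reaches the goal within budget. The
    first budget at which any plan succeeds picks out a plan with (roughly)
    the minimal necessary impact, without extra high-impact side effects.
    Assumes at least one candidate plan achieves the goal."""
    budget = 0
    while True:
        feasible = [plan for plan in candidate_plans
                    if achieves_goal(plan) and impact_of(plan) <= budget]
        if feasible:
            return min(feasible, key=impact_of)
        budget += 1  # N-increment: allow one more unit of impact
```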
This is only convincing to the extent that I buy into AUP’s notion of impact. My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact) and is not analytically identical to the core thing that I care about (human ability to achieve goals that humans plausibly care about), but may well turn out to be fine if I considered it for a long time.
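For concreteness, the notion of impact I’m reacting to is, in my own paraphrase (which may differ in details from the definition in the post), a sum of absolute changes in attainable utility relative to inaction:

$$\text{Penalty}(s, a) \;=\; \sum_{i=1}^{n} \bigl|\, Q_{u_i}(s, a) - Q_{u_i}(s, \varnothing) \,\bigr|,$$

where the $u_i$ are auxiliary utility functions and $\varnothing$ is the no-op. My worry is that this quantity can be large for actions I intuitively consider low-impact, and that it isn’t analytically the same thing as the human ability to achieve goals that humans plausibly care about.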
I’m mostly confused because there’s substantial focus on the fact that AUP penalizes specific plans (although I definitely agree that some hypothetical measure which assigned impact according to our exact intuitions would be better than one that’s conservative), instead of realizing that AUP can seemingly do whatever we need in some way (for which I think I give a pretty decent argument above), and also has nice properties to work with in general (like seemingly not taking off, not acausally cooperating, not acting to survive, etc.). I’m cautiously hopeful that these properties are going to open really important doors.
I agree that those properties of AUP are pretty nice and demonstrate a significant advance in the state of the art for impact regularisation, and I did indeed put that in my first bullet point of what I thought of AUP, although I guess I didn’t have much to say about it.
Yes, but I think this can be fixed by just not allowing dumb agents near really high impact opportunities. By the time that they would be able to purposefully construct a plan that is high impact to better pursue their goals, they already (by supposition) have enough model richness to plot the consequences, so I don’t see how this is a non-trivial risk.
This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways (see sibling comment) and (b) even with a good model, presumably if it’s run for a long time there might be at least one error, and I’m inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time. However, I think the stronger objection here is the ‘natural disaster’ category (which might include an actuator in the AUP agent going haywire or any number of things).
Because I claim [that saving humanity from natural disasters] is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.
Note that AUP would not even notify humans that such a natural disaster was happening if it thought that humans would solve the natural disaster iff they were notified. In general, AFAICT, if you have a natural-disaster warning AUP agent, then it’s allowed to warn humans of a natural disaster iff it’s allowed to cause a natural disaster (I think even impact verification doesn’t prevent this, if you imagine that causing a natural disaster is an unforeseen maximum of the agent’s utility function). This seems like a failure mode that impact regularisation techniques ought to prevent. I also have a different reaction to this section in the sibling comment.
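To spell out that symmetry: writing $V_i(\cdot)$ for the attainable utility of auxiliary goal $u_i$ under a given future (my notation, not the post’s), warning humans in a world headed for disaster and causing a disaster in a world not headed for one both move the world between roughly the same two futures, so both actions incur a penalty of roughly

$$\sum_i \bigl|\, V_i(\text{disaster future}) - V_i(\text{no-disaster future}) \,\bigr|,$$

and hence, as far as the impact penalty is concerned, are permitted or forbidden together.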
My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact)
I think it should be quite possible for us to de-sketchify the impact measure in the ways you pointed out. Up to now, I focused more on ensuring that there aren’t errors of the other type: where high impact plans sneak through as low impact. I’m currently not aware of any, although that isn’t to say they don’t exist.
Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic. I don’t think it unlikely that there exist better, cleaner formulations of what I provided. Perhaps they somehow don’t have the bothersome false positives you’ve pointed out. After all, compared to many folks in the community, I’m fairly mathematically inexperienced, and have only been working on this for a relatively short amount of time.
This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways
What is “this” here (for a)?
I’m inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time
But AUP’s plans are shutdown-safe? I think I misunderstand.
then it’s allowed to warn humans of a natural disaster iff it’s allowed to cause a natural disaster
I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters (unless it also wanted to save us from these, but it seems like this would only happen for higher impact levels and would be discouraged by approval incentives).
In general, I expect AUP to also work for disaster prevention, as long as its own survival isn’t affected. One complication is that we would have to allow it to remain on, even if it didn’t save us from disasters, but shut it off if it caused any. I think that’s pretty reasonable, as we expect our low impact agents to not do anything sometimes.
Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic.
To be frank, although I do like the fact that there’s a nice concrete candidate definition of impact, I am not excited by it by more than a factor of two over other candidate impact definitions, and would not say that it encapsulates what I think impact is.
… (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways
What is “this” here (for a)?
“This” is “upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline”, and it’s what I mean by “ungracefully failing if the protocol stops being followed at any one point in time”.
then it’s allowed to warn humans of a natural disaster iff it’s allowed to cause a natural disaster
I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters
Regarding whether AUP agents would prevent natural disasters: AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent’s ability to achieve a wide variety of goals.
Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn’t taking as an assumption that you were making.
Regarding the lack of incentive to cause disasters: in my head, the point of impact regularisation techniques is to stop agents from doing something crazy in cases where doing something crazy is an unforeseen convenient way for the agent to achieve its objective. As such, I consider it fair game to consider cases where there is an unforeseen incentive to do crazy things, if the argument generalises over a wide variety of craziness, which I think this one does sort of OK.
“This” is “upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline”, and it’s what I mean by “ungracefully failing if the protocol stops being followed at any one point in time”.
Huh? So if the safety measure stops working for some reason, it’s no longer safe? But if it does make a mistake, it’s more inclined to allow us to shut it down. Compare this to an offsetting approach, where it can keep doing and undoing things to an arbitrarily-large degree.
AUP agent does a big thing and bites the penalty. If that big thing was bad, we shut it down. Why would you instead prefer that it keep doing things to make up for it, when its model wasn’t even good enough to predict we wouldn’t like it?
This feels like an odd standard, where you say “but maybe it randomly fails and then doesn’t work”, or “it can’t anticipate things it doesn’t know about”. While these are problems, they aren’t for low impact to resolve, but the approach also happens to help anyways.
AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent’s ability to achieve a wide variety of goals.
This is true. It depends what the scale is—I had “remote local disaster” in mind, while you maybe had x-risk.
[Note that we could Bayes-update off of its canary in general, if we trust its model to an extent. This also deserves exploration as a binary “extinction?” oracle, with the sequential deployment of agents allowing mitigation of specific model flaws.]
Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn’t taking as an assumption that you were making.
We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny (although this is possible; I have a rough intuition that the strength of approval will override the fairly high likelihood of getting away with it). Making yourself purposefully opaque seems convergent.
This feels like an odd standard, where you say “but maybe it randomly fails and then doesn’t work”, or “it can’t anticipate things it doesn’t know about”.
I want to point to the difference between behavioural cloning and reward methods for the problem of learning locomotion for robots. Behavioural cloning is where you learn what a human would do in any situation and act that way, while reward methods take a reward function (either learned or specified) that encourages locomotion and learn to maximise that reward function. An issue with behavioural cloning is that it’s unstable: if you get what the human would do slightly wrong, then you move to a state the human is less likely to be in, so your model gets worse, so you’re more likely to act incorrectly (both in the sense of a higher probability of incorrect actions and of more extremely incorrect actions), and so you go to even more unusual states, and so on. In contrast, reward methods promise to be more stable, since the Q-values generated by the reward function tend to remain valid even in unusual states. This is the story that I’ve heard for why behavioural cloning techniques are less prominent[*] than reward methods. In general, it’s bad if your machine learning technique amplifies rather than mitigates errors, either during training or during execution.
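Here’s a toy simulation of that instability story, just to make the compounding-error picture concrete; the error model and all of the numbers are invented for illustration and don’t correspond to any real training setup:

```python
import random

def cloning_rollout(steps, base_error, error_growth):
    """Toy behavioural-cloning story: each mistake pushes the agent into less
    familiar states, where its per-step error rate grows, so errors compound."""
    p_error, mistakes = base_error, 0
    for _ in range(steps):
        if random.random() < p_error:
            mistakes += 1
            p_error = min(1.0, p_error * error_growth)  # worse in unfamiliar states
    return mistakes

def reward_rollout(steps, base_error):
    """Toy reward-based policy: the per-step error rate stays roughly constant,
    standing in for Q-values that remain informative in unusual states."""
    return sum(random.random() < base_error for _ in range(steps))

random.seed(0)
trials = 1000
print("cloning:", sum(cloning_rollout(200, 0.02, 1.5) for _ in range(trials)) / trials)
print("reward: ", sum(reward_rollout(200, 0.02) for _ in range(trials)) / trials)
```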
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them. This is in contrast to methods that measure divergence from the starting state, or from what the world would be like had the agent only performed no-ops after the starting state, which tend to mitigate these ‘errors’. At any rate, even if no other method mitigated these ‘errors’, I would still want them to be mitigated.
It depends what the scale is—I had “remote local disaster” in mind, while you maybe had x-risk.
I wasn’t necessarily imagining x-risk, but maybe something like an earthquake along the San Andreas fault, disrupting the San Franciscan engineers that would be supervising the agents.
We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny.
My impression is that most machine learning systems are extremely opaque to currently available analysis tools in the relevant fashion. I think that work to alleviate this opacity is extremely important, but not something that I would assume without mentioning it.
[*] Work is in fact done on behavioural cloning today, but with attempts to increase its stability.
Perhaps we could have it recalculate past impacts? It seems like that could maybe lead to it regaining ability to act, which could also be negative.
Edit:
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them.
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess. If its model is probably still incorrect, even if we had a direction in which it thought it should mitigate, why should we expect this second attempt to be correct? I disagree presently that agent mitigation is the desirable behavior after model errors.
Perhaps we could have it recalculate past impacts?
Yeah, I have a sense that having the penalty be over the actual history and action versus the plan of no-ops since birth will resolve this issue.
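A sketch of the difference I have in mind, where `attainable(history)` is a hypothetical stand-in returning the vector of attainable utilities after following `history` (none of these names come from the post):

```python
def stepwise_penalty(attainable, history, action, noop):
    """Roughly the current behaviour: compare taking `action` against taking a
    no-op from the *current* state, so any impact already in `history` becomes
    part of the baseline instead of something to keep accounting for."""
    return sum(abs(u_act - u_noop)
               for u_act, u_noop in zip(attainable(history + [action]),
                                        attainable(history + [noop])))

def since_birth_penalty(attainable, history, action, noop):
    """Suggested variant: compare against a counterfactual in which the agent
    had only ever taken no-ops since activation, so past impacts keep showing
    up in the penalty rather than being absorbed into the baseline."""
    counterfactual = [noop] * (len(history) + 1)
    return sum(abs(u_act - u_base)
               for u_act, u_base in zip(attainable(history + [action]),
                                        attainable(counterfactual)))
```

The second version keeps counting past impacts, which is why I expect it to address the preservation-of-errors worry.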
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess.
I agree that if it infers that it did something bad because humans are now moving to shut it down, it should probably just do nothing and let us fix things up. However, it might be a while until the humans move to shut it down, if they don’t understand what’s happened. In this scenario, I think you should see the preservation of ‘errors’ in the sense of the agent’s future under no-ops differing from ‘normality’.
If ‘errors’ happen due to a mismatch between the model and reality, I agree that the agent shouldn’t try to fix them with the bits of the model that are broken. However, I just don’t think that that describes many of the things that cause ‘errors’: those can be foreseen natural events (e.g. a San Andreas earthquake, if you’re good at predicting earthquakes), unlikely but possible natural events (e.g. a San Andreas earthquake, if you’re not good at predicting earthquakes), or unlikely consequences of actions. In these situations, agent mitigation still seems like the right approach to me.