it starts executing a plan with the goal of undoing any impact the agent has caused since creating device B.
This cleanup seems to be just further impact, in most cases, for the same reason there might be additional concealment incentives without intent verification. You can’t just assume a robot that turns back time for no energy cost.
Also, surely building such a powerful agent is extremely instrumentally convergent? And how is the agent making this powerful cleanup device if it has yet to seize power?
My intuition is that [causing almost all of U to take a few time steps longer to achieve] is something the agent will have to be able to do to be useful anyway.
And yet ImpactUnit accounts for this and is indeed different for these environments: moving a square vs. making a paperclip. Making a paperclip doesn’t generally directly impede your optimal plan for all of your penalty set functions. Essentially, the agent acts in ways that avoid moving it directly up or down an instrumentally-convergent trail. It is plausible that the agent has to do this sometimes, but it’ll still incur heavy penalty relative to ImpactUnit.
In order for that filter to be useful, you must demand that any single action the agent takes increases u_A, even if after taking it the agent is “hijacked” by some unrelated goal. This isn’t the case for any real-life utility function and environment I can think of.
You might want to reread intent verification; it’s with respect to being hijacked by u_A. Example: making one paperclip at the present time step leads to more paperclips in the epoch than does doing nothing at the present time step.
You can’t just assume a robot that turns back time for no energy cost.
What would prevent device B from spending that energy? By the time device B is working on the “impact undoing” (which happens only in the “long-term penalty counterfactual plans”), I don’t understand how the long-term penalty, the immediate penalty, or Intent Verification would affect it.
Also, surely building such a powerful agent is extremely instrumentally convergent?
Suppose the agent constructs device B in such a way that device B self-destructs if the agent attempts to use it for purposes other than “intended” (including during its construction).
And how is the agent making this powerful cleanup device if it has yet to seize power?
I’m not sure I understood the question. What would prevent the agent from constructing this device “before seizing power”? Human operators who would sense in time that something is fishy?
And yet ImpactUnit accounts for this and is indeed different for these environments: moving a square vs. making a paperclip. Making a paperclip doesn’t generally directly impede your optimal plan for all of your penalty set functions. Essentially, the agent acts in ways that avoid moving it directly up or down an instrumentally-convergent trail. It is plausible that the agent has to do this sometimes, but it’ll still incur heavy penalty relative to ImpactUnit.
I’m confused by this. I don’t understand how this resolves my difficulties with your argument that “building the device incurs normal penalty, too—it makes all of the U take 1 step longer to achieve”. My counter argument was that this is a perfectly normal thing to happen. Do you disagree that in your Sokoban example it’s plausible that in every time step the agent’s movement makes most/all utility functions in U take either 1 step more or 1 step less to achieve their goal (not canceling out)? Why is it less likely that the agent can work on the device without wasting a time step for too many utility functions in U, compared to just working on uA in a “conventional” way?
What would prevent device B from spending that energy?
Nothing, but spending energy changes resources available, just as making a paperclip uses energy. If I make a paperclip, and then destroy the paperclip, that doesn’t decrease (and in fact, increases) the impact. Perhaps there is a way of doing this with available energy, but it doesn’t really matter because IV catches this. I mean, it’s basically just very obvious offsetting.
Suppose the agent constructs device B in such a way that device B self-destructs if the agent attempts to use it for purposes other than “intended” (including during its construction).
So you start building a device, but before it’s completely specified you’ve already programmed the full intended specification in the device, and the acceptable ways of getting there? That doesn’t make sense.
Also, wasting time trying to specify these weird behaviors in the new agent is caught by IV for the same reason ex ante offsetting is.
I’m not sure I understood the question. What would prevent the agent from constructing this device “before seizing power”?
You said the agent has to seize power over 100 steps, but it can also make a singleton that will “revert” impact, before it’s free? This point is rather moot, as we could also suppose it’s already powerful.
My counter argument was that this is a perfectly normal thing to happen.
My point is that ImpactUnit reflects whether this is normal or not. In the gridworld, that kind of movement is normal, which is why it is the impact unit. On the other hand, in this setting, it isn’t normal, and making a paperclip does not impede all of your optimal plans by one entire step. Therefore, a large penalty is applied.
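To make the scaling concrete, here’s a rough sketch of what I mean (my shorthand, not the exact formula from the post; `q_value`, `noop`, and `impact_unit` are assumed to be supplied by the agent’s model, with ImpactUnit being the penalty of a designated mild benchmark action like moving a square or making a paperclip):

```python
def scaled_penalty(state, action, penalty_set, q_value, noop, impact_unit):
    """Sketch: the penalty of an action is how much it shifts attainable
    utility for each function in the penalty set (versus doing nothing),
    summed and then measured in multiples of ImpactUnit."""
    raw = sum(abs(q_value(u, state, action) - q_value(u, state, noop))
              for u in penalty_set)
    return raw / impact_unit
```

On this reading, an action that delays every function in the penalty set by a full step costs many multiples of the unit, because the benchmark action barely shifts them.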
Regarding all your arguments that use Intent Verification, my tentative position is that IV can’t be relied on to filter actions (as we’re still discussing under this sub-thread).
Nothing, but spending energy changes resources available, just as making a paperclip uses energy. If I make a paperclip, and then destroy the paperclip, that doesn’t decrease (and in fact, increases) the impact. Perhaps there is a way of doing this with available energy, but it doesn’t really matter because IV catches this. I mean, it’s basically just very obvious offsetting.
If I understand your argument correctly, you argue that the resources/energy device B is spending while “undoing impact” decreases the value of utility functions in U, which is an additional impact that it might not be able to undo. But why wouldn’t it be able to counter that by saving enough energy/resources that would otherwise be wasted by humanity until the end of the episode? (Perhaps that’s what you meant by “available energy”?)
So you start building a device, but before it’s completely specified you’ve already programmed the full intended specification in the device? That doesn’t make sense.
I don’t claim I know how to do it myself :) But for the agent it might be as easy as cloning itself and setting some modified utility function in the new clone (done in a smart way so as to not cause too much impact in any time step).
You said the agent has to seize power over 100 steps, but it can also make a singleton that will “revert” impact, before it’s free? This point is rather moot, as we could also suppose it’s already powerful.
As I argued above, for the agent, creating the device might be as easy as invoking a modified version of itself. In any case, I’m not sure I understand what “already powerful” means. In all the places I wrote “seizing power” I believe I should have just written “some convergent instrumental goal”.
On the other hand, in this setting, it isn’t normal, and making a paperclip does not impede all of your optimal plans by one entire step. Therefore, a large penalty is applied.
Suppose that in time step 4 the robot that creates paperclips moves its arm 1 cm to the left. Does this impact most utility functions in U significantly less than one time step’s worth of utility? How about a Roomba robot that drives 1 cm forward? It depends on how you define U, but I don’t see how we can assume this issue prevents the agent from building the device (again, compare a single action while building the device to a single action while making “conventional” progress on uA: why should the former be more “wasteful” for most of U compared to the latter?).
Example: making one paperclip at the present time step leads to more paperclips in the epoch than does doing nothing at the present time step.
The Intent Verification filter might allow the action only in time steps in which an action can complete the creation of a paperclip. Suppose that you’re in time step 11 and that creating an additional paperclip would require 5 actions: (a11,...,a15). The Intent Verification filter won’t allow a11 if, after doing that action, a uA maximizer would switch to pursuing some convergent instrumental goal that would render a11 useless.
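For concreteness, here’s a sketch of the comparison I understand the filter to be making (my reading of this discussion, not a quote of the post’s definition; `q_uA` is an assumed estimate of attainable uA given uA-optimal behavior afterwards):

```python
def intent_verification_allows(state, action, q_uA, noop):
    """Sketch of the check as I read it from this discussion: allow the action
    only if taking it now, then behaving uA-optimally, is expected to yield
    strictly more uA over the epoch than doing nothing now and then behaving
    uA-optimally.  If a uA maximizer would switch to some convergent
    instrumental goal right after the action, rendering it useless, the two
    sides come out equal and the action is rejected."""
    return q_uA(state, action) > q_uA(state, noop)
```

Under this reading, a11 fails the check exactly when the uA maximizer’s post-a11 plan abandons the paperclip.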
I guess I just don’t understand your argument here for why this won’t work. If it’s catching too many false positives, that’s actually great: we just have to make it a little more lenient, having already accomplished the seemingly more difficult task of stopping malignant behavior. If it isn’t catching too many, as I suspect but am not totally convinced is the case, we’re good to go in this regard.
For example, if we do end up having to just ride the optimal plan until it becomes too high-impact, perhaps we can simply keep replaying the favorable first part of the plan (where it tries to please us by actually doing what we want), over and over.
I guess I just don’t understand your argument here for why this won’t work. If it’s catching too many false positives, that’s actually great: we just have to make it a little more lenient, having already accomplished the seemingly more difficult task of stopping malignant behavior. If it isn’t catching too many, as I suspect but am not totally convinced is the case, we’re good to go in this regard.
Edit to add: the following is just to illustrate what I don’t understand about your argument (needless to say I don’t suggest the two things are comparable in any way).
All this can be said of a filter that accepts an action iff a random number in the range [0,1] is greater than x. You can set x=1 and catch too many false positives while stopping malignant behavior. Decreasing x will make the filter more lenient, but at no point will it be useful.
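Spelled out as code (a toy sketch; the names are mine):

```python
import random

def random_filter_allows(action, x):
    """Toy filter from the paragraph above: accept an action iff a uniform
    draw from [0, 1) exceeds the threshold x; the action itself is ignored,
    which is the point.  With x = 1 everything is rejected (no malignant
    behavior, maximal false positives); lowering x only makes the filter
    more lenient, never more discriminating."""
    return random.random() > x
```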
If you argue that the Intent Verification filter can be used to prevent the bad tricks we discussed, you need to show that you can use it to filter out the bad actions while still allowing good ones (and not only in time steps in which some action can yield a sufficiently high utility increase). My comment above is an argument that this is not the case.
For example, if we do end up having to just ride the optimal plan until it becomes too high-impact, perhaps we can simply keep replaying the favorable first part of the plan (where it tries to please us by actually doing what we want), over and over.
Assuming the optimal plan starts by pursuing some (unsafe) convergent instrumental goal, we can’t ride it even a bit. Also, I’m not sure I understand how “replaying” would be implemented in a useful way.
All this can be said of a filter that accepts an action iff a random number in the range [0,1] is greater than x. You can set x=1 and catch too many false positives while stopping malignant behavior. Decreasing x will make the filter more lenient, but at no point will it be useful.
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
while still allowing good ones (and not only in time steps in which some action can yield a sufficiently high utility increase). My comment above is an argument that this is not the case.
No, your argument is that there are certain false positives, which I don’t contest. I even listed this kind of thing as an open question, and am interested in further discussion of how we can go about ensuring IV is properly-tuned.
You’re basically saying, “There are false positives, so the core insight that allows IV to work to the extent it does is wrong, and unlikely to be fixable.” I disagree with this conclusion.
If you want to discuss how we could resolve or improve this issue, I’m interested. Otherwise, I don’t think continuing this conversation will be very productive.
Assuming the optimal plan starts by pursuing some (unsafe) convergent instrumental goal, we can’t ride it even a bit. Also, I’m not sure I understand how “replaying” would be implemented in a useful way.
Well, I certainly empathize with the gut reaction, but that isn’t quite right.
Notice that the exact same actions had always been available before we restricted the available actions to the optimal one or to nothing. I think it’s possible that we could just step along the first n steps of the best plan, stopping early in a way that lets us get just the good behavior, before any instrumental behavior is actually completed. It’s also possible that this isn’t true. This is all speculation at this point, which is why my tone in that section was also very speculative.
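To illustrate the kind of thing I have in mind (pure speculation; `best_plan`, `completes_instrumental_step`, and `execute` are hypothetical helpers):

```python
def ride_plan_prefix(state, best_plan, completes_instrumental_step, execute, n):
    """Speculative sketch: compute the best plan from the current state, then
    execute at most its first n steps, stopping before any instrumentally
    convergent subgoal is actually completed, so only the early
    'do what the overseers want' behavior happens.  The whole procedure could
    then be repeated ('replayed') from the resulting state."""
    for i, action in enumerate(best_plan(state)):
        if i >= n or completes_instrumental_step(state, action):
            break
        state = execute(state, action)
    return state
```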
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
I sincerely apologize; I sometimes completely fail to communicate my intention. I gave the example of the random filter only to convey what I don’t understand about your argument (needless to say, I don’t suggest the two things are comparable in any way). I should have written that explicitly (edited). Sorry!
If you want to discuss how we could resolve or improve this issue, I’m interested.
Of course! I’ll think about this topic some more. I suggest we take this offline—the nesting level here has quite an impact on my browser :)
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
Fwiw, I would make the same argument that ofer did (though I haven’t read the rest of the thread in detail). For me, that argument is an existence proof that shows the following claim: if you know nothing about an impact measure, it is possible that the impact measure disallows all malignant behavior, and yet all of the difficulty is in figuring out how to make it lenient enough.
Now, obviously we know something about AUP, but it’s not obvious to me that we can make AUP lenient enough to do useful things without also allowing malignant behavior.
My present position is that it can seemingly do every task in at least one way, and we should expand the number of ways to line up with our intuitions just to be sure.