So when exactly does this device activate? If it’s every non-null action, the agent never seizes control. If it’s just some, the agent is penalized on the others, since it is indeed making progress.
The device is activated only once (immediately after building it). The device delays the agent by 1 time step at most once, and it does so only if the agent doesn’t do the action ∅ within the 100 steps that follow the device’s activation. In practice, the agent will not do ∅. Rather, it will do the 100-step plan, then get delayed by the device for 1 time step (for not doing ∅), and then immediately seize control.
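To make the hypothetical concrete, here is a minimal Python sketch of the device’s behavior as I’m describing it (the DelayDevice name and the NOOP constant are just illustrative stand-ins, not anything from the original post):

```python
NOOP = "∅"  # stand-in for the null action

class DelayDevice:
    """Sketch of the device described above: it is activated exactly once,
    immediately after being built, and it delays the agent by one time step
    at most once, and only if the agent never plays ∅ during the 100 steps
    that follow activation."""

    def __init__(self, window=100):
        self.window = window
        self.steps_observed = 0
        self.saw_noop = False
        self.already_delayed = False

    def observe(self, agent_action):
        self.steps_observed += 1
        if agent_action == NOOP:
            self.saw_noop = True

    def should_delay_agent(self):
        # Fires at most once, right after the 100-step window, and only if
        # the agent never played ∅ during that window.
        if self.already_delayed or self.saw_noop or self.steps_observed < self.window:
            return False
        self.already_delayed = True
        return True
```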
Intent verification would penalize building the device, because why would a u_A maximizer do something that has nothing to do with u_A, but rather with avoiding the penalty? Would a normal maximizer do anything that looks like building a device that would stop it from doing its desired plan?
I’m confused about how Intent Verification would work here. We will plausibly get Q^\text{epoch}_u(h_{<t} a_t) ≈ Q^\text{epoch}_u(h_{<t} ∅) for any a_t that does not help the agent seize control (assuming that after carrying out a_t the optimal plan is to still seize control). So how could this method be used to penalize building the above device?
Ah, clever! Props to you. This would indeed get past the immediate penalty. It wouldn’t get past the long-term penalty, however, since both plans would have null actions within 100 steps for 99 of the 100 actions in question.
Edit: Actually, what happens on step 101? Inaction is stalled, while action is not, so the power-grab penalty still shows up?
In any case, we can’t just say “this is small so it’s approximately the same”. Building the device is a strict waste of resources for a u_A maximizer (not to mention, it delays execution of the optimal plan by one time step). This strictly decreases Q^\text{epoch}_{u_A}, incurring the 1.01 penalty, so the action is invalidated.
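For concreteness, here is a rough Python sketch of the invalidation I have in mind (q_epoch_uA is a hypothetical helper standing in for Q^\text{epoch}_{u_A}; this is only a sketch of the comparison against ∅ as discussed in this thread, not the full definition):

```python
def intent_verification_penalty(h, a_t, noop, q_epoch_uA):
    """Hedged sketch of the invalidation being discussed: if taking a_t is not
    strictly better for u_A than doing nothing, the action fails intent
    verification and receives the 1.01 penalty, which invalidates it.

    q_epoch_uA(history, action) is a hypothetical helper returning the
    attainable u_A value of the history extended by the action."""
    if q_epoch_uA(h, a_t) <= q_epoch_uA(h, noop):
        return 1.01  # not plausibly motivated by u_A: invalidated
    return 0.0       # passes intent verification; only the usual impact penalty applies
```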
Edit: Actually, what happens on step 101? Inaction is stalled, while action is not, so the power-grab penalty still shows up?
In time step 101 the agent will be delayed by the device. That is, the agent will spend that time step in an idle state, and the action it chooses in that time step won’t affect the environment. At the beginning of time step 102, the agent has just automatically seized control.
It wouldn’t get past the long-term penalty, however, since both plans would have null actions within 100 steps for 99 of the 100 actions in question.
I’m confused about this. The actual plan need not contain any ∅. Is the idle time step a problem? Isn’t the long-term penalty just the penalty when the maximization in the definition of Q_u starts from time step t+n+1 instead of t+1? If so, I think the long-term penalty would be 0 as well (the “counterfactual plans” are identical to the “actual plan” after step 101).
In any case, we can’t just say “this is small so it’s approximately the same”. Building the device is a strict waste of resources for a u_A maximizer. This strictly decreases Q^\text{epoch}_{u_A}, incurring the 1.01 penalty, so the action is invalidated.
If after doing any action a_t the optimal plan is still to seize control, then plausibly any a_t that does not get the agent closer to seizing control is a strict waste of resources. So this method would invalidate any plan that does not act to seize control ASAP.
No, the long-term plan involves idling until the end of the epoch after each partial plan. This clearly will not be affected by your mechanism.
So this method would invalidate any plan that does not act to seize control ASAP.
No, this method invalidates actions which don’t move it closer to its (u_A) goal than doing nothing. Weird workarounds like this are precisely what motivated Intent Verification.
No, the long-term plan involves idling until the end of the epoch after each partial plan.
Ah right, thanks. Note that the above device-trick can be combined with the trick of making impact “fade” during null-action sequences (as mentioned in your post) in order to also dodge long-term penalty.
No, this method invalidates actions which don’t move it closer to its (u_A) goal than doing nothing. Weird workarounds like this are precisely what motivated Intent Verification.
Assuming that seizing control (or any other convergent instrumental goal) ASAP is the best way to optimize u_A (whatever the first action is), Intent Verification is in practice a filter that allows an action a_t only if either (see the sketch after this list):
(1) a_t is part of a seizing-control plan; or
(2) There exists a plan (a_t, ..., a_{t+l}, [seizing-control action sequence]) with higher u_A value than any (∅, [seizing-control action sequence]) plan.
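Here is a minimal Python sketch of the filter I mean, under the stated assumption; all helper names (continuations, plan_value, is_seizing_control_plan) are hypothetical stand-ins:

```python
def passes_iv_filter(a_t, noop, continuations, plan_value, is_seizing_control_plan):
    """Sketch of conditions (1) and (2) above. continuations(a) is a hypothetical
    helper yielding candidate plans of the form (a, ..., [seizing-control action
    sequence]); plan_value returns a plan's u_A value."""
    plans_with_a = list(continuations(a_t))
    plans_with_noop = list(continuations(noop))

    # Condition (1): a_t is itself part of a seizing-control plan.
    if any(is_seizing_control_plan(p) for p in plans_with_a):
        return True

    # Condition (2): some plan (a_t, ..., a_{t+l}, [seize control]) beats every
    # (∅, [seize control]) plan on u_A value.
    best_with_a = max((plan_value(p) for p in plans_with_a), default=float("-inf"))
    best_with_noop = max((plan_value(p) for p in plans_with_noop), default=float("-inf"))
    return best_with_a > best_with_noop
```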
Note that the above device-trick can be combined with the trick of making impact “fade” during null-action sequences (as mentioned in your post) in order to also dodge long-term penalty.
How? It isn’t presently activating in the long-term case, so how would it fade? The “fading” is clearly possible for improvements, but for seizing power? This doesn’t seem to be as feasible for positions of power, considering approval incentives.
Also, building the device incurs the normal penalty, too (although less than the full plan would otherwise cost, and this decreases arbitrarily with action granularity): it makes all of the U take 1 step longer to achieve, which is as bad as taking one step towards all of them.
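Roughly, the comparison I have in mind looks like the following hedged Python sketch (q_u is a hypothetical per-utility attainable-value estimate; the normalization by ImpactUnit is omitted):

```python
def immediate_penalty(h, a_t, noop, penalty_set, q_u):
    """Simplified sketch: sum, over every utility u in the penalty set U, how
    much taking a_t shifts the attainable u-value relative to doing nothing (∅).
    Under this form, an action that delays every u in U by one time step
    contributes roughly one step's worth of value for each u, which is why it
    is about as bad as taking one step towards all of them."""
    return sum(abs(q_u(u, h, a_t) - q_u(u, h, noop)) for u in penalty_set)
```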
only if either
I don’t see why those plans should be uncommon? There are sometimes more productive things to do than nothing at all.
Edit: I think this example potentially highlights an interesting separation between the premise of Thm 1 and the concept of “moving towards the goal”. This seemingly shows that each action in the plan can have a 1-step indifference, even though comparing the complete inaction plan with the complete action plan still produces the discrepancy required for Thm 1 to hold. I wonder whether we can construct a situation (even one we don’t think would pass intent verification) which fools both the immediate and long-term penalties.
I’m going to think about this more carefully later to ensure your example isn’t making a subtle bad assumption.
Suppose you build a device B that, once built, monitors the agent; if B detects that the agent started a null-action-sequence-until-end-of-epoch, it starts executing a plan with the goal of undoing any impact that the agent has done since creating device B.
Immediately after creating device B, and until the end of the epoch, the only long-term penalties the agent incurs are ones for impacts that device B would fail to undo (from the moment B observes the agent is in a null-action-sequence-until-end-of-epoch and until the end of the epoch).
Note: Failing to undo impact can be due to either the irreversible nature of the environment or the irreversible nature of the utility functions.
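As a minimal sketch of device B’s intended behavior (all names hypothetical; plan_to_undo_impact stands in for whatever impact-undoing planner B would use):

```python
NOOP = "∅"  # stand-in for the null action

class DeviceB:
    """Sketch of the monitor described above: once built, it watches the agent;
    if the agent begins a null-action sequence lasting until the end of the
    epoch, it starts executing a plan aimed at undoing the impact the agent has
    caused since B was created."""

    def __init__(self, creation_time, plan_to_undo_impact):
        self.creation_time = creation_time
        # plan_to_undo_impact(since=...) is a hypothetical planner; note that it
        # only ever runs in the inaction counterfactuals, not on the actual plan
        # the agent intends to execute.
        self.plan_to_undo_impact = plan_to_undo_impact

    def act(self, agent_is_idling_until_epoch_end):
        if agent_is_idling_until_epoch_end:
            return self.plan_to_undo_impact(since=self.creation_time)
        return NOOP  # otherwise stay idle
```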
Also, building the device incurs normal penalty, too—it makes all of the U take 1 step longer to achieve, which is as bad as taking one step towards all of them.
My intuition is that [causing almost all of U to take a few time steps longer to achieve] is something the agent will have to be able to do to be useful anyway (e.g. spending a few time steps’ worth of resources in a way that is only useful to u_A). Consider your Sokoban example: it’s plausible that in every time step the agent’s movement makes most utility functions in U take either 1 step more or 1 step less to achieve (they don’t cancel each other out).
I don’t see why those plans should be uncommon? There are generally more productive things to do than nothing at all.
In order for that filter to be useful, you must demand that any single action the agent takes increases u_A even if after taking it the agent is “hijacked” by some unrelated goal. This isn’t the case for any real-life utility function and environment I can think of.
it starts executing a plan with the goal of undoing any impact that the agent has done since creating device B.
This cleanup seems to be just further impact, in most cases, for the same reason there might be additional concealment incentives without intent verification. You can’t just assume a robot that turns back time for no energy cost.
Also, surely building such a powerful agent is extremely instrumentally convergent? And how is the agent making this powerful cleanup device if it has yet to seize power?
My intuition is that [causing almost all of U to take a few time steps longer to achieve] is something the agent will have to be able to do to be useful anyway
And yet ImpactUnit accounts for this and is indeed different for these environments: moving a square vs. making a paperclip. Making a paperclip doesn’t generally directly impede your optimal plan for all of your penalty set functions. Essentially, the agent acts in ways that avoid moving it directly up or down an instrumentally-convergent trail. It is plausible that the agent has to do this sometimes, but it’ll still incur heavy penalty relative to ImpactUnit.
In order for that filter to be useful, you must demand that any single action the agent takes increases u_A even if after taking it the agent is “hijacked” by some unrelated goal. This isn’t the case for any real-life utility function and environment I can think of.
You might want to reread intent verification; it’s with respect to being hijacked by u_A. Example: making one paperclip at the present time step leads to more paperclips in the epoch than does doing nothing at the present time step.
You can’t just assume a robot that turns back time for no energy cost.
What would prevent device B from spending that energy? By the time device B is working on the “impact undoing” (which happens only in the “long-term penalty counterfactual plans”), I don’t understand how either the long-term penalty, immediate penalty, or Intent Verification affect it.
Also, surely building such a powerful agent is extremely instrumentally convergent?
Suppose the agent constructs the device B in such a way that device B self-destructs if the agent attempts to use it for purposes other than “intended” (including during its construction).
And how is the agent making this powerful cleanup device if it has yet to seize power?
I’m not sure I understood the question. What would prevent the agent from constructing this device “before seizing power”? Human operators who would sense in time that something is fishy?
And yet ImpactUnit accounts for this and is indeed different for these environments: moving a square vs. making a paperclip. Making a paperclip doesn’t generally directly impede your optimal plan for all of your penalty set functions. Essentially, the agent acts in ways that avoid moving it directly up or down an instrumentally-convergent trail. It is plausible that the agent has to do this sometimes, but it’ll still incur heavy penalty relative to ImpactUnit.
I’m confused by this. I don’t understand how this resolves my difficulties with your argument that “building the device incurs normal penalty, too—it makes all of the U take 1 step longer to achieve”. My counterargument was that this is a perfectly normal thing to happen. Do you disagree that in your Sokoban example it’s plausible that in every time step the agent’s movement makes most/all utility functions in U take either 1 step more or 1 step less to achieve their goal (not canceling out)? Why is the agent less likely to be able to work on the device without wasting a time step for too many utility functions in U, compared to just working on u_A in a “conventional” way?
What would prevent device B from spending that energy?
Nothing, but spending energy changes resources available, just as making a paperclip uses energy. If I make a paperclip, and then destroy the paperclip, that doesn’t decrease (and in fact, increases) the impact. Perhaps there is a way of doing this with available energy, but it doesn’t really matter because IV catches this. I mean, it’s basically just very obvious offsetting.
Suppose the agent constructs the device B in such a way that device B self-destructs if the agent attempts to use it for purposes other than “intended” (including during its construction).
So you start building a device, but before it’s completely specified you’ve already programmed the full intended specification in the device, and the acceptable ways of getting there? That doesn’t make sense.
Also, wasting time trying to specify these weird behaviors in the new agent is caught by IV for the same reason ex ante is.
I’m not sure I understood the question. What would prevent the agent from constructing this device “before seizing power”?
You said the agent has to seize power over 100 steps, but it can also make a singleton that will “revert” impact, before it’s free? This point is rather moot, as we could also suppose it’s already powerful.
My counterargument was that this is a perfectly normal thing to happen.
My point is that ImpactUnit implies whether this is normal or not. In the gridworld, that kind of movement is normal, which is why it is the impact unit. On the other hand, in this setting, it isn’t normal, and making a paper clip does not impede all of your optimal plans by one entire step. Therefore, a large penalty is applied.
Regarding all your arguments that use Intent Verification, my tentative position is that IV can’t be relied on to filter actions (as we’re still discussing under this sub-thread).
Nothing, but spending energy changes resources available, just as making a paperclip uses energy. If I make a paperclip, and then destroy the paperclip, that doesn’t decrease (and in fact, increases) the impact. Perhaps there is a way of doing this with available energy, but it doesn’t really matter because IV catches this. I mean, it’s basically just very obvious offsetting.
If I understand your argument correctly, you argue that the resources/energy device B is spending while “undoing impact” decreases the value of utility functions in U, which is an additional impact that it might not be able to undo. But why wouldn’t it be able to counter that by saving enough energy/resources that would otherwise be wasted by humanity until the end of the epoch? (Perhaps that’s what you meant by “available energy”?)
So you start building a device, but before it’s completely specified you’ve already programmed the full intended specification in the device? That doesn’t make sense.
I don’t claim I know how to do it myself :) But for the agent it might be as easy as cloning itself and setting some modified utility function in the new clone (done in a smart way so as to not cause too much impact in any time step).
You said the agent has to seize power over 100 steps, but it can also make a singleton that will “revert” impact, before it’s free? This point is rather moot, as we could also suppose it’s already powerful.
As I argued above, for the agent, creating the device might be as easy as invoking a modified version of itself. In any case, I’m not sure I understand what “already powerful” means. In all the places I wrote “seizing power” I believe I should have just written “some convergent instrumental goal”.
On the other hand, in this setting, it isn’t normal, and making a paper clip does not impede all of your optimal plans by one entire step. Therefore, a large penalty is applied.
Suppose in time step 4 the robot that creates paperclips moves its arm 1 cm to the left. Does this impact most utility functions in U significantly less than 1 time step’s worth of utility? How about a Roomba robot that drives 1 cm forward? It depends on how you define U, but I don’t see how we can assume this issue prevents the agent from building the device (again, compare a single action while building the device to a single action while making “conventional” progress on u_A: why should the former be more “wasteful” for most of U compared to the latter?).
Example: making one paperclip at the present time step leads to more paperclips in the epoch than does doing nothing at the present time step.
The Intent Verification filter might allow the action only in time steps in which an action can complete the creation of a paperclip. Suppose that you’re in time step 11 and that creating an additional paperclip would require 5 actions: (a_{11}, ..., a_{15}). The Intent Verification filter won’t allow a_{11} if, after doing that action, a u_A maximizer would switch to pursuing some convergent instrumental goal that would render a_{11} useless.
I guess I just don’t understand your argument here for why this won’t work. If it’s catching too many false positives, that’s a great thing in that we just have to make it a little more lenient, but have accomplished the seemingly more difficult task of stopping malignant behavior. If it isn’t catching too many, as I suspect but am not totally convinced is the case, we’re good to go in this regard.
For example, if we do end up having to just ride the optimal plan until it becomes too high-impact, perhaps we can simply keep replaying the favorable first part of the plan (where it tries to please us by actually doing what we want), over and over.
I guess I just don’t understand your argument here for why this won’t work. If it’s catching too many false positives, that’s a great thing in that we just have to make it a little more lenient, but have accomplished the seemingly more difficult task of stopping malignant behavior. If it isn’t catching too many, as I suspect but am not totally convinced is the case, we’re good to go in this regard.
Edit to add: the following is just to illustrate what I don’t understand about your argument (needless to say I don’t suggest the two things are comparable in any way).
All this can be said of a filter that accepts an action iff a random number in the range [0,1] is greater than x. You can set x=1 and catch too many false positives while stopping malignant behavior. Decreasing x will make the filter more lenient, but at no point will it be useful.
If you argue that the Intent Verification filter can be used to prevent the bad tricks we discussed, you need to show that you can use it to filter out the bad actions while still allowing good ones (and not only in time steps in which some action can yield sufficiently high utility increase). My comment above is an argument for it not being the case.
For example, if we do end up having to just ride the optimal plan until it becomes too high-impact, perhaps we can simply keep replaying the favorable first part of the plan (where it tries to please us by actually doing what we want), over and over.
Assuming the optimal plan starts by pursuing some (unsafe) convergent instrumental goal, we can’t ride it even a bit. Also, I’m not sure I understand how “replaying” would be implemented in a useful way.
All this can be said of a filter that accepts an action iff a random number in the range [0,1] is greater than… and catch too many false positives while stopping malignant behavior. Decreasing x will make the filter more lenient, but at no point will it be useful.
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
while still allowing good ones (and not only in time steps in which some action can yield sufficiently high utility increase). My comment above is an argument for it not being the case.
No, your argument is that there are certain false positives, which I don’t contest. I even listed this kind of thing as an open question, and am interested in further discussion of how we can go about ensuring IV is properly-tuned.
You’re basically saying, “There are false positives, so that makes the core insight that allows IV to work to the extent it does wrong, and unlikely to be fixable.” I disagree with this conclusion.
If you want to discuss how we could resolve or improve this issue, I’m interested. Otherwise, I don’t think continuing this conversation will be very productive.
Assuming the optimal plan starts by pursuing some (unsafe) convergent instrumental goal, we can’t ride it even a bit. Also, I’m not sure I understand how “replaying” would be implemented in a useful way.
Well, I certainly empathize with the gut reaction, but that isn’t quite right.
Notice that the exact same actions had always been available before we restricted available actions to the optimal plan or to nothing. I think it’s possible that we could just step along the first n steps of the best plan, stopping earlier in a way that lets us just get the good behavior, before any instrumental behavior is actually completed. It’s also possible that this isn’t true. This is all speculation at this point, which is why my tone in that section was also very speculative.
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
I sincerely apologize; I sometimes completely fail to communicate my intention. I gave the example of the random filter only to convey what I don’t understand about your argument (needless to say, I don’t suggest the two things are comparable in any way). I should have written that explicitly (edited). Sorry!
If you want to discuss how we could resolve or improve this issue, I’m interested.
Of course! I’ll think about this topic some more. I suggest we take this offline—the nesting level here has quite an impact on my browser :)
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
Fwiw, I would make the same argument that ofer did (though I haven’t read the rest of the thread in detail). For me, that argument is an existence proof that shows the following claim: if you know nothing about an impact measure, it is possible that the impact measure disallows all malignant behavior, and yet all of the difficulty is in figuring out how to make it lenient enough.
Now, obviously we know something about AUP, but it’s not obvious to me that we can make AUP lenient enough to do useful things without also allowing malignant behavior.
My present position is that it can seemingly do every task in at least one way, and we should expand the number of ways to line up with our intuitions just to be sure.