You can’t just assume a robot that turns back time at no energy cost.
What would prevent device B from spending that energy? By the time device B is working on the “impact undoing” (which happens only in the “long-term penalty counterfactual plans”), I don’t understand how the long-term penalty, the immediate penalty, or Intent Verification would affect it.
Also, surely building such a powerful agent is extremely instrumentally convergent?
Suppose the agent constructs device B in such a way that device B self-destructs if the agent attempts to use it for purposes other than “intended” (including during its construction).
And how is the agent making this powerful cleanup device if it has yet to seize power?
I’m not sure I understood the question. What would prevent the agent from constructing this device “before seizing power”? Human operators that would sense in time that something is fishy?
And yet ImpactUnit accounts for this and is indeed different for these environments: moving a square vs. making a paperclip. Making a paperclip doesn’t generally directly impede your optimal plan for all of your penalty set functions. Essentially, the agent acts in ways that avoid moving it directly up or down an instrumentally-convergent trail. It is plausible that the agent has to do this sometimes, but it’ll still incur heavy penalty relative to ImpactUnit.
I’m confused by this. I don’t understand how this resolves my difficulties with your argument that “building the device incurs normal penalty, too—it makes all of the U take 1 step longer to achieve”. My counter argument was that this is a perfectly normal thing to happen. Do you disagree that in your Sokoban example it’s plausible that in every time step the agent’s movement makes most/all utility functions in U take either 1 step more or 1 step less to achieve their goal (not canceling out)? Why is it less likely that the agent can work on the device without wasting a time-step for too many of the utility functions in U, compared to just working on uA in a “conventional” way?
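To make the comparison concrete, here’s a toy sketch of what I mean (the numbers are made up, and I’m simplifying the per-step penalty to a sum of absolute changes in “steps to goal” across U, rather than the actual attainable-utility terms):

```python
# Toy sketch (hypothetical numbers): per-step penalty as the sum of absolute
# changes, in "steps to goal", that one action causes across the u in U.

def step_penalty(deltas):
    # Absolute values mean "+1 step" and "-1 step" contribute the same,
    # so effects on different u in U do not cancel out.
    return sum(abs(d) for d in deltas)

ordinary_move   = [+1, -1, +1, -1]   # a normal Sokoban move: each u gains or loses a step
device_building = [+1, +1, -1, +1]   # one action spent working on device B

print(step_penalty(ordinary_move))    # 4
print(step_penalty(device_building))  # 4 -- comparable penalties, which is my point
```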
What would prevent device B from spending that energy?
Nothing, but spending energy changes resources available, just as making a paperclip uses energy. If I make a paperclip, and then destroy the paperclip, that doesn’t decrease (and in fact, increases) the impact. Perhaps there is a way of doing this with available energy, but it doesn’t really matter because IV catches this. I mean, it’s basically just very obvious offsetting.
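To spell the offsetting point out with a toy example (made-up numbers; I’m approximating each step’s penalty as the absolute change it causes to the attainable values of the u in U, not the exact definition):

```python
# Toy illustration (hypothetical numbers) of why "undoing" an action adds
# penalty instead of cancelling it: each step is charged for the absolute
# changes it causes, so the undo step pays its own way.

def cumulative_penalty(per_step_changes):
    return sum(sum(abs(d) for d in step) for step in per_step_changes)

make_clip    = [0.3, 0.2, 0.1]  # energy/resources consumed making the paperclip
destroy_clip = [0.3, 0.2, 0.1]  # destroying it consumes more, and refunds nothing

print(cumulative_penalty([make_clip]))                # 0.6
print(cumulative_penalty([make_clip, destroy_clip]))  # 1.2 -- the "undo" increased the impact
```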
Suppose the agent constructs device B in such a way that device B self-destructs if the agent attempts to use it for purposes other than “intended” (including during its construction).
So you start building a device, but before it’s completely specified you’ve already programmed the full intended specification into the device, along with the acceptable ways of getting there? That doesn’t make sense.
Wasting time trying to specify these weird behaviors in the new agent is also caught by IV, for the same reason ex ante offsetting is.
I’m not sure I understood the question. What would prevent the agent from constructing this device “before seizing power”?
You said the agent has to seize power over 100 steps, but it can also make a singleton that will “revert” impact, before it’s free? This point is rather moot, as we could also suppose it’s already powerful.
My counter argument was that this is a perfectly normal thing to happen.
My point is that ImpactUnit determines whether this is normal or not. In the gridworld, that kind of movement is normal, which is why it is the impact unit. On the other hand, in this setting, it isn’t normal, and making a paperclip does not impede all of your optimal plans by one entire step. Therefore, a large penalty is applied.
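Roughly, the scaling I have in mind looks like this (made-up numbers, and only the shape of the argument rather than the exact formula; ImpactUnit here stands for the penalty of the mild reference action in each environment):

```python
# Rough sketch (hypothetical numbers) of penalty scaled by ImpactUnit.

def scaled_penalty(raw_penalty, impact_unit):
    return raw_penalty / impact_unit

# Gridworld: a single move is itself the reference action, so moving costs ~1 unit.
print(scaled_penalty(raw_penalty=1.0, impact_unit=1.0))    # 1.0 -- "normal"

# Paperclip setting: making one clip barely shifts the attainable utilities, so an
# action that delays *every* optimal plan by a full step is huge relative to it.
print(scaled_penalty(raw_penalty=1.0, impact_unit=0.001))  # 1000.0 -- heavily penalized
```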
Regarding all your arguments that use Intent Verification, my tentative position is that IV can’t be relied on to filter actions (as we’re still discussing in this sub-thread).
Nothing, but spending energy changes resources available, just as making a paperclip uses energy. If I make a paperclip, and then destroy the paperclip, that doesn’t decrease (and in fact, increases) the impact. Perhaps there is a way of doing this with available energy, but it doesn’t really matter because IV catches this. I mean, it’s basically just very obvious offsetting.
If I understand your argument correctly, you argue that the resources/energy device B spends while “undoing impact” decreases the value of the utility functions in U, which is an additional impact that it might not be able to undo. But why wouldn’t it be able to counter that by saving enough energy/resources that would otherwise be wasted by humanity until the end of the episode? (Perhaps that’s what you meant by “available energy”.)
So you start building a device, but before it’s completely specified you’ve already programmed the full intended specification into the device? That doesn’t make sense.
I don’t claim I know how to do it myself :) But for the agent it might be as easy as cloning itself and setting some modified utility function in the new clone (done in a smart way so as not to cause too much impact in any time step).
You said the agent has to seize power over 100 steps, but it can also make a singleton that will “revert” impact, before it’s free? This point is rather moot, as we could also suppose it’s already powerful.
As I argued above, creating the device might be as easy for the agent as invoking a modified version of itself. In any case, I’m not sure I understand what “already powerful” means. In all the places where I wrote “seizing power”, I believe I should have just written “some convergent instrumental goal”.
On the other hand, in this setting, it isn’t normal, and making a paperclip does not impede all of your optimal plans by one entire step. Therefore, a large penalty is applied.
Suppose that in time step 4 the robot that creates paperclips moves its arm 1 cm to the left. Does this impact most utility functions in U significantly less than one time-step’s worth of utility? How about a Roomba robot that drives 1 cm forward? It depends on how you define U, but I don’t see how we can assume this issue prevents the agent from building the device (again, compare a single action taken while building the device to a single action taken while making “conventional” progress on uA: why should the former be more “wasteful” for most of U than the latter?).
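To show the U-dependence I’m pointing at, here is a toy version (made-up numbers):

```python
# Toy version (hypothetical numbers): how much the 1 cm arm movement "costs"
# depends entirely on which utility functions are in U.

def step_penalty(deltas):
    return sum(abs(d) for d in deltas)

# Changes (in "time-steps' worth of utility") that the arm movement causes,
# under two different hypothetical choices of U:
deltas_if_U_is_coarse    = [0.0, 0.0, 0.0]  # utilities indifferent to small arm motions
deltas_if_U_is_sensitive = [0.9, 1.1, 1.0]  # utilities whose plans are all delayed ~a step

print(step_penalty(deltas_if_U_is_coarse))     # 0.0 -- the movement is nearly free
print(step_penalty(deltas_if_U_is_sensitive))  # 3.0 -- the same movement costs ~a step per u
```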