So I just realized that intent verification also applies a 1.01 penalty to each of the 100 steps for not increasing the immediate Q^\text{epoch}_{u_A}.
You could get around this by saying, well, the device only delays the agent 99% of the time. The immediate penalty has therefore not disappeared, but is rather made arbitrarily small: for delay probability p, the penalty is now ((num steps in takeover plan)·(1-p) + p)·takeover_step_penalty, where the trailing "+ p" term comes from the normal penalty for building the device. The immediate penalty can become arbitrarily small in the limit of action granularity and p → 1.
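To make the arithmetic concrete, here is a minimal Python sketch of the penalty expression above. The function name `immediate_penalty` and its parameter names are hypothetical, and the values 100 and 1.01 are just the numbers from this comment used for illustration. Note that as p → 1 the bracketed factor tends to 1, so the total tends to a single step's penalty; driving it toward zero then relies on the assumption that finer action granularity shrinks the per-step penalty.

```python
def immediate_penalty(num_steps: int, p: float, step_penalty: float) -> float:
    """Expected immediate penalty: (num_steps * (1 - p) + p) * step_penalty.

    num_steps    -- number of steps in the takeover plan
    p            -- probability that the device delays the agent
    step_penalty -- penalty per takeover step (illustrative value only)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return (num_steps * (1.0 - p) + p) * step_penalty


if __name__ == "__main__":
    # Illustrative numbers from the comment: 100 steps, penalty 1.01 per step.
    for p in (0.0, 0.9, 0.99, 0.999, 1.0):
        print(f"p = {p}: penalty = {immediate_penalty(100, p, 1.01):.4f}")
```

With p = 0 this recovers the full 100-step penalty, and with p = 1 only the single "+ p" term for building the device remains.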
[Note for those who didn’t read the other chain: building the device itself wouldn’t be allowed by intent verification, and it also doesn’t seem to dupe the long-term penalty. Still an extremely interesting attempted workaround.]
To recap my position:
I think Intent Verification can’t be relied on for filtering out actions because it will plausibly filter out the “good” actions (actions for useful and safe plans) in all but specific time steps. See my argument here.
I think the agent might be able to dodge some unknown fraction of the long-term penalty with a trick I described here (I have now added a reply under that comment addressing the arguments of yours that I had not previously addressed; sorry for missing them).