To recap my position:
I think Intent Verification can’t be relied on for filtering out actions because it will plausibly filter out the “good” actions (actions for useful and safe plans) in all but specific time steps. See my argument here.
I think the agent might be able to dodge some unknown fraction of the long-term penalty with a trick I described here. (I have now added a reply under that comment addressing the arguments of yours that I hadn't previously addressed; sorry for missing them.)