Things I agree with:
1. If humans could give correctly specified reward feedback, it would be a significant handicap to have a human provide approval feedback rather than reward feedback, because that requires the human to compute the consequences of possible plans rather than offloading that work to the agent.
2. If we could give perfect approval feedback, we could also provide perfect reward feedback (at least for a small action space), via your reduction.
3. Myopic training need not lead to myopic cognition (and isn’t particularly likely to for generally intelligent systems).
But I don’t think these counteract what I see as the main argument for myopic training:
While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.
(I’m using “incentivize” here to talk about outer alignment and not inner alignment.)
In other words, the point is that humans are capable of giving approval / myopic feedback (i.e. horizon = 1) with not-terrible incentives, whereas humans don’t seem capable of giving reward feedback (i.e. horizon = infinity) with not-terrible incentives. The main argument for this is that most “simple” reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does unless that’s what the human says is correct. (Also we can just look at the long list of specification gaming examples so far.)
I’ll rephrase your objections and then respond:
Objection 1: This sacrifices competitiveness, because now the burden of predicting what action leads to good long-term effects falls to the human instead of the agent.
Response: Someone has to predict which action leads to good long-term effects, since we can’t wait for 100 years to give feedback to the agent for a single action. In a “default” training setup, we don’t want it to be the agent, because we can’t trust that the agent selects actions based on what we think is “good”. So we either need the human to take on this job (potentially with help from the agent), or we need to figure out some other way to trust that the agent selects “good” actions. Myopia / approval direction takes the first option. We don’t really know of a good way to achieve the second option.
Objection 2: This sacrifices competitiveness, because now the human can’t look at the medium-term consequences of actions before providing feedback.
This doesn’t seem to be true—if you want, you can collect a full trajectory to see the consequences of the actions, and then provide approval feedback on each of the actions individually when computing gradients.
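To make that concrete, here is a minimal sketch of what "collect the full trajectory first, then give per-action approval when computing gradients" could look like. All the names here (TinyPolicy, myopic_update, approve) are mine and purely illustrative, not anything from the original posts:

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Toy tabular softmax policy over discrete states and actions."""
    def __init__(self, n_states: int, n_actions: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_states, n_actions))

    def log_prob(self, state: int, action: int) -> torch.Tensor:
        return torch.log_softmax(self.logits[state], dim=-1)[action]

def myopic_update(policy, optimizer, trajectory, approve):
    """One update: the rollout is already finished, and the supervisor now
    scores each action individually (optionally looking at later steps)."""
    loss = torch.zeros(())
    for i, (state, action) in enumerate(trajectory):
        approval = approve(trajectory, i)  # may inspect the whole trajectory
        # Horizon-1 objective: each action is reinforced only by its own
        # approval score, not by a discounted sum of future feedback.
        loss = loss - approval * policy.log_prob(state, action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage: approve any action other than action 0, judged after the rollout.
policy = TinyPolicy(n_states=4, n_actions=3)
opt = torch.optim.SGD(policy.parameters(), lr=0.1)
myopic_update(policy, opt, trajectory=[(0, 1), (1, 2), (2, 0)],
              approve=lambda traj, i: 1.0 if traj[i][1] != 0 else -1.0)
```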
Objection 3: There’s no difference between approval feedback and reward feedback, since perfect approval feedback can be turned into perfect reward feedback. So you might as well use the perfect reward feedback, since this is more competitive.
I agree that if you take the approval feedback that a human would give, apply this transformation, and then train a non-myopic RL agent on it, that would also not incentivize catastrophic outcomes. But if you start out with approval feedback, why would you want to do this? With approval feedback, the credit assignment problem has already been solved for the agent, whereas with the equivalent reward feedback, you’ve just undone the credit assignment and the agent now has to redo it all over again. (Like, instead of doing Q-learning, which has a non-stationary target, you could just use supervised learning to learn the fixed approval signal, surely this would be more efficient?)
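As a toy illustration of the "fixed target vs. non-stationary target" contrast (my own framing, just to show the mechanics): regressing onto approval labels is plain supervised learning, while recovering values from reward feedback means bootstrapping against a target that moves as the estimate moves.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# (a) Approval feedback: a fixed table of labels, so this is ordinary
#     supervised learning and the regression target never changes.
approval_labels = rng.uniform(size=(n_states, n_actions))
approval_hat = np.zeros((n_states, n_actions))
for _ in range(200):
    approval_hat += 0.5 * (approval_labels - approval_hat)

# (b) Reward feedback: the agent redoes the credit assignment itself, e.g.
#     via Q-learning, whose bootstrapped target max_a' Q(s', a') keeps
#     shifting as Q itself is updated.
reward = rng.uniform(size=(n_states, n_actions))
next_state = rng.integers(n_states, size=(n_states, n_actions))
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    target = reward + gamma * Q[next_state].max(axis=-1)  # non-stationary
    Q += 0.5 * (target - Q)
```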
On the tampering / manipulation points, I think those are special cases of the general point that it’s easier for humans to provide non-catastrophe-incentivizing approval feedback than to provide non-catastrophe-incentivizing reward feedback.
I want to reiterate that I agree with the point that myopic training probably does not lead to myopic cognition (though this depends on what exactly we mean by “myopic cognition”), and I don’t think of that as a major benefit of myopic training.
Typo: I think you mean γ instead of λ
[Objection 2] doesn’t seem to be true—if you want, you can collect a full trajectory to see the consequences of the actions, and then provide approval feedback on each of the actions individually when computing gradients.
It feels like we have two disagreements here. One is whether the thing you describe in this quote is “myopic” training. If you think that the core idea of myopia is that the evaluation of an action isn’t based on its effects, then this is better described as nonmyopic. But if you think that the core idea of myopia is that the agent doesn’t do its own credit assignment, then this is best described as myopic.
If you think, as I interpret you as saying, that the main reason myopia is useful is because it removes the incentive for agents to steer towards incorrectly high-reward states (which I’ll call “manipulative” states), then you should be inclined towards the first definition. Because the approach you described above (of collecting and evaluating a full trajectory before giving feedback) means the agent still has an incentive to do multi-step manipulative plans.
More specifically: if a myopic agent’s actions A_1 to A_N manipulate the supervisor into thinking that the (N+1)th state is really amazing, and the supervisor looks at the full trajectory before assigning approval, then the supervisor will give higher approval to all of the actions A_1 to A_N, and they’ll all be reinforced, which is the same thing as would happen in a nonmyopic setup if the supervisor just gave the Nth action really high reward. In other words, it doesn’t matter if the agent is doing its own credit assignment because the supervisor is basically doing the same credit assignment as the agent would. So if you count the approach you described above as myopic, then myopia doesn’t do the thing you claim it does.
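A toy numeric version of this point (my own illustration, not from either post): whether the supervisor spreads approval across every step of the manipulative plan after seeing the trajectory, or drops one large reward at the end and lets discounting propagate it backwards, every action in the plan ends up with a positive learning signal.

```python
import numpy as np

n, gamma = 5, 0.9  # a 5-step manipulative plan

# Myopic training, but the supervisor evaluates after seeing the whole
# trajectory and has been fooled into thinking the final state is great:
approval_per_action = np.full(n, 1.0)   # every step of the plan approved

# Nonmyopic training: the supervisor just gives a big reward at the end and
# the agent's own credit assignment spreads it backwards.
rewards = np.zeros(n)
rewards[-1] = 10.0
returns = np.array([sum(gamma**k * rewards[t + k] for k in range(n - t))
                    for t in range(n)])

print(approval_per_action)  # [1. 1. 1. 1. 1.]
print(returns)              # [6.561 7.29  8.1   9.   10.  ] -- also all positive
```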
(I guess you could say that something counts as a “small” error if it only affects a few states, and so what I just described is not a small error in the approval function? But it’s a small error in the *process* of generating the approval function, which is the important thing. In general I don’t think counting the size of an error in terms of the number of states affected makes much sense, since you can always arbitrarily change those numbers.)
The second disagreement is about:
Most “simple” reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does unless that’s what the human says is correct.
I am kinda confused about what sort of approval feedback you’re talking about. Suppose we have a simple reward function, which gives the agent more points for collecting more berries. Then the agent has lots of convergent instrumental subgoals. Okay, what about a simple approval function, which approves actions insofar as the supervisor expects them to lead to collecting more berries? Then the agent *also* learns convergent instrumental subgoals, because it learns to take whatever actions lead to collecting more berries (assuming the supervisor is right about that).
I picture you saying that the latter is not very simple, because it needs to make all these predictions about complex dependencies on future states. But that’s what you need in any approval function that you want to use to train a competent agent. It seems like you’re only picturing myopic feedback that doesn’t actually solve the problem of figuring out which actions lead to which states—but as soon as you do, you get the same issues. It is no virtue of approval functions that most of them are safe, if none of the safe ones specify the behaviour we actually want from AIs.
First disagreement:
the main reason myopia is useful is because it removes the incentive for agents to steer towards incorrectly high-reward states (which I’ll call “manipulative” states)
… There are a lot of ways that reward functions go wrong besides manipulation. I agree that if what you’re worried about is manipulation in N actions, then you shouldn’t let the trajectory go on for N actions before evaluating.
Consider the boat racing example. I’m saying that we wouldn’t have had the boat going around in circles if we had used approval feedback, because the human wouldn’t have approved of the actions where the boat goes around in a circle.
(You might argue that if a human had been giving the reward signal, instead of having an automated reward function, that also would have avoided the bad behavior. I basically agree with that, but then my point would just be that humans are better at providing approval feedback than reward feedback—we just aren’t very used to thinking in terms of “rewards”. See the COACH paper.)
Second disagreement:
Okay, what about a simple approval function, which approves actions insofar as the supervisor expects them to lead to collecting more berries? Then the agent *also* learns convergent instrumental subgoals, because it learns to take whatever actions lead to collecting more berries (assuming the supervisor is right about that).
When the supervisor sees the agent trying to take over the world in order to collect more berries, the supervisor disapproves, and the agent stops taking that action. (I suspect this ends up being the same disagreement as the first one, where you’d say “but the supervisor can do that with rewards too”, and I say “sure, but humans are better at giving approval feedback than reward feedback”.)
Again, I do agree with you that myopic training is not particularly likely to lead to myopic cognition. It seems to me like this is creeping into your arguments somewhere, but I may be wrong about that.
There are a lot of ways that reward functions go wrong besides manipulation.
I’m calling them manipulative states because if the human notices that the reward function has gone wrong, they’ll just change the reward they’re giving. So there must be something that stops them from noticing this. But maybe it’s a misleading term, and this isn’t an important point, so for now I’ll use “incorrectly rewarded states” instead.
I agree that if what you’re worried about is manipulation in N actions, then you shouldn’t let the trajectory go on for N actions before evaluating.
This isn’t quite my argument. My two arguments are:
1. IF an important reason you care about myopia is to prevent agents from making N-step plans to get to incorrectly rewarded states, THEN you can’t defend the competitiveness of myopia by saying that we’ll just look at the whole trajectory (as you did in your original reply).
2. However, even myopically cutting off the trajectory before the agent takes N actions is insufficient to prevent the agent from making N-step plans to get to incorrectly rewarded states.
Sure, but humans are better at giving approval feedback than reward feedback. … we just aren’t very used to thinking in terms of “rewards”.
Has this argument been written up anywhere? I think I kinda get what you mean by “better”, but even if that’s true, I don’t know how to think about what the implications are. Also, I think it’s false if we condition on the myopic agents actually being competitive.
My guess is that this disagreement is based on you thinking primarily about tasks where it’s clear what we want the agent to do, and we just need to push it in that direction (like the ones discussed in the COACH paper). I agree that approval feedback is much more natural for this use case. But when I’m talking about competitive AGI, I’m talking about agents that can figure out novel approaches and strategies. Coming up with reward feedback that works for that is much easier than coming up with workable approval feedback, because we just don’t know the values of different actions. If we do manage to train competitive myopic agents, I expect that the way we calculate the approval function is by looking at the action, predicting what outcomes it will lead to, and evaluating how good those outcomes are—which is basically just mentally calculating a reward function and converting it to a value function. But then we could just skip the “predicting” bit and actually look at the outcomes instead—i.e. making it nonmyopic.
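One way to write down what “mentally calculating a reward function and converting it to a value function” amounts to (my notation, purely illustrative): the supervisor’s approval of an action is roughly a Q-value computed under their own predictive model $\hat{P}$, reward estimate $\hat{R}$, and value estimate $\hat{V}$:

$$\text{Approval}(s, a) \;\approx\; \mathbb{E}_{s' \sim \hat{P}(\cdot \mid s, a)}\left[\hat{R}(s, a, s') + \gamma \hat{V}(s')\right]$$

Skipping the “predicting” bit then just means replacing the model $\hat{P}$ with the actually observed next states, which is the nonmyopic setup.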
If you have ideas for how we might supervise complex tasks like Go to a superhuman level, without assigning values to outcomes in a way that falls into the same traps as reward-based learning, or without benefiting greatly from looking at what the actual consequences are, then that would constitute a compelling argument against my position. E.g. maybe we can figure out what “good cognitive steps” are, and then reward the agent for doing those without bothering to figure out what outcomes good cognitive steps will lead to. That seems very hard, but it’s the sort of thing I think you need to defend if you’re going to defend myopia. (I expect Debate about which actions to take, for instance, to benefit greatly from the judge being able to refer to later outcomes of actions).
Another way of making this argument: humans very much think in terms of outcomes, and how good those outcomes are, by default. I agree that we are bad at giving step-by-step dense rewards. But the whole point of a reward function is that you don’t need to do the step-by-step thing: you can mostly just focus on rewarding good outcomes, and the agent does the credit assignment itself. I picture you arguing that we’ll need shaped rewards to help the agent explore, but a) we can get rid of those shaped rewards as soon as the agent has gotten off the ground, so that they don’t affect long-term incentives, and b) even shaped rewards can still be quite outcome-focused (and therefore natural to think about), e.g. +1 for killing Roshan in Dota 2.
In terms of catching and correcting mistakes in the specification, I agree that myopia forces the supervisor to keep watching the agent, which means that the supervisor is more likely to notice if they’ve accidentally incentivised the agent to do something bad. But whatever bad behaviour the supervisor is able to notice during myopic training, they could also notice during nonmyopic training if they were watching carefully. So perhaps myopia is useful as a commitment device to force supervisors to pay attention, but given the huge cost of calculating the likely outcomes of all actions, I doubt anyone will want to use it that way.
I can’t speak for everyone else, but when I talk about myopic training vs. regular RL, I’m imagining that they have the same information available when feedback is given. If you would wait till the end of the trajectory before giving rewards in regular RL, then you would wait till the end of the trajectory before giving approval in myopic training.
If you have ideas for how we might supervise complex tasks like Go to a superhuman level, without assigning values to outcomes in a way that falls into the same traps as reward-based learning
… Iterated amplification? Debate?
The point of these methods is to have an overseer that is more powerful than the agent being trained, so that you never have to achieve super-overseer performance (but you do achieve superhuman performance). In debate, you can think of judge + agent 1 as the overseer for agent 2, and judge + agent 2 as the overseer for agent 1.
(You don’t use the overseer itself as your ML system, because the overseer is slow while the agent is fast.)
I agree that if you’re hoping to get an agent that is more powerful than its overseer, then you’re counting on some form of generalization / transfer, and you shouldn’t expect myopic training to be much better (if at all) than regular RL at getting the “right” generalization.
Approval-directed agents. Note a counterargument in Against Mimicry (technically it argues against imitation, but I think it also applies to approval).
But when I’m talking about competitive AGI, I’m talking about agents that can figure out novel approaches and strategies.
See above about being superhuman but sub-overseer. (Note that the agents can still come up with novel approaches and strategies that the overseer would have come up with, even if the overseer did not actually come up with them.)
humans very much think in terms of outcomes, and how good those outcomes are, by default.
… This does not match my experience at all. Most of the time it seems to me that we’re executing habits and heuristics that we’ve learned over time, and only when we need to think about something novel do we start trying to predict consequences and rate how good they are in order to come to a conclusion. (E.g. most people intuitively reject the notion that we should kill one person for organs to save 5 lives. I don’t think they are usually predicting outcomes and then figuring out whether those outcomes are good or not.)
I picture you arguing that we’ll need shaped rewards to help the agent explore,
I mean, yes, but I don’t think it’s particularly relevant to this disagreement.
TL;DR: I think our main disagreement is whether humans can give approval feedback in any way other than estimating how good the consequences of the action are (both observed and predicted in the future). I agree that if we are trying to have an overseer train a more intelligent agent, it seems likely that you’d have to focus on how good the consequences are. However, I think we will plausibly have the overseer be more intelligent than the agent, and so I expect that the overseer can provide feedback in other ways as well.
I broadly agree about what our main disagreement is. Note that I’ve been mainly considering the case where the supervisor is more intelligent than the agent as well. The actual resolution of this will depend on what’s really going on during amplification, which is a bigger topic that I’ll need to think about more.
On the side disagreement (of whether looking at future states before evaluation counts as “myopic”) I think I was confused when I was discussing it above and in the original article, which made my position a bit of a mess. Sorry about that; I’ve added a clarifying note at the top of the post, and edited the post to reflect what I actually meant. My actual response to this:
Objection 2: This sacrifices competitiveness, because now the human can’t look at the medium-term consequences of actions before providing feedback.
… is that in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training. However, in nonmyopic training this doesn’t matter very much, because we can assign high or low reward to some later state in the trajectory, which then influences whether the agent learns to do the original action more or less. We can’t do this in myopic training in the current paradigm, which is where the competitiveness sacrifice comes from.
E.g. my agent sends an email. Is it good or bad? In myopic training, you need to figure this out now. In nonmyopic training, you can shrug, give it 0 reward now, and then assign high or low reward to the agent when it gets a response that makes it clearer how good the email was. Then because the agent does credit assignment automatically, actions are in effect evaluated based on their medium-term consequences, although the supervisor never actually looks at future states during evaluations.
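In standard RL notation (nothing here is specific to either post), the return credited to the send-email action at time $t$ is

$$G_t = \sum_{k \ge 0} \gamma^k \, r_{t+k},$$

so a reward handed out only when the reply arrives at step $t+k$ still reaches the original action through the agent’s own credit assignment, even though the supervisor never revisits earlier states when evaluating.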
This is consistent with your position: “When I talk about myopic training vs. regular RL, I’m imagining that they have the same information available when feedback is given”. However, it also raises the question of why we can’t just wait until the end of the trajectory to give myopic feedback anyway. In my edits I’ve called this “semi-myopia”. This wouldn’t be as useful for nonmyopia, but I do agree that semi-myopia alleviates some competitiveness concerns, although at the cost of being more open to manipulation. The exact tradeoff here will depend on disagreement 1.
Is that in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training.
I mean, this is true in the sense that the Gym interface returns a reward with every transition, but the vast majority of deep RL algorithms don’t do anything with those rewards until the trajectory is done (or, in the case of very long trajectories, until you’ve collected a lot of experience from this trajectory). So you could just as easily evaluate the rewards then, and the algorithms wouldn’t change at all (though their implementation would).
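A schematic sketch of that point (generic Gym-style loop with my own helper names; the four-value step return is the classic Gym API and newer versions differ): the environment hands back a reward on every transition, but a typical on-policy algorithm only consumes those numbers once the rollout is finished, so the evaluation could just as well happen at that point.

```python
def compute_returns(rewards, gamma):
    """Discounted returns, computed only after the rollout is complete."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

def collect_rollout(env, policy, horizon, gamma=0.99):
    obs = env.reset()
    transitions = []
    for _ in range(horizon):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)  # classic Gym API
        transitions.append((obs, action, reward))        # reward stored, not yet used
        obs = next_obs
        if done:
            break
    # Nothing has been *done* with the rewards until here: returns /
    # advantages are computed over the finished rollout, so a human could
    # equally well fill in rewards (or approvals) at this point without
    # changing the rest of the algorithm.
    returns = compute_returns([r for _, _, r in transitions], gamma)
    return transitions, returns
```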
Most “simple” reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does unless that’s what the human says is correct.
I am kinda confused about what sort of approval feedback you’re talking about. Suppose we have a simple reward function, which gives the agent more points for collecting more berries. Then the agent has lots of convergent instrumental subgoals. Okay, what about a simple approval function, which approves actions insofar as the supervisor expects them to lead to collecting more berries? Then the agent *also* learns convergent instrumental subgoals, because it learns to take whatever actions lead to collecting more berries (assuming the supervisor is right about that).
I think there are secretly two notions of “leads to convergent instrumental subgoals” here. There’s the outer-alignment notion: “do any of the optimal policies for this reward function pursue convergent instrumental subgoals?”, and also the learned-policy notion: “given a fixed learning setup, is it probable that training on this reward signal induces learned policies which pursue convergent instrumental subgoals?”.
That said,
I think that ~any non-trivial reward signal we know how to train policy networks on right now, plausibly leads to learned convergent instrumental goals. I think that this may include myopically trained RL agents, but I’m not really sure yet.
I think that the optimal policies for an approval reward signal are unlikely to have convergent instrumental subgoals. But this is weird, because Optimal Policies Tend to Seek Power gives conditions under which, well… optimal policies tend to seek power: given that we have a “neutral” (IID) prior over reward functions and that the agent acts optimally, we should expect to see power-seeking behavior in such-and-such situations.
This generally becomes more pronounced as the discount rate increases towards 1, which is a hint that there’s something going on with “myopic cognition” in MDPs, and that discount rates do matter in some sense.
And when you stop thinking about IID reward functions, you realize that we still don’t know how to write down non-myopic, non-trivial reward functions that wouldn’t lead to doom if maximized AIXI-style. We don’t know how to do that, AFAICT, though not for lack of trying.
But with respect to the kind of approval function we’d likely implement, optimally arg-max’ing the approval function doesn’t seem to induce the same kinds of subgoals. AFAICT, it might lead to short-term manipulative behavior, but not long-term power-seeking.
Agents won’t be acting optimally, but this distinction hints that we know how to implement different kinds of things via approval than we do via rewards. There’s something different about the respective reward function sets. I think this difference is interesting.
what about a simple approval function, which approves actions insofar as the supervisor expects them to lead to collecting more berries? Then the agent *also* learns convergent instrumental subgoals, because it learns to take whatever actions lead to collecting more berries (assuming the supervisor is right about that).
Sure. But consider maximizing “TurnTrout has a fun day”-reward (for some imperfect grounding of that concept), and maximizing my approval of actions based on whether I think they’ll lead to a fun adventure.
The former takes over the world, and I don’t have a very fun day. But what about the latter?
To some extent, I won’t approve of actions that cause the agent to break, so there will be at least some instrumental subgoal pursuit by the agent. But for a successful power-seeking policy to be optimal, there is a conjunctive burden: we aren’t maximizing long-term discounted reward anymore, and the actions are evaluated locally, independently of any explicit global reward signal.
Many quasi-independently predicted approval judgments must cohere into a dangerous policy. It’s quite possible that this happens, but I’m not very convinced of that right now.
“Many quasi-independently predicted approval judgments must cohere into a dangerous policy.”
I described how this happens in the section on manipulating humans. In short, there is no “quasi-independence” because you are still evaluating every action based on whether you think it’ll lead to a fun adventure. This is exactly analogous to why the reward function you described takes over the world.
I described how this happens in the section on manipulating humans
Yes, but I don’t understand your case for “finding chains of manipulative inputs which increase myopic reward” entailing power-seeking? Why would that behavior, in particular, lead to the highest myopic reward? If we didn’t already know about power-seeking reward maximizers, why would we promote this hypothesis to attention?
This is exactly analogous to why the reward function you described takes over the world.
I disagree? Those objectives seem qualitatively dissimilar.
“Why would that behavior, in particular, lead to the highest myopic reward?”
I addressed this in my original comment: “More specifically: if a myopic agent’s actions A_1 to A_N manipulate the supervisor into thinking that the (N+1)th state is really amazing, and the supervisor looks at the full trajectory before assigning approval, then the supervisor will give higher approval to all of the actions A_1 to A_N, and they’ll all be reinforced, which is the same thing as would happen in a nonmyopic setup if the supervisor just gave the Nth action really high reward.”
That’s not what I’m asking. Why would that lead to power-seeking? You seem to be identifying “manipulation” with “power-seeking”; power-seeking implies manipulation, but the converse isn’t always true.
Why do nonmyopic agents end up power-seeking? Because the supervisor rates some states highly, and so the agent is incentivised to gain power in order to reach those states.
Why do myopic agents end up power-seeking? Because to train a competitive myopic agent, the supervisor will need to calculate how much approval they assign to actions based on how much those actions contribute to reaching valuable states. So the agent will be rewarded for taking actions which acquire it more power, since the supervisor will predict that those contribute to reaching valuable states.
(You might argue that, if the supervisor doesn’t want the agent to be power-seeking, they’ll only approve of actions which gain the agent more power in specified ways. But equivalently, a reward function can also penalise unauthorised power-gaining, given that the supervisor is equally able to notice it in both cases.)
I now think that I was thinking of myopic cognition, whereas you are talking about myopic training. Oops! This is obvious in hindsight (and now I’m wondering how I missed it), but maybe you could edit the post to draw a clear contrast?
Ah, makes sense. There’s already a paragraph on this (starting “I should note that so far”), but I’ll edit to mention it earlier.
This is likely the crux of our disagreement, but I don’t have time to reply ATM. Hope to return to this.
If the latter is implemented on Laplace’s Demon, and simply looks through all actions and picks the one with the highest approval, then I think it depends on how you’ve defined “approval.” If maximum approval could be bad (e.g. if approval is unbounded, or if it would take a lot of work to find a benign context where you always give it maximum approval), then this search process is searching for things that look like taking over the world.
But as we move away from Laplace’s demon, then I agree that realistic solutions look more like only manipulating TurnTrout and his immediate spatiotemporal surroundings.
While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.
I think this is a really important point, thanks.
Objection 3: There’s no difference between approval feedback and myopic feedback, since perfect approval feedback can be turned into perfect reward feedback. So you might as well use the perfect reward feedback, since this is more competitive.
Did you mean “There’s no difference between approval feedback and reward feedback”?
Yes, fixed, thanks.
The main argument for this is that most “simple” reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does unless that’s what the human says is correct. (Also we can just look at the long list of specification gaming examples so far.)
+1, I was about to write an argument to this effect.
Also, you can’t always rationalize M as state-based reward maximization, but even if you could, that doesn’t tell you much. Taken on its own, the argument about M-equivalence proves too much, because it would imply random policies have convergent instrumental subgoals:
Let M(s,a) be uniformly randomly drawn from the unit interval, the first time it’s called. Have the agent choose the argmax for its policy. This can be rationalized as some R(s,a,s′) maximization, so it’s probably power-seeking.
This doesn’t hold, obviously. Any argument about approval maximization should use specific facts about how approval is computed.
Put otherwise, specifying an actual reward function seems to be a good way to get a catastrophic maximizer, but arbitrary action-scoring rules don’t seem to have this property, as Rohin said above. Most reward functions have power-seeking optimal policies, and every policy is optimal for some reward function, but most policies aren’t power-seeking.
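For concreteness, a minimal implementation of the memoized random scorer M described a couple of paragraphs up (my own code for that construction):

```python
import random

class RandomScorer:
    """M(s, a): drawn uniformly from [0, 1] the first time a pair is
    queried, then memoized and fixed forever after."""
    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._table = {}

    def __call__(self, state, action):
        if (state, action) not in self._table:
            self._table[(state, action)] = self._rng.random()
        return self._table[(state, action)]

def act(scorer, state, actions):
    """The agent simply argmaxes M over the available actions."""
    return max(actions, key=lambda a: scorer(state, a))

# Such a policy is optimal for some reward function R(s, a, s'), but it is
# just a lookup table of arbitrary preferences -- nothing about it pursues
# power or any other convergent instrumental subgoal.
M = RandomScorer()
print(act(M, state="s0", actions=["up", "down", "left", "right"]))
```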