Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it’s unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in its weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference-time compute and memory/context is available to the agent.
So I’m imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function
and then I’m hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make “I should get high reward” the terminal goal. (I could imagine this being false, though, depending on the details of how terminal and instrumental goals are implemented.)
I could also imagine something more like:
Misaligned goal --> I should behave in aligned ways --> Aligned behavior
and then the simplicity bias pushes towards alignment. But if there are outer alignment failures then this incurs some additional complexity compared with the first option.
Or a third, perhaps more realistic option is that the misaligned goal leads to two separate drives in the agent: “I should get high reward” and “I should behave in aligned ways”, and that the question of which ends up dominating when they clash will be determined by how the agent systematizes multiple goals into a single coherent strategy (I’ll have a post on that topic up soon).
Why would the agent reason like this?

Because of standard deceptive alignment reasons (e.g. “I should make sure gradient descent doesn’t change my goal; I should make sure humans continue to trust me”).
I think you don’t have to reason like that to avoid getting changed by SGD. Suppose I’m being updated by PPO, with reinforcement events around navigating to see dogs. To preserve my current shards, I don’t need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means “treading water” and seeing dogs sometimes in situations similar to historical dog-seeing events.
Maybe this is compatible with what you had in mind! It’s just not something that I think of as “high reward.”
And maybe there’s some self-fulfilling prophecy where we trust models which get high reward, and therefore they want to get high reward to earn our trust… but that feels quite contingent to me.
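To make the “treading water” point concrete, here is a minimal sketch, assuming a standard GAE-style advantage computation of the kind PPO uses; the dog-seeing numbers and horizon are made up for illustration and are not from any setup described above. The point it shows: behavior whose reward stream matches what the value head already predicts gets near-zero advantage.

```python
import numpy as np

# Minimal sketch of the "treading water" idea under PPO-style updates.
# All numbers are hypothetical; this illustrates the mechanism, not any
# particular training setup discussed above.

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    `values` has one extra entry: the critic's bootstrap value for the
    state after the last reward.
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much better did this step go than the critic expected?
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

gamma = 0.99
v = 10.0                                  # critic expects dog-sightings to continue at the usual rate
values = np.full(6, v)                    # 5 states plus a bootstrap value
r_treading = np.full(5, (1 - gamma) * v)  # reward stream consistent with that expectation
r_extra = r_treading + 0.5                # proactively seeking out extra dogs

print(gae_advantages(r_treading, values, gamma))  # ~0 everywhere
print(gae_advantages(r_extra, values, gamma))     # positive at every step
```

Since the PPO surrogate objective roughly scales each action’s log-probability gradient by its advantage, near-zero advantages mean near-zero updates to whatever computations produced those actions, which is one way of cashing out “behaving in conformance with the advantage function implied by my value head.”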
To preserve my current shards, I don’t need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means “treading water” and seeing dogs sometimes in situations similar to historical dog-seeing events.
I think this depends sensitively on whether the “actor” and the “critic” in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the critic will most likely accurately estimate that “treading water” is in fact a negative-advantage action (unless there’s some sort of acausal coordination going on). Or they could be two copies of the same model, in which case the critic’s responses will depend on whether its goals are indexical or not (if they are, they’re different from the actor’s goals; if not, they’re the same) and on how easily it can coordinate with the actor. Or they could be two heads which share activations, in which case we can plausibly just think of the critic and the actor as two types of output produced by a single coherent agent. But then the critic doesn’t need to produce a value function that’s consistent with historical events, because an actor and a critic that are working together could gradient-hack their way into all sorts of weird equilibria.
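Since the distinction between these three setups may not be familiar, here is a minimal sketch of what they could look like as PyTorch modules; the layer sizes and exact architectures are made up for illustration, and real systems would be far larger.

```python
import copy
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 16, 4, 64  # hypothetical sizes

def trunk():
    return nn.Sequential(nn.Linear(OBS_DIM, HIDDEN), nn.Tanh(),
                         nn.Linear(HIDDEN, HIDDEN), nn.Tanh())

# (a) Two separate models: independently initialized weights. Absent some
# coordination, the critic just reports its honest value estimates.
actor_a = nn.Sequential(trunk(), nn.Linear(HIDDEN, ACT_DIM))
critic_a = nn.Sequential(trunk(), nn.Linear(HIDDEN, 1))

# (b) Two copies of the same model: identical starting weights, then trained
# on different objectives (policy loss vs. value loss).
base = trunk()
actor_b = nn.Sequential(base, nn.Linear(HIDDEN, ACT_DIM))
critic_b = nn.Sequential(copy.deepcopy(base), nn.Linear(HIDDEN, 1))

# (c) Two heads sharing activations: one trunk, one forward pass, two outputs.
class SharedActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = trunk()
        self.policy_head = nn.Linear(HIDDEN, ACT_DIM)
        self.value_head = nn.Linear(HIDDEN, 1)

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy_head(h), self.value_head(h)
```

In setup (c) the value estimates and the action choices are computed from the same activations, which is why it is natural to treat them as one agent; in setup (a) they are not, which is why the critic’s honest estimates are the default absent coordination.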
Misaligned goal --> I should get high reward --> Behavior aligned with reward function
The shortest description of this thought doesn’t include “I should get high reward” because that’s already implied by having a misaligned goal and planning with it.
In contrast, having only the goal “I should get high reward” may add description length, as Johannes said. If so, the misaligned goal could well be as simple as, or simpler than, the high-reward goal.
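To spell out the description-length comparison being made here (a rough sketch; treating a policy’s description length as an additive sum of a goal component and a shared planning component is itself a simplifying assumption): write $K(\cdot)$ for description length in the weights, $G_{\text{mis}}$ for the misaligned goal, and $R_{\text{ptr}}$ for a pointer to the outer reward function.

$$
K(\text{deceptive policy}) \approx K(G_{\text{mis}}) + K(\text{planner}),
\qquad
K(\text{reward-terminal policy}) \approx K(R_{\text{ptr}}) + K(\text{planner}).
$$

Since “I should get high reward” falls out of planning with $G_{\text{mis}}$ rather than being stored separately, the planner term is shared, and the comparison reduces to $K(G_{\text{mis}})$ versus $K(R_{\text{ptr}})$. The hypothesis behind the first reasoning chain is that deleting the misaligned goal saves complexity; the point here (and in the top comment) is that a sufficiently simple misaligned goal can make $K(G_{\text{mis}}) \le K(R_{\text{ptr}})$, in which case the simplicity argument does not favor reward as the terminal goal.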