Keeping all this in mind, it’s important to remind ourselves that mesaoptimizers are ultimately a form of misgeneralization, a generalization strategy being how you are going to handle novelty in the inputs. … If it’s possible, even easy, to fingerprint the shared initialization that models were trained from using an out-of-distribution token, just by observing the responses, then we should update towards JDP’s plan for mitigating mesaoptimizers working.
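As a rough illustration of what such a fingerprinting check could look like, here is a minimal sketch in Python. The model identifiers, the choice of out-of-distribution prompt, and the KL-divergence comparison are all placeholder assumptions for illustration, not a validated procedure, and the comparison only makes sense if the two checkpoints share a tokenizer/vocabulary:

```python
# Hypothetical sketch: check whether two checkpoints plausibly share an
# initialization by comparing their next-token distributions on an
# out-of-distribution prompt. Model names and the interpretation of the
# resulting KL value are made up for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_distribution(model_name: str, prompt: str) -> torch.Tensor:
    """Return the model's next-token probability distribution for `prompt`."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    return torch.softmax(logits, dim=-1)

# A known glitch-like, out-of-distribution token for some GPT tokenizers.
ood_prompt = " SolidGoldMagikarp"

# Hypothetical model identifiers; substitute whatever checkpoints you want to compare.
p = next_token_distribution("model-finetune-a", ood_prompt)
q = next_token_distribution("model-finetune-b", ood_prompt)

# Low divergence on OOD inputs (relative to a baseline pair of unrelated models)
# would be weak evidence that the two checkpoints share an initialization.
kl = torch.sum(p * (torch.log(p) - torch.log(q))).item()
print(f"KL(p || q) on OOD prompt: {kl:.4f}")
```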
Nitpick about terminology that applies not just to you but to lots of people:
People seem to sometimes use “mesaoptimizers” as shorthand for “misaligned mesaoptimizers.” They sometimes say things like “We haven’t yet got hard empirical evidence that mesaoptimizers are a problem in practice” and “mesaoptimizers are a hypothetical problem that can occur with advanced AI systems.” All of this is misleading, IMO. If you go back to the original paper and look at the definition of a mesaoptimizer, it’s pretty clear that pretty much any AGI built using deep learning will be a mesaoptimizer and moreover ChatGPT using chain of thought is plausibly already a mesaoptimizer. The question is whether they’ll be aligned or not, i.e. whether their mesa-objectives will be the same as the ‘intended’ or ‘natural’ base objective inherent in the reward signal.
Strong agree with the main point; it confused me for a long time why people were saying we had no evidence of mesa-optimizers existing, and it made me think I was getting something very wrong. I disagree with this line though:
ChatGPT using chain of thought is plausibly already a mesaoptimizer.
I think simulacra are better thought of as sub-agents, in the original paper’s terminology, than as mesa-optimizers, and ChatGPT doesn’t seem to be doing anything qualitatively different on this front. The Assistant simulacrum can be seen as doing optimization (depending on your definition of the term), but the fact that jailbreak methods can get the underlying model to adopt different simulacra seems to me to show that it’s still using the simulator mechanism. Moreover, if we get GPT-3-level models that are optimizers at the simulator level, I expect things would look very different.
The critical issue is whether consequentialist mesaoptimizers will arise. If consequentialist mesaoptimizers don’t arise, as in the thread linked here, then much of the safety concern is gone:
https://www.lesswrong.com/posts/firtXAWGdvzXYAh9B/paper-transformers-learn-in-context-by-gradient-descent#pbEciBKsk86xmcgqb
Any agentic AGI built via deep learning will almost by definition be a consequentialist mesaoptimizer (in the broad sense of consequentialism you are talking about, I think). It’ll be performing some sort of internal search to choose actions, while SGD, or whatever the outer training loop is, performs ‘search’ to update its parameters. So, boom, base optimizer and mesa optimizer.
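To make that base/mesa split concrete, here is a toy sketch in PyTorch. It is purely illustrative, with made-up names and a made-up reward: the outer SGD loop is the base optimizer searching over parameters, and the trained model then picks actions by internally searching over candidate actions against its own learned value estimate, which is the mesa-level search.

```python
# Toy illustration: base optimizer = the SGD training loop over parameters;
# mesa-level search = the trained model choosing actions by scoring candidates
# with its own learned value estimate. Everything here is a minimal made-up example.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Learned scoring function over (state, action) pairs."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

def choose_action(value_net: ValueNet, state: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
    """Mesa-level 'search': evaluate candidate actions and pick the best-scoring one."""
    scores = torch.stack([value_net(state, a) for a in candidates])
    return candidates[scores.argmax()]

state_dim, action_dim = 4, 2
value_net = ValueNet(state_dim, action_dim)

# Base optimizer: SGD searching over the network's parameters.
opt = torch.optim.SGD(value_net.parameters(), lr=1e-2)
for _ in range(100):
    state = torch.randn(state_dim)
    action = torch.randn(action_dim)
    reward = -torch.sum((state[:2] - action) ** 2)    # stand-in reward signal
    loss = (value_net(state, action) - reward) ** 2   # fit the value estimate
    opt.zero_grad()
    loss.backward()
    opt.step()

# Mesa-level search: at run time the model itself searches over candidate actions.
state = torch.randn(state_dim)
candidates = torch.randn(16, action_dim)
print("chosen action:", choose_action(value_net, state, candidates))
```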
Quoting LawrenceC from that very thread:
<begin quote>
Well, no, that’s not the definition of optimizer in the mesa-optimization post! Evan gives the following definition of an optimizer:
A system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system
And the following definition of a mesa-optimizer:
Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.
<end quote>
The “mesa” part is pretty trivial. Humans are mesaoptimizers relative to the base optimizer of evolution. If an AGI is an optimizer at all, it’s a mesaoptimizer relative to the process that built it—human R&D industry if nothing else, though given deep learning it’ll probably be gradient descent or RL.
I think the critical comment that I wanted to highlight was nostalgebraist’s comment in that thread.