I meant that claim to apply to “realistic” tasks (which I don’t yet know how to define).
Machine learning seems hard to do without search, if that counts as a “realistic” task. :)
I wonder if you can say something about what your motivation is to talk about this, i.e., are there larger implications if “just heuristics” is enough for arbitrary levels of performance on “realistic” tasks?
Machine learning seems hard to do without search, if that counts as a “realistic” task. :)
Humans and systems produced by meta learning both do reasonably well at learning, and don’t do “search” (depending on how loose you are with your definition of “search”).
I wonder if you can say something about what your motivation is to talk about this, i.e., are there larger implications if “just heuristics” is enough for arbitrary levels of performance on “realistic” tasks?
It’s plausible to me that for tasks that we actually train on, we end up creating systems that are like mesa optimizers in the sense that they have broad capabilities that they can use on relatively new domains that they haven’t had much experience in before, but nonetheless, because they aren’t made up of two clean parts (mesa objective + capabilities), there isn’t a single obvious mesa objective that the AI system is optimizing for off distribution. I’m not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.
Fwiw, on the original point, even standard machine learning algorithms (not the resulting models) don’t seem like “search” to me, though they also aren’t just a bag of heuristics and they do have a clearly delineated objective, so they fit well enough in the mesa optimization story.
(Also, reading back through this comment thread, I’m no longer sure whether or not a neural net could learn to play at least the 1-player random version of the SHA game. Certainly in the limit it can just memorize the input-output table, but I wouldn’t be surprised if it could get some accuracy even without that.)
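(To be concrete about what I’d check: below is a rough sketch. Note that the rules are my own stand-in; I’m treating the 1-player random version as roughly “predict one bit of SHA-256 of a random input”, which may not match the game as originally specified.)

```python
# Hedged sketch: can a small net get any test accuracy on a SHA-like task
# without memorizing the input-output table? The task here (predict one bit
# of SHA-256 of a random bitstring) is my stand-in for the 1-player random
# version of the game, not the actual rules.
import hashlib
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
N_BITS = 32   # large enough that train and test inputs are effectively distinct

def label(bits):
    # Label = lowest bit of SHA-256 of the input bitstring.
    return hashlib.sha256(bytes(bits.tolist())).digest()[-1] & 1

X = rng.integers(0, 2, size=(20000, N_BITS)).astype(np.uint8)
y = np.array([label(x) for x in X])

split = 15000
clf = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300)
clf.fit(X[:split], y[:split])

print("train acc:", clf.score(X[:split], y[:split]))  # memorization can push this up
print("test acc:", clf.score(X[split:], y[split:]))   # chance is 0.5; any gap above
                                                      # that is "accuracy without
                                                      # memorizing the table"
```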
It’s plausible to me that for tasks that we actually train on, we end up creating systems that are like mesa optimizers in the sense that they have broad capabilities that they can use on relatively new domains that they haven’t had much experience in before, but nonetheless, because they aren’t made up of two clean parts (mesa objective + capabilities), there isn’t a single obvious mesa objective that the AI system is optimizing for off distribution.
Coming back to this, can you give an example of the kind of thing you’re thinking of (in humans, animals, current ML systems)? Or other reason you think this could be the case in the future?
Also, do you think this will be significantly more efficient than “two clean parts (mesa objective + capabilities)”? (If not, it seems like we can use inner alignment techniques, e.g., transparency and verification, to force the model to be “two clean parts” if that’s better for safety.)
Coming back to this, can you give an example of the kind of thing you’re thinking of (in humans, animals, current ML systems)?
Humans don’t seem to have one mesa objective that we’re optimizing for. Even in this community, we tend to be uncertain about what our actual goal is, and most other people don’t even think about it. Humans do lots of things that look like “changing their objective”, e.g. maybe someone initially wants to have a family but then realizes they want to devote their life to public service because it’s more fulfilling.
Also, do you think this will be significantly more efficient than “two clean parts (mesa objective + capabilities)”?
I suspect it would be more efficient, but I’m not sure. (Mostly this is because humans and animals don’t seem to have two clean parts, but quite plausibly we’ll do something more interpretable than evolution and that will push towards a clean separation.) I also don’t know whether it would be better for safety to have it split into two clean parts.
Humans do lots of things that look like “changing their objective” [...]
That’s true, but unless the AI is doing something like human imitation or metaphilosophy (in other words, unless we have some reason to think that the AI will converge to the “right” values), it seems dangerous to let it “change its objective” on its own. Unless, I guess, it’s doing something like mild optimization or following norms, so that it can’t do much damage even if it switches to a wrong objective, and we can just shut it down and start over. But if it’s as messy as humans are, how would we know that it’s strictly following norms or doing mild optimization, and won’t “change its mind” about that too at some point (kind of like how a human who isn’t very strategic might suddenly have an insight or read something on the Internet and decide to become strategic)?
I think overall I’m still confused about your perspective here. Do you think this kind of “messy” AI is something we should try to harness and turn into a safety success story (if so how), or do you think it’s a danger that we should try to avoid (which may for example have to involve global coordination because it might be more efficient than safer AIs that do have clean separation)?
Oh, going back to an earlier comment, I guess you’re suggesting some of each: try to harness at lower capability levels, and coordinate to avoid at higher capability levels.
In this entire comment thread I’m not arguing that mesa optimizers are safe, or proposing courses of action we should take to make mesa optimization safe. I’m simply trying to forecast what mesa optimizers will look like if we follow the default path. As I said earlier,
I’m not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.
It’s very plausible that the mesa optimizers I have in mind are even more dangerous, e.g. because they “change their objective”. It’s also plausible that they’re safer, e.g. because they aren’t full-blown explicit EU maximizers and we can “convince” them to adopt goals similar to ours.
Mostly I’m saying these things because I think the picture presented in this sequence is not fully accurate, and I would like it to be more accurate. Having an accurate view of what problems will arise in the future tends to help with figuring out solutions to those problems.
Humans and systems produced by meta learning both do reasonably well at learning, and don’t do “search” (depending on how loose you are with your definition of “search”).
Part of what inspired me to write my comment was watching my kid play logic puzzles. When she starts a new game, she has to do a lot of random trial-and-error with backtracking, much like MCTS. (She does the trial-and-error on the physical game board, but when I play I often just do it in my head.) Then her intuition builds up and she can start to recognize solutions earlier and earlier in the search tree, sometimes even immediately upon starting a new puzzle level. Then the game gets harder (the puzzle levels slowly increase in difficulty) or moves to a new regime where her intuitions don’t work, and she has to do more trial-and-error again, and so on. This sure seems like “search” to me.
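If it helps, here’s roughly the procedure I’m picturing, as a sketch (the puzzle interface and the “intuition” function are stand-ins, not anything from a real solver):

```python
# Illustrative sketch: depth-first trial-and-error with backtracking, where a
# learned "intuition" orders moves so that backtracking happens less and less.

def solve(state, legal_moves, apply_move, is_solved, intuition=None):
    """Return a list of moves solving `state`, or None if this branch is a dead end."""
    if is_solved(state):
        return []
    moves = list(legal_moves(state))
    if intuition is not None:
        # A better intuition puts the right move first, so the solution is
        # "recognized" earlier and earlier in the search tree.
        moves.sort(key=lambda m: intuition(state, m), reverse=True)
    for move in moves:
        solution = solve(apply_move(state, move), legal_moves, apply_move,
                         is_solved, intuition)
        if solution is not None:
            return [move] + solution
        # else: backtrack and try the next candidate move
    return None
```

With no intuition this is brute-force trial-and-error; with a strong one the loop almost never backtracks; and when the puzzles move to a new regime the ordering goes wrong and the backtracking comes back, which is the cycle I was describing.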
Fwiw, on the original point, even standard machine learning algorithms (not the resulting models) don’t seem like “search” to me, though they also aren’t just a bag of heuristics and they do have a clearly delineated objective, so they fit well enough in the mesa optimization story.
This really confuses me. Maybe with some forms of supervised learning you can either calculate the solution directly or just follow a gradient (though it’s arguable whether that counts as “search”), but with RL, surely the “explore” steps have to count as “search”? Do you have a different kind of thing in mind when you think of “search”?
I agree that if you have a model of the system (as you do when you know the rules of the game), you can simulate potential actions and consequences, and that seems like search.
Usually, you don’t have a good model of the system, and then you need something else.
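To be concrete, the thing I’d clearly call search is something like the lookahead below; `simulate` and `score` are placeholders for a model of the system and an evaluation of outcomes. When you don’t have `simulate`, the inner loop can’t be written at all, which is where the “something else” comes in.

```python
# Sketch of what I'd clearly call search: `simulate` (a model of the system)
# and `score` (an evaluation of outcomes) are placeholders, not real APIs.

def choose_action(state, actions, simulate, score, depth=2):
    """Pick the action whose simulated consequences look best (plain lookahead)."""
    def value(s, remaining):
        if remaining == 0:
            return score(s)
        return max(value(simulate(s, a), remaining - 1) for a in actions)
    return max(actions, key=lambda a: value(simulate(state, a), depth - 1))
```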
Maybe with some forms of supervised learning you can either calculate the solution directly or just follow a gradient (though it’s arguable whether that counts as “search”), but with RL, surely the “explore” steps have to count as “search”?
I was thinking of following a gradient in supervised learning.
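i.e., something like this minimal sketch (with a made-up quadratic objective standing in for a real supervised loss): each step just moves locally along the gradient; no candidates are generated, evaluated, and discarded.

```python
# Minimal sketch of "following a gradient": repeated local steps, with no
# generation and evaluation of alternative candidates. The quadratic loss
# here is a stand-in for a real supervised objective.
import numpy as np

def gradient_descent(grad, w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)   # one local step; nothing is tried and discarded
    return w

# e.g. minimizing ||w - target||^2, whose gradient is 2 * (w - target)
target = np.array([1.0, -2.0, 3.0])
print(gradient_descent(lambda w: 2 * (w - target), w=np.zeros(3)))  # ends up near `target`
```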
I agree that pure reinforcement learning with a sparse reward looks like search. I doubt that pure RL with sparse reward is going to get you very far.
Reinforcement learning with demonstrations or a very dense reward doesn’t really look like search; it looks more like someone telling you what to do and you following the instructions faithfully.
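As a rough sketch of the contrast (the environment, demonstrations, and `fit` function below are placeholders, not a real API):

```python
# Rough sketch of the contrast; `env`, `demos`, and `fit` are placeholders.
import random

def sparse_reward_rl(env, episodes=1000):
    """Pure RL with sparse reward: mostly blind trial-and-error ("search")."""
    best_return, best_trajectory = float("-inf"), None
    for _ in range(episodes):
        state, done = env.reset(), False
        trajectory, total = [], 0.0
        while not done:
            action = random.choice(env.actions)      # explore blindly
            state, reward, done = env.step(action)   # reward is almost always 0
            trajectory.append(action)
            total += reward
        if total > best_return:                      # keep whatever stumbled onto reward
            best_return, best_trajectory = total, trajectory
    return best_trajectory

def learn_from_demonstrations(demos, fit):
    """RL with demonstrations / dense shaping: mostly just do what you're shown."""
    states = [s for s, _ in demos]
    actions = [a for _, a in demos]
    return fit(states, actions)   # ordinary supervised imitation, no exploration
```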