One thing we know about these models is that they’re good at interpolating within their training data, and that they have seen enormous amounts of training data. But they’re weak outside those large training sets. They have a very different set of strengths and weaknesses than humans.
And yet… I’m not 100% convinced that this matters. If these models have seen a thousand instances of self-reflection (or mirror-test awareness, or whatever), and if they can use those examples to generalize to other forms of self-awareness, then might that still give them a very rudimentary ability to pass the mirror test?
I’m not sure that I’m explaining this well. The key question here is “does generalizing over enough examples of passing the ‘mirror test’ actually teach the models some rudimentary (unconscious) self-awareness?” Or maybe, “Will the model fake it until it makes it?” I couldn’t confidently answer either way.
Come to think of it, how is it that humans pass the mirror test? There’s probably a lot of existing theorizing on this, but a quick guess without having read any of it: babies first spend a long time learning to control their body, and then learn an implicit rule like “if I can control it by an act of will, it is me”, getting a lot of training data that reinforces that rule. Then they see themselves in a mirror and notice that they can control their reflection through an act of will...
This is an incomplete answer, since it doesn’t explain how they learn that the entity in the mirror is not part of their actual body, but it does somewhat suggest that maybe humans, too, just interpolate their self-awareness from a bunch of training data.
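To make that guess a bit more concrete, here is a toy sketch in Python. It is pure illustration, not a claim about infant cognition or any existing model; the entity names, noise levels, and correlation threshold are all made up. The agent issues random motor commands and labels any observed entity whose motion tracks those commands as “self”, which would also fire for a mirror reflection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world: three observed entities. "own_hand" and "mirror_image" move with the
# agent's motor command (plus noise); "other_agent" moves on its own.
def observe_motion(command):
    return {
        "own_hand":     command + rng.normal(0, 0.1),
        "mirror_image": command + rng.normal(0, 0.1),
        "other_agent":  rng.normal(0, 1.0),
    }

# Gather "training data": random acts of will and the motion that follows each one.
commands, motions = [], []
for _ in range(500):
    c = rng.normal(0, 1.0)
    commands.append(c)
    motions.append(observe_motion(c))
commands = np.array(commands)

# The implicit rule: "if I can control it by an act of will, it is me",
# operationalized here as correlation between command and observed motion.
for entity in ["own_hand", "mirror_image", "other_agent"]:
    m = np.array([obs[entity] for obs in motions])
    corr = np.corrcoef(commands, m)[0, 1]
    label = "self" if abs(corr) > 0.5 else "other"
    print(f"{entity}: corr={corr:+.2f} -> {label}")
```

On this rule the reflection gets classified as “self”, which matches the incompleteness noted above: contingency of control alone doesn’t tell you the thing in the mirror isn’t literally part of your body.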
This was empirically demonstrated to be possible in “Curiosity-driven Exploration by Self-supervised Prediction” (Pathak et al.):

“We formulate curiosity as the error in an agent’s ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model.”
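For concreteness, here is a minimal PyTorch sketch of that formulation. The MLP encoder, layer sizes, and one-hot action handling are simplifications of mine, not the paper’s actual convolutional architecture:

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    """Curiosity = forward-model error in a feature space that is shaped by an
    inverse dynamics model (predicting which action was taken between states)."""
    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        self.n_actions = n_actions
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))
        # Inverse model: (phi(s_t), phi(s_t+1)) -> logits over the action taken.
        self.inverse = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))
        # Forward model: (phi(s_t), a_t) -> predicted phi(s_t+1).
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 64),
                                           nn.ReLU(), nn.Linear(64, feat_dim))

    def forward(self, obs, next_obs, action):
        # action: LongTensor of action indices, shape [batch]
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        action_onehot = nn.functional.one_hot(action, self.n_actions).float()
        # Cross-entropy on these logits is what trains the encoder features.
        action_logits = self.inverse(torch.cat([phi, phi_next], dim=-1))
        phi_next_pred = self.forward_model(torch.cat([phi, action_onehot], dim=-1))
        # Curiosity: how badly the agent predicted the consequence of its own action.
        # (detach so the forward error isn't minimized by degrading the features)
        intrinsic_reward = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)
        return intrinsic_reward, action_logits
```

Training would minimize a cross-entropy loss on action_logits (the inverse-dynamics objective is what biases the features toward things the agent can actually affect) plus the forward-model error, while intrinsic_reward is added to whatever extrinsic reward the task provides.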
It probably could be extended to learn “other” and the “boundary between self and other” in a similar way.
I implemented a version of it myself and it worked. This was years ago. I can only imagine what will happen when someone redoes some of these old RL algorithms with LLMs providing the world model.
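On that last point, a heavily caveated sketch of what the naive version might look like. Nothing here comes from the paper above; “gpt2” is just a stand-in model and the prompt template is invented. The idea is simply to reuse the same curiosity signal, with the LLM’s surprisal at the observed outcome (given a text description of state and action) playing the role of the prediction error:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder world model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def llm_curiosity_reward(state_desc, action_desc, next_state_desc):
    """Intrinsic reward = the LLM's surprisal (mean negative log-likelihood)
    of the observed outcome, given text descriptions of state and action."""
    prefix = f"State: {state_desc}\nAction: {action_desc}\nNext state:"
    target = f" {next_state_desc}"
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1, so slice out the target span.
    target_logits = logits[0, prefix_ids.shape[1] - 1 : -1]
    log_probs = torch.log_softmax(target_logits, dim=-1)
    token_lp = log_probs[torch.arange(target_ids.shape[1]), target_ids[0]]
    return float(-token_lp.mean())   # higher surprisal -> more "curious"

print(llm_curiosity_reward("the door is locked", "push the door", "the door stays shut"))
```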
Also, DEIR needs to implicitly distinguish between things it caused and things it didn’t: https://arxiv.org/abs/2304.10770
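To spell out what that distinction could mean operationally (to be clear, this is not DEIR’s actual mechanism; see the paper for that): one toy approach is to train two forward models, one conditioned on the agent’s action and one blind to it, and credit a change to the agent only when knowing the action actually reduces prediction error. The models below are untrained placeholders with made-up sizes; in practice both would be fit to experience.

```python
import torch
import torch.nn as nn

feat_dim, n_actions = 32, 4   # made-up sizes

# Two predictors of phi(s_t+1): one that sees the agent's action, one that doesn't.
forward_with_action = nn.Sequential(nn.Linear(feat_dim + n_actions, 64), nn.ReLU(),
                                    nn.Linear(64, feat_dim))
forward_blind       = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                    nn.Linear(64, feat_dim))

def agent_caused_novelty(phi, action_onehot, phi_next):
    """How much does knowing my own action improve the prediction of what happened
    next? Large -> the change was probably caused by me; near zero -> it would
    have happened anyway, so don't credit myself with the novelty."""
    with torch.no_grad():
        err_with  = (forward_with_action(torch.cat([phi, action_onehot], dim=-1))
                     - phi_next).pow(2).sum(-1)
        err_blind = (forward_blind(phi) - phi_next).pow(2).sum(-1)
    return torch.clamp(err_blind - err_with, min=0.0)
```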