Gunnar_Zarncke comments on Barriers to Mechanistic Interpretability for AGI Safety

Gunnar_Zarncke 29 Aug 2023 17:12 UTC
7 points
1
AGI systems interact with their environment, learn online and will externalize massive parts of their cognition into the environment.
This comment is not about interpretability but a generalization of the question.
What is the AGI system and what is the environment? Where does the AGI system draw the boundary when reasoning about itself?
For humans, there is a clearer agent-environment distinction because we have bodies with a relatively clear physical boundary (though some people might already see their body as part of the environment and only count their brain or even their mind, however delineated). For AGI systems it is less clear. Is it the running software, the computers, the whole compute center, or even the organization keeping the machines running?
- Connor Leahy 30 Aug 2023 8:07 UTC
  6 points
  0
  Parent
  Yep, you see the problem! It’s tempting to just think of an AI as “just the model”, and study that in isolation, but that just won’t be good enough long term.
  - mesaoptimizer 7 Sep 2023 13:25 UTC
    2 points
    −1
    Parent
    I see—you are implying that an AI model will leverage external system parts to augment itself. For example, a neural network would use an external scratch-pad as a different form of memory for itself. Or instantiate a clone of itself to do a certain task for it. Or perhaps use some sort of scaffolding.
    
    I think these concerns probably don’t matter for an AGI, because I expect that data transfer latency would be a non-trivial blocker for storing data outside the model itself, and it is more efficient to self-modify and improve one’s own intelligence than to use some form of ‘factored cognition’. Perhaps these things are issues for an ostensibly boxed AGI, and if that is the case, then this makes a lot of sense.
    - Connor Leahy 8 Sep 2023 8:59 UTC
      3 points
      0
      Parent
      I strongly disagree and do not think that is how AGI will look; AGI isn’t magic. But this is a crux and I might be wrong, of course.
    - Noosphere89 7 Sep 2023 14:41 UTC
      2 points
      0
      Parent
      Yep, the latency and performance are real killers for embodied type cognition. I remember a tweet that suggested the entire Internet was not enough to train the model.
  - Gunnar_Zarncke 30 Aug 2023 14:25 UTC
    2 points
    0
    Parent
    It would be nice if the AGI saw the humans running its compute resources as part of its body that it wants to protect. The problem is that we humans also tamper with our bodies… Humans are like hair on the body of the AGI and maybe it wants to shave and use a wig.
- DusanDNesic 29 Aug 2023 20:22 UTC
  5 points
  3
  Parent
  Even for humans—are my nails me? Once clipped, are they me? Is my phone me? I feel like my phone is more me than my hair, for example. Is my child me, are my memes me, is my country me, etc etc… There are many reasons why agent boundaries are problematic, and that problem continues in AI Safety research.
- Carl Feynman 29 Aug 2023 21:08 UTC
  4 points
  1
  Parent
  Even worse: existing AI systems can call systems under the control of other companies, can write their own software and call it, or can be called by systems that are not themselves AI. How do you ensure they are safe under all permutations of such activities?
  You could say “Well, don’t do that, then,” but that horse has left the barn.
- Logan Riggs 31 Aug 2023 16:39 UTC
  2 points
  0
  Parent
  Wait, I don’t understand this at all. For language models, the environment is the text. For different environments, those training datasets will be the environment.
  - Gunnar_Zarncke 31 Aug 2023 20:06 UTC
    2 points
    0
    Parent
    This is not primarily about LLMs, which are Simulators (see also Janus’ Simulators), but about more general systems—AGIs.
    - Logan Riggs 31 Aug 2023 23:26 UTC
      2 points
      0
      Parent
      I meant to cover this in the “for different environments” parts. Like if we self-play on certain games, we’ll still have access to those games.