Thanks, this is really cool! I don’t know much about this stuff so I may be getting over-hyped, but still.
I’d love to see a follow-up to this post that starts with the Takeaway and explains how this would work in practice for a big artificial neural net undergoing training. Something like this, I expect:
--There are lottery tickets that make pretty close to optimal decisions, and as training progresses they get more and more weight, until eventually the network is dominated by one or more of the close-to-optimal lottery tickets.
--Because of key conditions 2 and 3, the optimal tickets will involve some sort of subcomponent that compresses information from X and stores it to later be combined with Y. (A toy sketch of this structure follows the list.)
--Key condition 4 might not be strictly true in practice: our dataset of training examples may not be so diverse that literally every way the distribution over S could vary corresponds to a different optimal behavior. And even if it were that diverse, it might take a long time to learn our way through the entire dataset. However, what we CAN say is that the subcomponent that compresses information from X and stores it to later be combined with Y will increasingly (as training continues) store “all and only the relevant information,” i.e. all and only the information that has mattered to performance thus far in training. Moreover, there is intuitively an extent to which Y can “choose many different games,” i.e. an extent to which the training data so far has “made relevant” information about various aspects of S. To that extent, the network will store information about those aspects of S.
--Thus, for a neural network being trained on some very complex open-ended real-world task/environment, it’s plausible that Y can “choose many different games” to a large extent, such that the close-to-optimal lottery tickets will have an information-compressing-and-retaining subcomponent that contains lots of information about S but not much information about X. In particular, it in some sense “represents all the aspects of S that are likely to be relevant to decision-making.”
--Intuitively, this is sufficient for us to be confident in saying: Neural nets trained on very complex open-ended real-world tasks/environments will build, remember, and use internal models of their environments. (Maybe we even add something like …for something which resembles expected utility maximization!)
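To make the compress-then-combine structure in the bullets above concrete, here is a minimal toy sketch in PyTorch. Everything in it (the class name, the dimensions, the two-stage forward pass) is my own illustrative assumption rather than anything from the post; the point is just that X gets squeezed into a small latent M before Y arrives, so only the information about S that survives the bottleneck can influence the output.

```python
import torch
import torch.nn as nn

class CompressThenCombine(nn.Module):
    """Hypothetical toy net: X -> map M (the bottleneck), then (M, Y) -> output."""
    def __init__(self, x_dim=32, y_dim=8, m_dim=4, out_dim=2):
        super().__init__()
        # Encoder compresses the observation X of the latent state S into M.
        self.encoder = nn.Sequential(nn.Linear(x_dim, 16), nn.ReLU(), nn.Linear(16, m_dim))
        # Policy combines the stored map M with the later-arriving input Y.
        self.policy = nn.Sequential(nn.Linear(m_dim + y_dim, 16), nn.ReLU(), nn.Linear(16, out_dim))

    def forward(self, x, y):
        m = self.encoder(x)  # computed and stored before Y is seen
        return self.policy(torch.cat([m, y], dim=-1))

model = CompressThenCombine()
x = torch.randn(5, 32)    # batch of observations of S
y = torch.randn(5, 8)     # batch of "game" specifications
print(model(x, y).shape)  # torch.Size([5, 2])
```

On this picture, the training pressure described in the bullets falls on the encoder: to the extent the distribution over Y “makes relevant” various aspects of S, M is pushed toward retaining all and only that information.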
Anyhow. I think this is important enough that there should be a published paper laying all this out. It should start with the theorem you just proved/fixed, and then move on to the neural net context like I just sketched. This is important because it’s something a bunch of people would want to cite as a building block for arguments about agency, mesa-optimizers, human modelling, etc. etc.
Your bullet points are basically correct. In practice, applying the theorem to any particular NN would require some careful setup to make the causal structure match—i.e. we have to designate the right things as “system”, “regulator”, “map”, “inputs X & Y”, and “outcome”, and that will vary from architecture to architecture. But I expect it can be applied to most architectures used in practice.
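As a concrete example of the kind of designation this would require, here is one entirely hypothetical role assignment for a recurrent policy network; nothing here is the post’s canonical mapping, and the right assignment will differ from architecture to architecture:

```python
# Hypothetical role assignment for applying the theorem to a recurrent policy net.
roles = {
    "system":    "latent environment state S that generates the observations",
    "input_X":   "observation stream consumed by the encoder/RNN",
    "map":       "RNN hidden state summarizing X before Y arrives",
    "input_Y":   "later task specification / query inputs",
    "regulator": "output head computing actions from (hidden state, Y)",
    "outcome":   "task loss or reward used for training",
}
```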
I’m probably not going to turn this into a paper myself soon. At the moment, I’m pursuing threads which I think are much more promising—in particular, thinking about when a “regulator’s model” mirrors the structure of the system/environment, not just its black-box functionality. This was just a side-project within that pursuit. If someone else wants to turn this into a paper, I’d be happy to help, and there’s enough technical work to be done in applying it to NNs that you wouldn’t just be writing up this post.
Doesn’t sound like a job for me, but would you consider e.g. getting a grant to hire someone to coauthor this with you? I think the “getting a grant” part would not be the hard part.
Yeah, “get a grant” is definitely not the part of that plan which is a hard sell. Hiring people is a PITA. If I ever get to a point where I have enough things like this that could relatively easily be offloaded to another person, I’ll probably do it. But at this point, no.