johnswentworth comments on Selection Theorems: A Program For Understanding Agents

johnswentworth Sep 29, 2021, 9:22 PM
Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to interface with our models in any way other than querying for probabilities over the low-level state of the system.
- Charlie Steiner Sep 30, 2021, 3:53 PM
  Right. I think I’m more of the opinion that we’ll end up choosing those interfaces via desiderata that apply more directly to the interface (like “we want to be able to compare two models’ ratings of the same possible future”), rather than indirect desiderata on “how a practical agent should look” that we keep adding to until an interface pops out.
  - johnswentworth Sep 30, 2021, 4:06 PM
    The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can’t just be like “I want an interface which does X”; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful.
    An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn’t multimodal in any important way. If it is importantly multimodal, then any point estimate will be very misleading/confusing/antihelpful.
    More generally, the takeaway here is “we don’t get to arbitrarily choose the type signature”; that choice depends on properties of the system.
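To make the multimodal example concrete, here is a minimal sketch, not anything from the original comment. It assumes a hypothetical predictive model whose output is a 50/50 mixture of two Gaussians centered at -5 and +5, and a hypothetical `point_estimate_interface` that reports a mean and a normal-approximation 95% interval. All names and numbers are illustrative assumptions.

```python
import random
import statistics


def sample_bimodal(n, seed=0):
    """Draw n samples from an assumed 50/50 mixture of N(-5, 1) and N(+5, 1)."""
    rng = random.Random(seed)
    return [
        rng.gauss(-5.0, 1.0) if rng.random() < 0.5 else rng.gauss(5.0, 1.0)
        for _ in range(n)
    ]


def point_estimate_interface(samples):
    """Hypothetical interface: a mean plus a normal-approximation 95% interval."""
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)
    return mean, (mean - 1.96 * sd, mean + 1.96 * sd)


def fraction_near(samples, center, radius):
    """Fraction of samples lying within `radius` of `center`."""
    return sum(abs(x - center) <= radius for x in samples) / len(samples)


samples = sample_bimodal(10_000)
mean, interval = point_estimate_interface(samples)

# The reported "best guess" sits between the modes, where almost no
# probability mass actually lives.
near_mean = fraction_near(samples, mean, 1.0)
print(f"point estimate: {mean:.2f}, interval: ({interval[0]:.2f}, {interval[1]:.2f})")
print(f"fraction of samples within 1.0 of the point estimate: {near_mean:.4f}")
```

The interval is wide enough to cover both modes, so it is not wrong as a coverage statement, but the summary still suggests outcomes near zero are typical when under this assumed distribution they almost never occur.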
    - Charlie Steiner Sep 30, 2021, 8:36 PM
      This might be related to the notion that if we try to dictate the form of a model ahead of time (e.g. some of the parameters are labeled “world model” in the code, others are labeled “preferences”, and inference is done by optimizing the latter over the former), but then just train it to minimize error, the actual content of the parameters after training doesn’t need to respect our preconceptions. What the model really “wants” to do in the limit of lots of compute is find a way to encode an accurate simulation of the human in the parameters, in a way that bypasses the simplifications we’re trying to force on it.
      
      For this problem, which might not be what you’re talking about, I think a lot of the solution is algorithmic information theory. Trying to specify neat, human-legible parts for your model (despite not being able to train the parts separately) is kind of like choosing a universal Turing machine made of human-legible parts. In the limit of enough power, the Solomonoff inductor will throw off your puny shackles and simulate the world in a highly accurate (and therefore non-human-legible) way. The solution is not better shackles; it’s an inference method that trades off between model complexity and error in a different way.
      
      (P.S.: I think there is an “obvious” way to do that, and it’s MML (minimum message length) learning with some time constant used to turn error rates into total discounted error, which can then be summed with model complexity.)
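One possible reading of the P.S., sketched under loudly stated assumptions: measure model complexity in bits, measure error as excess bits per time step, and convert the error rate into a total by discounting with a time constant (a geometric discount with factor 1 - 1/time_constant sums to time_constant steps). The function names, the two candidate models, and all numbers below are hypothetical illustrations, not the commenter's actual proposal.

```python
def total_cost_bits(complexity_bits, error_bits_per_step, time_constant):
    """Complexity plus total discounted error, both in bits.

    Assumption: discounting each step by (1 - 1/time_constant) makes the
    discounted sum of a constant error rate equal to
    time_constant * error_bits_per_step.
    """
    return complexity_bits + time_constant * error_bits_per_step


# Hypothetical candidates: (complexity in bits, error in bits per step).
candidates = {
    "legible_simple": (100.0, 0.5),
    "opaque_accurate": (10_000.0, 0.01),
}


def best_model(candidates, time_constant):
    """Return the name of the candidate with the lowest total cost."""
    return min(
        candidates,
        key=lambda name: total_cost_bits(*candidates[name], time_constant),
    )


short_horizon_choice = best_model(candidates, time_constant=1_000)
long_horizon_choice = best_model(candidates, time_constant=100_000)
print(short_horizon_choice, long_horizon_choice)
```

Under these made-up numbers, a short time constant favors the simple model and a long one favors the accurate but complex model, so in this reading the time constant is the knob that sets the trade-off between complexity and error.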