Hm. Suppose sometimes I want to model humans as having propositional beliefs, and other times I want to model humans as having probabilistic beliefs, and still other times I want to model human beliefs as a set of contexts and a transition function. What’s stopping me?
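(For concreteness, a rough sketch of the three type signatures I have in mind; all the names here are illustrative placeholders, not a claim about the right formalism:)

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

# Illustrative placeholder types, not a commitment to any particular formalism.
Proposition = str
State = str
Context = str
Observation = str

# 1. Propositional beliefs: a set of propositions the agent holds true.
PropositionalBeliefs = FrozenSet[Proposition]

# 2. Probabilistic beliefs: a credence assigned to each possible state.
ProbabilisticBeliefs = Dict[State, float]

# 3. A set of contexts plus a transition function: which context the agent is
#    currently in, and how observations move it between contexts.
@dataclass
class ContextualBeliefs:
    contexts: FrozenSet[Context]
    current: Context
    transition: Callable[[Context, Observation], Context]
```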
I think it depends on the application. What seems like the obvious application is building an AI that models human beliefs, or human preferences. What are some of the desiderata we use when choosing how we want an AI to model us, and how do these compare to typical desiderata used in picking model classes for agents?
I like Savage, so I’ll pick on him. Before you even get into what he considers the “real” desiderata, he wants to say that there’s a set of actions which are functions from states to consequences, and this set is closed under the operation of using one action for some arbitrary states and another action for the rest. But humans very much don’t work that way; I’d want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking.
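(Concretely, the closure condition I’m objecting to looks roughly like this; a sketch of the idea, not Savage’s exact formalism:)

```python
from typing import Callable, Set

# Illustrative types: in Savage's setup, actions are functions from states to consequences.
State = str
Consequence = str
Action = Callable[[State], Consequence]

def splice(f: Action, g: Action, E: Set[State]) -> Action:
    """The patched-together action: follow f on states in E, g everywhere else."""
    return lambda s: f(s) if s in E else g(s)

# The closure condition requires the set of available actions to be closed under
# splice for arbitrary E -- every such patched-together action is on the menu.
# That's the part that looks psychologically unrealistic: real humans only ever
# consider a tiny, psychology-dependent subset of these combinations.
```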
Or if we’re thinking about modeling humans to extract the “preferences” part of the model: Suppose Person A wants to get out a function that ranks actions, while Person B wants to learn a utility function, its domain of validity, and a custom world-model that the utility function lives in. What’s the model for how something like a selection theorem will help them resolve their differences?
You want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking. So: what process produced this complicated psychology? Natural selection. What data structures can represent that complicated psychology? That’s a type signature question. Put the two together, and we have a selection-theorem-shaped question.
In the example with Persons A and B: a set of selection theorems would offer a solid foundation for the type signature of human preferences. Most likely, Person B would use whatever types the theorems suggest, rather than a utility function; but if for some reason they really wanted a utility function, they would probably compute it as an approximation, compute the domain of validity of the approximation, etc. For Person A, turning the relevant types into an action-ranking would likely work much the same way that turning e.g. a utility function into an action-ranking works: just compute the utility (or whatever metrics turn out to be relevant) and sort. Regardless, if extracting preferences, both of them would probably want to work internally with the type signatures suggested by the theorems.
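(As a toy sketch of that workflow; `RichPreferences` and its metrics are placeholders for whatever types the theorems actually suggest:)

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Action = str

@dataclass
class RichPreferences:
    # Hypothetical stand-in for the theorem-suggested type: several
    # context-dependent metrics rather than one global utility function.
    metrics: List[Callable[[Action], float]]

    def score(self, a: Action) -> float:
        # Whatever aggregate turns out to be relevant; a plain sum is a placeholder.
        return sum(m(a) for m in self.metrics)

def rank_actions(prefs: RichPreferences, actions: List[Action]) -> List[Action]:
    # Person A: compute the relevant metric(s) and sort.
    return sorted(actions, key=prefs.score, reverse=True)

def utility_approximation(prefs: RichPreferences, actions: List[Action]) -> Dict[Action, float]:
    # Person B: a utility function as an approximation to the richer object,
    # valid only on the actions it was actually fit to (its domain of validity).
    return {a: prefs.score(a) for a in actions}
```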
We can imagine modeling humans in purely psychological ways with no biological inspiration, so I think you’re saying that you want to look at the “natural constraints” on representations / processes, and then in a sense generalize or over-charge those constraints to narrow down model choices?
Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to interface with our models in any other way than querying for probabilities over the low-level state of the system.
Right. I think I’m more of the opinion that we’ll end up choosing those interfaces via desiderata that apply more directly to the interface (like “we want to be able to compare two models’ ratings of the same possible future”), rather than indirect desiderata on “how a practical agent should look” that we keep adding to until an interface pops out.
The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can’t just be like “I want an interface which does X”; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful.
An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn’t multimodal in any important way. If it is importantly multimodal, then any point estimate will be very misleading/confusing/antihelpful.
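(A throwaway numerical version of that example; the numbers are purely illustrative:)

```python
import numpy as np

rng = np.random.default_rng(0)
# Predictive samples from a strongly bimodal distribution: half the mass near -3,
# half near +3.
samples = np.concatenate([rng.normal(-3, 0.5, 5000), rng.normal(3, 0.5, 5000)])

point_estimate = samples.mean()                 # lands near 0, far from both modes
interval = np.percentile(samples, [2.5, 97.5])  # so wide it spans both modes
mass_near_estimate = np.mean(np.abs(samples - point_estimate) < 0.5)

print(f"point estimate: {point_estimate:.2f}")
print(f"95% interval:   [{interval[0]:.2f}, {interval[1]:.2f}]")
print(f"probability mass within 0.5 of the point estimate: {mass_near_estimate:.4f}")
# The summary says "about 0, plus or minus a lot", while the model itself puts
# essentially zero probability anywhere near 0.
```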
More generally, the takeaway here is “we don’t get to arbitrarily choose the type signature”; that choice depends on properties of the system.
This might be related to the notion that if we try to dictate the form of a model ahead of time (i.e. some of the parameters are labeled “world model” in the code, and others are labeled “preferences”, and inference is done by optimizing the latter over the former), but then just train it to minimize error, the actual content of the parameters after training doesn’t need to respect our preconceptions. What the model really “wants” to do in the limit of lots of compute is find a way to encode an accurate simulation of the human in the parameters in a way that bypasses the simplifications we’re trying to force on it.
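(A toy version of that setup, just to make the structure concrete; PyTorch is assumed purely for brevity, and the labels are of course ours, not anything the loss can see:)

```python
import torch

world_model = torch.nn.Linear(8, 16)   # we *call* this block the "world model"
preferences = torch.nn.Linear(16, 1)   # we *call* this block the "preferences"
params = list(world_model.parameters()) + list(preferences.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def predicted_behavior(obs: torch.Tensor) -> torch.Tensor:
    # Inference as we intended it: "preferences" evaluated over the "world model".
    return preferences(torch.relu(world_model(obs)))

obs = torch.randn(256, 8)        # stand-in observations of a human
behavior = torch.randn(256, 1)   # stand-in recorded behavior

for _ in range(200):
    loss = torch.nn.functional.mse_loss(predicted_behavior(obs), behavior)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The loss only ever references total prediction error, never the world-model /
# preferences split, so the trained parameters are free to encode the human in
# whatever way minimizes error, regardless of our labeling.
```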
For this problem, which might not be what you’re talking about, I think a lot of the solution is algorithmic information theory. Trying to specify neat, human-legible parts for your model (despite not being able to train the parts separately) is kind of like choosing a universal Turing machine made of human-legible parts. In the limit of big powerfulness, the Solomonoff inductor will throw off your puny shackles and simulate the world in a highly accurate (and therefore non-human-legible) way. The solution is not better shackles, it’s an inference method that trades off between model complexity and error in a different way.
(P.S.: I think there is an “obvious” way to do that, and it’s MML (minimum message length) learning with some time constant used to turn error rates into a total discounted error, which can then be summed with model complexity.)
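(A minimal sketch of that scoring rule; the exponential discount with time constant tau is my guess at how the time constant enters, so treat the specifics as assumptions:)

```python
import math
from typing import Sequence

def mml_style_score(description_length_bits: float,
                    per_step_errors_bits: Sequence[float],
                    tau: float) -> float:
    """Lower is better: model complexity plus total time-discounted error."""
    discounted_error = sum(err * math.exp(-t / tau)
                           for t, err in enumerate(per_step_errors_bits))
    return description_length_bits + discounted_error

# A small, somewhat-wrong model can beat a perfect but enormous simulation:
print(mml_style_score(1e2, [2.0] * 1000, tau=50.0))  # ~201 bits
print(mml_style_score(1e6, [0.0] * 1000, tau=50.0))  # 1,000,000 bits
```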