I love your health points analogy. Extending it, imagine that someone came up with “coherence arguments” that showed that for a rational doctor doing triage on patients, and/or for a group deciding who should do a risky thing that might result in damage, the optimal strategy involves a construct called “health points” such that:
--Each person at any given time has some number of health points
--Whenever someone reaches 0 health points, they (very probably) die
--Similar afflictions/disasters tend to cause similar decreases in health points, e.g. a bullet in the thigh causes me to lose 5 hp, you to lose 5 hp, and Katja to lose 5 hp.
Wouldn’t these coherence arguments be pretty awesome? Wouldn’t this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?
This is so despite the fact that someone could come along and say “Well, these coherence arguments assume a concept (our intuitive concept) of ‘damage’; they don’t tell us what ‘damage’ means. (Ditto for concepts like ‘die’, ‘person’, and ‘similar’.)” That would be true, and it would still be a good idea to do further deconfusion research along those lines, but it wouldn’t detract much from the epistemic victory the coherence arguments won.
> Wouldn’t these coherence arguments be pretty awesome? Wouldn’t this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?
Insofar as such a system could practically help doctors prioritise, that would be great. (This seems analogous to how utilities are used in economics.)
But if doctors use this concept to figure out how to treat patients, or to design prostheses for them, then I expect things to go badly. If you take HP as a guiding principle—for example, you say “our aim is to build an artificial liver with the most HP possible”—then I’m worried that this would harm your ability to understand what a healthy liver looks like at the level of cells, or tissues, or metabolic pathways, or roles within the digestive system, because HP is just not a well-defined concept at that level of resolution.
Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc., and the roles that all of these play within cognition as a whole—concepts which people just don’t talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, I think they’re making a mistake very similar to that of a doctor who tries to design artificial livers based on the theoretical triage virtues of HP.
> Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc., and the roles that all of these play within cognition as a whole—concepts which people just don’t talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, I think they’re making a mistake very similar to that of a doctor who tries to design artificial livers based on the theoretical triage virtues of HP.
I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are a bad abstraction, but that they are often used as if every utility function is as meaningful as any other. Here “meaningful” comes from thinking about cognition and about what following such a utility function would entail. There’s a pretty intuitive sense in which a utility function that encodes exactly one trajectory and nothing else, for a complex enough setting, doesn’t look like a goal.
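To make that intuition concrete, here is a minimal sketch (the action names and the hard-coded target sequence are invented for illustration, not taken from anything above): a utility function that assigns value 1 to one exact action sequence and 0 to everything else. The “optimal policy” for such a utility just replays the memorized sequence, which is hard to describe as pursuing a goal.

```python
# Minimal sketch (invented example): a utility function that encodes exactly one
# trajectory and nothing else.

TARGET_TRAJECTORY = ("left", "left", "up", "grab", "right")  # arbitrary hard-coded sequence

def trajectory_utility(trajectory):
    """Assign utility 1.0 to the single hard-coded trajectory and 0.0 to everything else."""
    return 1.0 if tuple(trajectory) == TARGET_TRAJECTORY else 0.0

def optimal_action(history):
    """The expected-utility maximizer for this utility simply replays the memorized
    sequence step by step; no recognizable goal is being pursued."""
    step = len(history)
    return TARGET_TRAJECTORY[step] if step < len(TARGET_TRAJECTORY) else None
```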
A difference between us, I think, is that I expect we can add structure that restricts the set of utility functions we consider (structure that comes from thinking, among other things, about cognition), such that maximizing expected utility for such a constrained utility function would actually capture most if not all of the aspects of goal-directedness that matter to us.
My internal model of you is that you believe this approach would not be enough, because the utility would not be defined on the internal concepts of the agent. Yet I think it doesn’t so much have to be defined on these internal concepts itself as to rely on some assumptions about these internal concepts. So we could either adapt the state space and action space, or keep fixed spaces but add mappings/equivalence classes/metrics on them that encode the relevant assumptions about cognition.
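As a loose sketch of the second option (the `concept_map` name, the features it returns, and the target pair below are all invented for illustration, not something proposed above), utility could be defined only through a map that stands in for assumptions about the agent’s internal concepts, so that raw states which map to the same concept-level description fall into the same equivalence class:

```python
# Rough sketch (invented names): utility on a fixed state space, but defined only
# through a "concept map" that encodes assumptions about the agent's cognition.
# Raw states with the same concept-level description are treated as equivalent.

def concept_map(raw_state):
    """Hypothetical stand-in for assumptions about the agent's internal concepts:
    collapse a raw state into the features the agent is assumed to represent."""
    return (raw_state["object_held"], raw_state["room"])

def utility(raw_state):
    """Utility factors through the concept map, so it only distinguishes states
    that differ at the concept level (an equivalence-class construction)."""
    return 1.0 if concept_map(raw_state) == ("key", "exit") else 0.0

# Two raw states that differ only in irrelevant detail get the same utility:
assert utility({"object_held": "key", "room": "exit", "time": 3}) == \
       utility({"object_held": "key", "room": "exit", "time": 7}) == 1.0
```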
> My internal model of you is that you believe this approach would not be enough, because the utility would not be defined on the internal concepts of the agent. Yet I think it doesn’t so much have to be defined on these internal concepts itself as to rely on some assumptions about these internal concepts.
Yeah, this is an accurate portrayal of my views. I’d also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and it failed badly. (The analogy is a little loose, so I wouldn’t take it as a decisive objection, but rather as a nudge to formulate a good explanation of what they were doing wrong that you will do right.)
> I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are a bad abstraction, but that they are often used as if every utility function is as meaningful as any other.
I don’t think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)
> Yeah, this is an accurate portrayal of my views. I’d also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and it failed badly. (The analogy is a little loose, so I wouldn’t take it as a decisive objection, but rather as a nudge to formulate a good explanation of what they were doing wrong that you will do right.)
My first intuition is that I expect mapping internal concepts to mathematical formalisms to be easier when the end goal is deconfusion and making sense of behaviors, compared to actually improving capabilities. But I’d have to think about it some more. Thanks, at least, for an interesting test to apply to my attempt.
> I don’t think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)
Okay, do you mean that you agree with my paragraph, but that what you are really arguing is that utility functions don’t care about the low-level internals of the system, and that’s why they’re bad abstractions? (That’s how I understand your liver and health points example.)