Thanks for writing this post, Katja; I’m very glad to see more engagement with these arguments. However, I don’t think the post addresses my main concern about the original coherence arguments for goal-directedness, which I’d frame as follows:
There’s some intuitive conception of goal-directedness, which is worrying in the context of AI. The old coherence arguments implicitly used the concept of EU-maximisation as a way of understanding goal-directedness. But Rohin demonstrated that the most straightforward conception of EU-maximisation (which I’ll call behavioural EU-maximisation) is inadequate as a theory of goal-directedness, because it applies to any agent. In order to fix this problem, the main missing link is not a stronger (probabilistic) argument for why AGIs will be coherent EU-maximisers, but rather an explanation of what it even means for a real-world agent to be a coherent EU-maximiser, which we don’t currently have.
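To make the vacuity point concrete, here is a sketch of the sort of construction that shows behavioural EU-maximisation applies to any agent (my own paraphrase of the standard argument, not a quote from Rohin’s post):

```latex
% Sketch: any policy maximises expected utility for some utility function.
% Let \pi be the observed policy (in its actual environment), and define a
% utility function over whole trajectories \tau:
\[
  u(\tau) =
  \begin{cases}
    1 & \text{if $\tau$ has positive probability under $\pi$,}\\
    0 & \text{otherwise.}
  \end{cases}
\]
% Then for every alternative policy \pi',
\[
  \mathbb{E}_{\tau \sim \pi}\!\left[u(\tau)\right] = 1 \;\ge\; \mathbb{E}_{\tau \sim \pi'}\!\left[u(\tau)\right],
\]
% so \pi is an expected-utility maximiser with respect to u, whatever it does.
```

Because this construction goes through for literally any behaviour, “maximises expected utility for some utility function” puts no constraint on what the agent does, which is the sense in which the behavioural reading is vacuous.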
By “behavioural EU-maximisation”, I mean thinking of a utility function as something that we define purely in terms of an agent’s behaviour. In response to this, you identify an alternative definition of expected utility maximisation which isn’t purely behavioural, but also refers to an agent’s internal features:
An outside observer being able to rationalize a sequence of observed behavior as coherent doesn’t mean that the behavior is actually coherent. Coherence arguments constrain combinations of external behavior and internal features—‘preferences’ and beliefs. So whether an actor is coherent depends on what preferences and beliefs it actually has.
But you don’t characterise those internal features in a satisfactory way, or point to anyone else who does. The closest you get is in your footnote, where you fall back on a behavioural definition of preferences:
When exactly an aspect of these should be considered a ‘preference’ for the sake of this argument isn’t entirely clear to me, but would seem to depend on something like whether it tends to produce actions favoring certain outcomes over other outcomes across a range of circumstances
I’m sympathetic to this, because it’s hard to define preferences without reference to behaviour. We just don’t know enough about cognitive science yet to do so. But it means that your conception of EU-maximisation is still vulnerable to Rohin’s criticisms of behavioural EU-maximisation, because you still have to extract preferences from behaviour.
From my perspective, then, claims like “Anything that weakly has goals has reason to reform to become an EU maximizer” (as made in this comment) miss the crux of the disagreement. It’s not that I believe the claim is false; I just don’t know what it means, and I don’t think anyone else does either. Unfortunately, the fact that there are theorems about EU maximisation in some restricted formalisms makes people think that it’s a concept which is well-defined for real-world agents to a much greater extent than it actually is.
Here’s an exaggerated analogy to help convey what I mean by “well-defined concept”. Characters in games often have an attribute called health points (HP), and die when their health points drop to 0. Conceivably you could prove a bunch of theorems about health points in a certain class of games, e.g. that having more is always good. Okay, so is having more health points always good for real-world humans (or AIs)? I mean, we must have something like the health point formalism used in games, because if we take too much damage, we die! Sure, some critics say that defining health points in terms of external behaviour (like dying) is vacuous—but health points aren’t just about behaviour, we can also define them in terms of an agent’s internal features (like the tendency to die in a range of circumstances).
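For concreteness, here is the sort of trivial theorem I have in mind (a toy sketch of my own, not taken from any actual game):

```python
# Toy sketch (my own illustration): a minimal "class of games" in which
# "more HP is always weakly good" is a trivial theorem.
def survives(initial_hp: int, damage_events: list[int]) -> bool:
    """The character takes damage events in order and dies if HP drops to 0 or below."""
    hp = initial_hp
    for damage in damage_events:
        hp -= damage
        if hp <= 0:
            return False
    return True

# For any fixed sequence of damage events, survives() is monotone in initial_hp:
# if survives(h, events) is True and h2 >= h, then survives(h2, events) is True.
```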
I would say that EU is like “health points”: a concept which is interesting to reason about in some formalisms, and which is clearly related to an important real-world concept, but whose relationship to that non-formal real-world concept we don’t yet understand well. Perhaps continued investigation can fix this; I certainly hope so! But in the meantime, using “EU-maximisation” instead of “goal-directedness” feels similar to using “health points” as a substitute for “health”—its main effect is to obscure our conceptual confusion under a misleading layer of formalism, thereby making the associated arguments seem stronger than they actually are.
I love your health points analogy. Extending it, imagine that someone came up with “coherence arguments” that showed that for a rational doctor doing triage on patients, and/or for a group deciding who should do a risky thing that might result in damage, the optimal strategy involves a construct called “health points” such that:
--Each person at any given time has some number of health points
--Whenever someone reaches 0 health points, they (very probably) die
--Similar afflictions/disasters tend to cause similar amounts of decrease in health points, e.g. a bullet in the thigh causes me to lose 5 hp and you to lose 5 hp and Katja to lose 5 hp.
Wouldn’t these coherence arguments be pretty awesome? Wouldn’t this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?
This is so despite the fact that someone could come along and say “Well, these coherence arguments assume a concept (our intuitive concept) of ‘damage’; they don’t tell us what ‘damage’ means (ditto for concepts like ‘die’ and ‘person’ and ‘similar’).” That would be true, and it would still be a good idea to do further deconfusion research along those lines, but it wouldn’t detract much from the epistemic victory the coherence arguments won.
Wouldn’t these coherence arguments be pretty awesome? Wouldn’t this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?
Insofar as such a system could practically help doctors prioritise, that would be great. (This seems analogous to how utilities are used in economics.)
But if doctors use this concept to figure out how to treat patients, or use it when designing prostheses for their patients, then I expect things to go badly. If you take HP as a guiding principle—for example, you say “our aim is to build an artificial liver with the most HP possible”—then I’m worried that this would harm your ability to understand what a healthy liver looks like at the level of cells, or tissues, or metabolic pathways, or roles within the digestive system. Because HP is just not a well-defined concept at that level of resolution.
Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc., and the roles that all of these play within cognition as a whole—concepts which people just don’t talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, then I think they’re making much the same mistake as a doctor who tries to design artificial livers based on the theoretical triage virtues of HP.
Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc., and the roles that all of these play within cognition as a whole—concepts which people just don’t talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, then I think they’re making much the same mistake as a doctor who tries to design artificial livers based on the theoretical triage virtues of HP.
I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are a bad abstraction, but that they are often used as if every utility function is as meaningful as any other. Here, what makes a utility function ‘meaningful’ comes from thinking about cognition and about what following it would entail. There’s a pretty intuitive sense in which a utility function that encodes exactly one trajectory and nothing else, in a complex enough setting, doesn’t look like a goal.
A difference between us, I think, is that I expect we can add structure that restricts the set of utility functions we consider (structure that comes from thinking, among other things, about cognition), such that maximizing expected utility for such a constrained utility function would actually capture most if not all of the aspects of goal-directedness that matter to us.
My internal model of you is that you believe this approach would not be enough, because the utility would not be defined on the internal concepts of the agent. Yet I think the utility doesn’t so much need to be defined on those internal concepts itself as it needs to rely on some assumptions about them. So we could either adapt the state space and action space, or keep fixed spaces but add mappings/equivalence classes/metrics on them that encode the relevant assumptions about cognition.
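To gesture at the kind of restriction I have in mind, here is a toy sketch (hypothetical names and a deliberately simple abstraction, not a worked-out proposal): rather than allowing any utility over full trajectories, we only allow utilities that factor through some abstraction or equivalence-class map on states, which is where the assumptions about cognition would enter.

```python
# Toy sketch (hypothetical, not a worked-out proposal): contrast an unrestricted
# utility that just encodes one trajectory with a "structured" utility that is
# forced to factor through an abstraction of the state.
from typing import Callable, Hashable, Sequence

State = Hashable
Trajectory = Sequence[State]

def trajectory_encoding_utility(target: Trajectory) -> Callable[[Trajectory], float]:
    """Degenerate utility: rewards exactly one trajectory and nothing else.
    Always expressible, but intuitively it doesn't look like a goal."""
    target_t = tuple(target)
    return lambda traj: 1.0 if tuple(traj) == target_t else 0.0

def abstracted_utility(
    abstraction: Callable[[State], Hashable],
    value_of_class: Callable[[Hashable], float],
) -> Callable[[Trajectory], float]:
    """Restricted utility: depends only on the abstraction of the final state,
    so it cannot encode arbitrary trajectories. The choice of abstraction is
    where assumptions about the agent's cognition would be encoded."""
    return lambda traj: value_of_class(abstraction(traj[-1]))
```

Of course, the real work is in choosing abstractions that actually track the agent’s cognition; the sketch just shows where that structure would slot in.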
My internal model of you is that you believe this approach would not be enough, because the utility would not be defined on the internal concepts of the agent. Yet I think the utility doesn’t so much need to be defined on those internal concepts itself as it needs to rely on some assumptions about them.
Yeah, this is an accurate portrayal of my views. I’d also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and it failed badly. (The analogy is a little loose, though, so I wouldn’t take it as a decisive objection, but rather as a nudge to formulate a good explanation of what they were doing wrong that you will do right.)
I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are a bad abstraction, but that they are often used as if every utility function is as meaningful as any other.
I don’t think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)
Yeah, this is an accurate portrayal of my views. I’d also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and it failed badly. (The analogy is a little loose, though, so I wouldn’t take it as a decisive objection, but rather as a nudge to formulate a good explanation of what they were doing wrong that you will do right.)
My first intuition is that I expect mapping internal concepts to mathematical formalisms to be easier when the end goal is deconfusion and making sense of behaviors, compared to actually improving capabilities. But I’d have to think about it some more. Thanks, at least, for an interesting test to apply to my attempt.
I don’t think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)
Okay, do you mean that you agree with my paragraph, but what you are really arguing is that utility functions don’t care about the low-level internals of the system, and that’s why they’re bad abstractions? (That’s how I understand your liver and health points example.)