> 1. AI systems which pursue goals are also known as mesa-optimisers, as coined in Hubinger et al.’s paper _Risks from Learned Optimisation in Advanced Machine Learning Systems_.
Nitpicky, but I think it would be nice to state explicitly that the AI systems here are learned, because the standard definition of mesa-optimizers is that they are optimized optimizers. Also, I think it would be better to say explicitly that mesa-optimizers are optimizers. Given your criteria of goal-directed agency, that’s implicit, but at this point the criteria are not yet stated.
> Meanwhile, Dennett argues that taking the intentional stance towards systems can be useful for making predictions about them—but this only works given prior knowledge about what goals they’re most likely to have. Predicting the behaviour of a trillion-parameter neural network is very different from applying the intentional stance to existing artifacts. And while we do have an intuitive understanding of complex human goals and how they translate to behaviour, the extent to which it’s reasonable to extend those beliefs about goal-directed cognition to artificial intelligences is the very question we need a theory of agency to answer. So while Dennett’s framework provides some valuable insights—in particular, that assigning agency to a system is a modelling choice which only applies at certain levels of abstraction—I think it fails to reduce agency to simpler and more tractable concepts.
I agree with you that the intentional stance requires some assumption about the goals of the system you’re applying it to. But I disagree that this makes it very hard to apply the intentional stance to, let’s say, neural networks. That’s because I think that goals have some special structure (being compressed, for example), which means that there aren’t that many different goals. So the intentional stance does reduce goal-directedness to simpler concepts like goals, and gives additional intuitions about them.
That being said, I also have issues with the intentional stance. Most problematic is the fact that it doesn’t give you a way to compute the goal-directedness of a system.
About your criteria, I have a couple of questions/observations.
Combining 1, 2 and 3 seems to yield an optimizer in disguise: something that plans according to some utility/objective, in an embedded way. The difference from mesa-optimizers (or simply optimizers) is that you treat the ingredients of optimization separately, but it still has the same problem of needing an objective it can use (for point 3).
About 4, I think I see what you’re aiming at (having long-term goals), but I’m confused by the way it is written. It depends on the objective/utility from 3, but it’s not clear what “sensitive” means for an objective. Do you mean that the objective values long-term plans more? That it doesn’t discount with the length of plans? Or instead something more like the expanding moral circle, where the AI has an objective that treats the near future and far future, and near and far things, equally?
Also, about 5: coherent goals (in the sense of goals that don’t change) are a very dangerous case, but I’m not convinced that goal-directed agents must have one goal forever.
I agree completely about 6. It’s very close to the distinction between habitual behavior and goal-directed behavior in psychology.
On the examples of lacking 2, I feel like the ones you’re giving could be goal-directed. For example, limiting the actions or context doesn’t necessarily ensure a lack of goal-directedness; it is more about making a deceptive plan harder to pull off.
Your definition of goals looks like a more constrained kind of utility function, defined on equivalence classes of states/outcomes as abstracted by the agent’s internal concepts. Is that correct? If so, do you have an idea of what specific properties such utility functions could have as a consequence? I’m interested in that, because I would really like a way to define a goal as a behavioral objective satisfying some structural constraints.
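To make that question concrete, here is roughly the picture I have in mind (the notation is mine, nothing from your post): write $S$ for the set of states/outcomes, $c$ for the agent’s abstraction map into its internal concepts, and $s \sim s'$ iff $c(s) = c(s')$. A goal would then be something like

$$U : S/\!\sim\; \to \mathbb{R},$$

a utility function that is constant on each equivalence class, together with whatever structural constraints (compressibility, staying fixed over long enough stretches of time, ...) separate goals from arbitrary utility functions.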
> I think it would be nice to state explicitly that the AI systems here are learned
Good point, fixed.
> coherent goals (in the sense of goals that don’t change) are a very dangerous case, but I’m not convinced that goal-directed agents must have one goal forever.
We should categorise something as a goal-directed agent if it scores highly on most of these criteria, not just if it scores perfectly on all of them. So I agree that you don’t need one goal forever, but you do need it for more than a few minutes. And internal unification also means that the whole system is working towards this.
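To illustrate the graded picture I have in mind (the weighted-sum form is just a placeholder, not something I’m committed to): if each criterion gets a score $c_i \in [0,1]$, then goal-directedness looks more like a weighted sum

$$\sum_i w_i \, c_i$$

being high than like every $c_i$ equalling 1, so a system can count as strongly goal-directed while scoring low on one or two criteria.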
> examples of lacking 2, I feel like the ones you’re giving could be goal-directed
Same here: lacking this doesn’t guarantee a lack of goal-directedness, but it’s one contributing factor. As another example, we might say that humans often plan in a restricted way: only doing things that we’ve seen other people do before. And this definitely makes us less goal-directed.
> it’s not clear what “sensitive” means for an objective. Do you mean that the objective values long-term plans more? That it doesn’t discount with the length of plans? Or instead something more like the expanding moral circle, where the AI has an objective that treats the near future and far future, and near and far things, equally?
By “sensitive” I merely mean that differences in expected long-term or large-scale outcomes sometimes lead to differences in current choices.
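As a toy version of what I mean: take two plans $a$ and $b$ whose expected outcomes over the next few days are identical, but whose expected outcomes years later or far away differ. Choices are “sensitive” in my sense if, for at least some such pairs, the agent doesn’t treat the plans interchangeably, i.e. something like

$$\exists\, a, b: \quad \mathbb{E}[\text{near-term}(a)] = \mathbb{E}[\text{near-term}(b)] \;\text{ and }\; a \succ b,$$

purely because of the long-term or large-scale differences.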
> The difference from mesa-optimizers (or simply optimizers) is that you treat the ingredients of optimization separately, but it still has the same problem of needing an objective it can use
Yeah, I think there’s still much more to be done to make this clearer. I guess my criticism of mesa-optimisers was that they talked about explicit representation of the objective function (whatever that means), whereas I think my definition relies more on the values of choices being represented. I don’t know how much of an improvement this is.
> Your definition of goals looks like a more constrained kind of utility function, defined on equivalence classes of states/outcomes as abstracted by the agent’s internal concepts. Is that correct?
I don’t really know what it means for something to be a utility function. I assume you could interpret it that way, but my definition of goals also includes deontological goals, which would make that interpretation harder. I like the “equivalence classes” thing more, but I’m not confident enough about the space of all possible internal concepts to claim that it’s always a good fit.
> do you have an idea of what specific properties such utility functions could have as a consequence
I expect that asking “what properties do these utility functions have” will generally be more misleading than asking “what properties do these goals have”, because the former gives you an illusion of mathematical transparency. My tentative answer to the latter question is that, due to Moravec’s paradox, they will have the properties of high-level human thought more than the properties of low-level human thought. But I’m still pretty confused about this.
> We should categorise something as a goal-directed agent if it scores highly on most of these criteria, not just if it scores perfectly on all of them. So I agree that you don’t need one goal forever, but you do need it for more than a few minutes. And internal unification also means that the whole system is working towards this.
If coherence is about having the same goal for a “long enough” period of time, then it makes sense to me.
> By “sensitive” I merely mean that differences in expected long-term or large-scale outcomes sometimes lead to differences in current choices.
So the thing that judges outcomes in a goal-directed agent is “not always privileging short-term outcomes”? Then I guess it’s also a scale, because there’s a big difference between a system with a single case where it privileges long-term outcomes over short-term ones, and a system that focuses on long-term outcomes.
> Yeah, I think there’s still much more to be done to make this clearer. I guess my criticism of mesa-optimisers was that they talked about explicit representation of the objective function (whatever that means), whereas I think my definition relies more on the values of choices being represented. I don’t know how much of an improvement this is.
I agree that the explicit representation of the objective is weird. But on the other hand, it’s an explicit and obvious weirdness that either calls for clarification or for changes. Whereas in your criteria, I feel that essentially the same idea is made implicit/less weird, without actually bringing a better solution. Your approach might be better in the long run, possibly because rephrasing the question in these terms lets us find a non-weird way to define this objective.
I just wanted to point out that, in our current state of knowledge, I feel like there are drawbacks to “hiding” the weirdness the way you do.
> I don’t really know what it means for something to be a utility function. I assume you could interpret it that way, but my definition of goals also includes deontological goals, which would make that interpretation harder. I like the “equivalence classes” thing more, but I’m not confident enough about the space of all possible internal concepts to claim that it’s always a good fit.
One idea I had for defining goals is as temporal logic properties (for example in LTL) over states. That lets you express things like “I want to reach one of these states” or “I never want to reach this state”; the latter looks like a deontological property to me. Thinking some more about this led me to see two issues:
First, it doesn’t let you encode preferences for some states over others. That might be solvable by adding a partial order with nice properties, like Stuart Armstrong’s partial preferences.
Second, the system doesn’t have access to the states of the world; it only has access to its abstractions of those states. Here we go back to the equivalence classes idea. Maybe a way to cash out your internal abstractions and Paul’s ascriptions of beliefs is through an equivalence relation on the states of the world, such that the goal of the system is defined on the equivalence classes of this relation.
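Here is a minimal sketch of that picture, just to make it concrete (all the names and the toy example are made up, and I’m checking properties on finite traces rather than giving proper LTL semantics): goals as reachability/safety properties evaluated on the agent’s abstraction of a trajectory, rather than on the raw world states.

```python
from typing import Callable, Hashable, Sequence

WorldState = dict          # toy stand-in for a full world state
AbstractState = Hashable   # an equivalence class, as the agent sees it

def abstract_trace(trace: Sequence[WorldState],
                   abstraction: Callable[[WorldState], AbstractState]) -> list:
    """Collapse each world state into its equivalence class under the agent's abstraction."""
    return [abstraction(s) for s in trace]

def eventually(pred: Callable[[AbstractState], bool]) -> Callable[[Sequence[AbstractState]], bool]:
    """Finite-trace version of LTL's F: 'I want to reach one of these states'."""
    return lambda trace: any(pred(s) for s in trace)

def never(pred: Callable[[AbstractState], bool]) -> Callable[[Sequence[AbstractState]], bool]:
    """Finite-trace version of LTL's G(not ...): 'I never want to reach this state'."""
    return lambda trace: all(not pred(s) for s in trace)

# Toy example: the agent only "sees" which room it is in, not the full world state.
abstraction = lambda s: s["room"]                        # the equivalence relation on world states
reach_vault = eventually(lambda room: room == "vault")   # reachability goal
avoid_lava = never(lambda room: room == "lava")          # the deontological-looking safety goal

trace = [{"room": "hall", "temp": 20}, {"room": "corridor", "temp": 21}, {"room": "vault", "temp": 19}]
classes = abstract_trace(trace, abstraction)
print(reach_vault(classes), avoid_lava(classes))         # True True
```

The abstraction map is exactly the equivalence relation from the second issue, and the “never reach” goal is the one that looks deontological; preferences between states would still have to come from something extra, like the partial order above.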
> I expect that asking “what properties do these utility functions have” will generally be more misleading than asking “what properties do these goals have”, because the former gives you an illusion of mathematical transparency. My tentative answer to the latter question is that, due to Moravec’s paradox, they will have the properties of high-level human thought more than the properties of low-level human thought. But I’m still pretty confused about this.
Agreed that the first step should be the properties of goals. I just also believe that if you get some nice properties of goals, you might know what constraints to add to utility functions to make them more “goal-like”.
Your last sentence seems to contradict what you wrote about Dennett. I understand it as you saying “goals would be like high-level human goals”, while your criticism of Dennett was that the intentional stance doesn’t necessarily work on NNs because they don’t have to have the same kind of goals as us. Am I wrong about one of those opinions?