Locality of goals
Introduction
Studying goal-directedness produces two kinds of questions: questions about goals, and questions about being directed towards a goal. Most of my previous posts focused on the second kind; this one shifts to the first kind.
Assume some goal-directed system with a known goal. The nature of this goal influences which safety issues the system might exhibit. If the goal focuses on the input, the system might wirehead itself and/or game its specification. On the other hand, if the goal lies firmly in the environment, the system might pursue convergent instrumental subgoals and/or destroy any unspecified value.
Locality aims at capturing this distinction.
Intuitively, the locality of the system’s goal captures how far away from the system one must look to check the accomplishment of the goal.
Let’s give some examples:
The goal of “My sensor reaches the number 23” is very local, probably maximally local.
The goal of “Maintain the temperature of the room at 23 °C” is less local, but still focused on a close neighborhood of the system.
The goal of “No death from cancer in the whole world” is even less local.
Locality isn’t about how the system extracts a model of the world from its input, but about whether and how much it cares about the world beyond itself.
Starting points
This intuition about locality came from the collision of two different classifications of goals: the first from Daniel Dennett and the second from Evan Hubinger.
Thermostats and Goals
In “The Intentional Stance”, Dennett explains, extends and defends… the intentional stance. One point he discusses is his liberalism: he is completely comfortable with admitting ridiculously simple systems like thermostats into the club of intentional systems, that is, with granting them meaningful mental states such as beliefs, desires and goals.
Lest we readers feel insulted at the comparison, Dennett nonetheless admits that the goals of a thermostat differ from ours.
Going along with the gag, we might agree to grant [the thermostat] the capacity for about half a dozen different beliefs and fewer desires—it can believe the room is too cold or too hot, that the boiler is on or off, and that if it wants the room warmer it should turn on the boiler, and so forth. But surely this is imputing too much to the thermostat; it has no concept of heat or of a boiler, for instance. So suppose we de-interpret its beliefs and desires: it can believe the A is too F or G, and if it wants the A to be more F it should do K, and so forth. After all, by attaching the thermostatic control mechanism to different input and output devices, it could be made to regulate the amount of water in a tank, or the speed of a train, for instance.
The goals and beliefs of a thermostat are thus not about heat and the room it is in, as our anthropomorphic bias might suggest, but about the binary state of its sensor.
Now, if the thermostat had more information about the world (a camera, a GPS position, general reasoning ability to infer the actual temperature from all its inputs), then Dennett argues its beliefs and goals would be much more related to the heat in the room.
The more of this we add, the less amenable our device becomes to serving as the control structure of anything other than a room-temperature maintenance system. A more formal way of saying this is that the class of indistinguishably satisfactory models of the formal system embodied in its internal states gets smaller and smaller as we add such complexities; the more we add, the richer or more demanding or specific the semantics of the system, until eventually we reach systems for which a unique semantic interpretation is practically (but never in principle) dictated (cf. Hayes 1979). At that point we say this device (or animal or person) has beliefs about heat and about this very room, and so forth, not only because of the system’s actual location in, and operations on, the world, but because we cannot imagine another niche in which it could be placed where it would work.
Humans, Dennett argues, are more like this enhanced thermostat, in that our beliefs and goals intertwine with the state of the world. Or put differently, when the world around us changes, it will almost always influence our mental states; whereas a basic thermostat might react in the exact same way in vastly different environments.
But as systems become perceptually richer and behaviorally more versatile, it becomes harder and harder to make substitutions in the actual links of the system to the world without changing the organization of the system itself. If you change its environment, it will notice, in effect, and make a change in its internal state in response. There comes to be a two-way constraint of growing specificity between the device and the environment. Fix the device in any one state and it demands a very specific environment in which to operate properly (you can no longer switch it easily from regulating temperature to regulating speed or anything else); but at the same time, if you do not fix the state it is in, but just plonk it down in a changed environment, its sensory attachments will be sensitive and discriminative enough to respond appropriately to the change, driving the system into a new state, in which it will operate effectively in the new environment.
Part of this distinction between goals comes from generalization, a property considered necessary for goal-directedness since Rohin’s initial post on the subject. But the two goals also differ in their “groundedness”: the thermostat’s goal lies completely in its sensors’ inputs, whereas human goals depend on things farther away, on the environment itself.
That is, these two goals have different locality.
Goals Across Cartesian Boundaries
The other classification of goals comes from Evan Hubinger, in a personal discussion. Assuming a Cartesian Boundary outlining the system and its inputs and outputs, goals can be functions of:
The environment. This includes most human goals, since we tend to refuse wireheading; hence the goal depends on something other than our brain state.
The input. A typical goal as a function of the input is the one ascribed to the simple thermostat: maintaining the number given by its sensor above some threshold. If we look at the thermostat without assuming that its goal is a proxy for something else, then this system would happily wirehead itself, as the goal IS the input.
The output. This one is a bit weirder, but it captures goals about actions: for example, the goal of twitching. If a robot only twitches (not even trying to keep twitching, just twitching), its goal seems to concern its output only.
The internals. Lastly, goals can depend on what happens inside the system. For example, a very depressed person might have the goal of “Feeling good”. If that is the only thing that matters, then it is a goal about their internal state, and nothing else.
Of course, many goals are functions of several parts of this quartet. Yet separating them allows a characterization of a given goal by the proportion of each.
Going back to Dennett’s example, the basic thermostat’s goal is a function of its input, while human goals tend to be functions of the environment. And once again, an important aspect of the difference appears to lie in how far from the system the goal-relevant information sits: locality.
What Is Locality Anyway?
Assuming some model of the world (possibly a causal DAG) containing the system, the locality of the goal is inversely proportional to the minimum radius of a ball, centered at the system, which suffices to evaluate the goal. Basically, one needs to look a certain distance away to check whether one’s goal is accomplished; locality measures this distance. The more local a goal, the less grounded it is in the environment, and the more susceptible it is to wireheading or to a change of environment without a change of internal state.
Running with this attempt at formalization, a couple of interesting points follow:
If the model of the world includes time, then locality also captures how far into the future and the past one must look to evaluate the goal. This is basically the short-sightedness of a goal, as exemplified by variants of twitching robots: the robot that simply twitches; the one that wants to maximize its twitching in the next second; the one that wants to maximize its twitching in the next 2 seconds; … up to the robot that wants to maximize the time it spends twitching over the whole future.
Despite the previous point, locality differs from the short-term/long-term split. An example of a short-term goal (or one-shot goal) is wanting an ice cream: after its accomplishment, the goal simply dissolves. An example of a long-term goal (or continuous goal) is bringing about and maintaining world peace: something that is never over, but instead constrains the shape of the whole future. Short-sightedness differs from short-term, as a short-sighted goal can be long-term: “for all times t (in hours to simplify), I need to eat an ice cream in the interval [t-4,t+4]”.
Where we put the center of the ball inside the system is probably irrelevant, as the classes of locality should matter more than the exact distance.
An alternative definition would allow the center of the ball to be anywhere in the world, and make locality inversely proportional to the distance from the center to the system plus the radius. This captures goals that do not depend on the state of the system, but would give numbers similar to the initial definition.
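To make this concrete, here is a minimal sketch in a toy grid world (the setup and all function names are my own illustration, not part of the post): the world model is just a set of coordinates, the goal is assumed checkable from the cells in `goal_cells`, and both the original definition and the alternative one are computed as inverse distances.

```python
# Hypothetical sketch: locality of a goal in a toy 2D world.
import math

def radius_from(center, goal_cells):
    """Smallest radius of a ball at `center` covering every goal-relevant cell."""
    return max(math.dist(center, c) for c in goal_cells)

def locality(system_pos, goal_cells):
    """Original definition: inverse of the minimal radius centered at the system."""
    return 1.0 / max(radius_from(system_pos, goal_cells), 1e-9)

def locality_anywhere(system_pos, goal_cells, candidate_centers):
    """Alternative definition: the ball may be centered anywhere; locality is
    the inverse of (distance from system to center + radius)."""
    best = min(math.dist(system_pos, c) + radius_from(c, goal_cells)
               for c in candidate_centers)
    return 1.0 / max(best, 1e-9)

system = (0.0, 0.0)
sensor_goal = [(0.0, 0.0)]                          # thermostat-like: only the system's cell
room_goal = [(3.0, 4.0), (-3.0, 4.0), (0.0, -5.0)]  # spread across the "room"

print(locality(system, sensor_goal))  # huge: maximally local
print(locality(system, room_goal))    # 1/5 = 0.2: less local
```

As expected, the sensor-only goal comes out (near-)maximally local, while the room-wide goal's locality shrinks with the radius needed to see all of it.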
In summary, locality is a measure of the distance at which information about the world matters for a system’s goal. It appears in various guises in different classifications of goals, and underlies multiple safety issues. What I give is far from a formalization; it is instead a first exploration of the concept, with open directions to boot. Yet I believe that the concept can be put into more formal terms, and that such a measure of locality captures a fundamental aspect of goal-directedness.
Thanks to Victoria Krakovna, Evan Hubinger and Michele Campolo for discussions on this idea.
Nice post!
One related thing I was thinking about last week: part of the idea of abstraction is that we can pick a Markov blanket around some variable X, and anything outside that Markov blanket can only “see” abstract summary information f(X). So, if we have a goal which only cares about things outside that Markov blanket, then that goal will only care about f(X) rather than all of X. This holds for any goal which only cares about things outside the blanket. That sounds like instrumental convergence: any goal which does not explicitly care about things near X itself, will care only about controlling f(X), not all of X.
This isn’t quite the same notion of goal-locality that the OP is using (it’s not about how close the goal-variables are to the agent), but it feels like there’s some overlapping ideas there.
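A tiny illustration of this point (the variables and the summary function below are made up for the example): any goal evaluated strictly outside the blanket can distinguish states of X only through the summary f(X).

```python
# Hypothetical illustration: a goal that only reads variables outside a
# Markov blanket around X can depend on X only through the summary f(X).

def f(x):
    # Abstract summary visible outside the blanket (here: just the sum).
    return sum(x)

def outside_goal(summary):
    # A goal defined purely on what leaks through the blanket.
    return summary >= 10

x1 = [1, 2, 7]   # two internally different states of X...
x2 = [5, 5, 0]   # ...with the same summary f(X) = 10

assert f(x1) == f(x2)
assert outside_goal(f(x1)) == outside_goal(f(x2))  # the goal cannot tell them apart
```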
The more I think about it, the more I come to believe that locality is very related to abstraction. Not the distance part necessarily, but the underlying intuition. If my goal is not “about the world”, then I can throw almost all information about the world except a few details and still be able to check my goal. The “world” of the thermostat is in that sense a very abstracted map of the world where anything except the number on its sensor is thrown away.
Thanks! Glad that I managed to write something that was not causally or rhetorically all wrong. ^^
That makes even more sense to me than you might think. My intuitions about locality come from its uses in distributed computing, where it measures both how many rounds of communication are needed to solve a problem and how far into the communication graph one needs to look to compute one’s own output. This matches my use of locality here.
On the other hand, recent work on distributed complexity also studied the volume complexity of a problem: the size of the subgraph one needs to look at, which might be very different from a ball. The only real constraint is connectedness. Modulo the usual “exactness issue”, which we can deal with by replacing “the node is not used” by “only f(X) is used”, this looks a lot like your idea.
Planned summary for the Alignment Newsletter:
Thanks for the summary! It’s representative of the idea.
Just by curiosity, how do you decide for which posts/paper you want to write an opinion?
I ask myself if there’s anything in particular I want to say about the post / paper that the author(s) didn’t say, with an emphasis on ensuring that the opinion has content. If yes, then I write it.
(Sorry, that’s not very informative, but I don’t really have a system for it.)
No worries, that’s a good answer. I was just curious, not expecting a full-fledged system. ;)
[I’m not sure I’m understanding correctly, so do correct me where I’m not getting your meaning. Pre-emptive apologies if much of this gets at incidental details and side-issues]
The idea seems interesting, and possibly important.
Some thoughts:
(1) Presumably you mean to define locality as the distance (our distance) that the system would (?need to?) look to check its own goal. The distance we’d need to look doesn’t seem safety relevant, since that doesn’t tell us anything about system behaviour.
So we need to reason within the system’s own model to understand ‘where’ it needs to look—but we need to ground that ‘where’ in our world model to measure the distance.
Let’s say we can see that a system X has achieved its goal by our looking at its local memory state (within 30cm of X). However, X must check another memory location (200 miles away in our terms) to know that it’s achieved its goal.
In that case, I assume: Locality = 1 / (200 miles) ??
(I don’t think it’s helpful to use: Locality = 1 / (30cm), if the system’s behaviour is to exert influence over 200 miles)
(2) I don’t see a good way to define locality in general (outside artificially simple environments), since for almost all goals the distance to check a goal will be contingent on the world state. The worst-case distance will often be unbounded. E.g. “Keep this room above 23 degrees” isn’t very local if someone moves the room to the other side of the continent, or splits it into four pieces and moves each into a separate galaxy.
This applies to the system itself too. The system’s memory can be put on the other side of the galaxy, or split up… (if you’d want to count these as having low distance from the system, then this would be a way to cheat for any environmental goal: split up the system and place a part of it next to anything in the environment that needs to be tested)
You’d seem to need some caveats to rule out weird stuff, and even then you’d probably end up with categories: either locality zero (for almost any environmental goal), or locality around 1 (for any input/output/internal goal).
If things go that way, I’m not sure having a number is worthwhile.
(3a) Where there’s uncertainty over world state, it might be clearer to talk in terms of probabilistic thresholds.
E.g. my goal of eating ice cream doesn’t dissolve, since I never know I’ve eaten an ice cream. In my world model, the goal of eating an ice cream *with certainty* has locality zero, since I can search my entire future light-cone and never be sure I achieved that goal (e.g. some crafty magician, omega, or a VR machine might have deceived me).
I think you’d need to parameterise locality:
To know whether you’ve achieved g with probability > p, you’d need to look (1/locality) meters.
Then a relevant safety question is the level of certainty the system will seek.
(3b) Once things are uncertain, you’d need a way to avoid most goal-checking being at near-zero distance: a suitable system can check most goals by referring to its own memory. For many complex goals that’s required, since it can’t simultaneously perceive all the components. The goal might not be “make my memory reflect this outcome”, but “check that my memory reflects this outcome” is a valid test (given that the system tends not to manipulate its memory to perform well on tests).
(4) I’m not sure it makes sense to rush to collapse locality into one dimension. In general we’ll be interested in some region (perhaps not a connected region), not only in a one-dimensional representation of that region.
Currently, caring about the entire galaxy gets the same locality value as caring about one vase (or indeed one memory location) that happens to be on the other side of the galaxy. Splitting a measure of displacement from a measure of region size might help here.
If you want one number, I think I’d go with something focused on the size of the goal-satisfying region. Maybe something like:
1 / [The minimum over the sum of radii of balls in (some set of balls of minimum radius k, such that any information needed to check the goal is contained within at least one of the balls)]
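A rough sketch of that proposal (the cover-by-clusters setup is my own simplification, and I use a crude stand-in for the true minimum enclosing ball): score a goal by the summed radii of balls covering its goal-relevant points, so a compact region far away scores better than a galaxy-sized one.

```python
# Hypothetical sketch of the proposed region-size measure.
import math

def ball_radius(points):
    """Radius of the smallest ball centered at one of the cluster's own points
    (a crude stand-in for the true minimum enclosing ball)."""
    return min(max(math.dist(c, p) for p in points) for c in points)

def region_locality(clusters):
    """1 / (sum of radii of the covering balls); `clusters` is a chosen cover."""
    total = sum(ball_radius(cl) for cl in clusters)
    return 1.0 / max(total, 1e-9)

# One vase far away: a single tight cluster -> high locality despite distance.
vase = [[(100.0, 0.0), (100.5, 0.0)]]
# Goal spread over a huge region -> low locality.
galaxy = [[(-50.0, 0.0), (50.0, 0.0), (0.0, 60.0)]]

print(region_locality(vase))    # 2.0
print(region_locality(galaxy))  # small
```

Under this measure the distant vase and the whole galaxy no longer get the same number, which is the displacement-versus-region-size split suggested above.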
(5) I’m not sure humans do tend to avoid wireheading. What we tend to avoid is intentionally and explicitly choosing to wirehead. If it happens without our attention, I don’t think we avoid it by default.
Self-deception is essentially wire-heading; if we think that’s unusual, we’re deceiving ourselves :)
This is important, since it highlights that we should expect wireheading by default. It’s not enough for a highly capable system not to be actively aiming to wirehead. To avoid accidental/side-effect wireheading, a system will need to be actively searching for evidence, and thoroughly analysing its input for wireheading signs.
Another way to think about this:
There aren’t actually any “environment” goals.
”Environment-based” is just a shorthand for: (input + internal state + output)-based
So to say a goal is environment-based, is just to say that we’re giving ourselves the maximal toolkit to avoid wireheading. We should expect wireheading unless we use that toolkit well.
Do you agree with this? If not, what exactly do you mean by “(a function of) the environment”?
If so, then from the system’s point of view, isn’t locality always about 1: since it can only ever check (input + internal state + output)? Or do we care about the distance over which the agent must have interacted in gathering the required information? I still don’t see a clean way to define this without a load of caveats.
Overall, if the aim is to split into “environmental” and “non-environmental” goals, I’m not sure I think that’s a meaningful distinction—at least beyond what I’ve said above (that you can’t call a goal “environmental” unless it depends on all of input, internal-state and output).
I think our position is that of complex thermostats with internal state.