[I’m not sure I’m understanding correctly, so do correct me where I’m not getting your meaning. Pre-emptive apologies if much of this gets at incidental details and side-issues]
The idea seems interesting, and possibly important.
Some thoughts:
(1) Presumably you mean to define locality in terms of the distance (in our terms) over which the system would (need to?) look to check its own goal. The distance *we’d* need to look over doesn’t seem safety-relevant, since it doesn’t tell us anything about the system’s behaviour.
So we need to reason within the system’s own model to understand ‘where’ it needs to look—but we need to ground that ‘where’ in our world model to measure the distance.
Let’s say we can see that a system X has achieved its goal by our looking at its local memory state (within 30cm of X). However, X must check another memory location (200 miles away in our terms) to know that it’s achieved its goal.
In that case, I assume: Locality = 1 / (200 miles) ??
(I don’t think it’s helpful to use: Locality = 1 / (30cm), if the system’s behaviour is to exert influence over 200 miles)
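If I have that right, the definition I’m imagining is roughly this (my notation, not yours):

$$\mathrm{locality}(g) \;=\; \frac{1}{d(g)}, \qquad d(g) \;=\; \min\{\, r : \text{by checking everything within distance } r \text{ of itself (in our metric), the system can verify } g \,\}$$

so in the example above, d(g) = 200 miles rather than 30cm, and locality = 1 / (200 miles).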
(2) I don’t see a good way to define locality in general (outside artificially simple environments), since for almost all goals the distance to check a goal will be contingent on the world state. The worst-case distance will often be unbounded. E.g. “Keep this room above 23 degrees” isn’t very local if someone moves the room to the other side of the continent, or splits it into four pieces and moves each into a separate galaxy.
This applies to the system itself too. The system’s memory can be put on the other side of the galaxy, or split up… (if you’d want to count these as having low distance from the system, then this would be a way to cheat for any environmental goal: split up the system and place a part of it next to anything in the environment that needs to be tested).
You’d seem to need some caveats to rule out weird stuff, and even then you’d probably end up with categories: either locality zero (for almost any environmental goal), or locality around 1 (for any input/output/internal goal).
If things go that way, I’m not sure having a number is worthwhile.
(3a) Where there’s uncertainty over world state, it might be clearer to talk in terms of probabilistic thresholds. E.g. my goal of eating ice cream doesn’t dissolve, since I never know I’ve eaten an ice cream. In my world model, the goal of eating an ice cream *with certainty* has locality zero, since I could search my entire future light-cone and never be sure I’d achieved that goal (e.g. some crafty magician, Omega, or a VR machine might have deceived me).
I think you’d need to parameterise locality: to know whether you’ve achieved g with probability > p, you’d need to look within (1/locality) metres.
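I.e. something like (again my notation, not from the post):

$$d(g, p) \;=\; \min\{\, r : \text{by checking everything within distance } r \text{ of itself, the system can conclude } \Pr(g \text{ achieved}) > p \,\}, \qquad \mathrm{locality}(g, p) \;=\; \frac{1}{d(g, p)}$$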
Then a relevant safety question is the level of certainty the system will seek.
(3b) Once things are uncertain, you’d need a way to avoid most goal-checking being at near-zero distance: a suitable system can check most goals by referring to its own memory. For many complex goals that’s required, since it can’t simultaneously perceive all the components. The goal might not be “make my memory reflect this outcome”, but “check that my memory reflects this outcome” is a valid test (given that the system tends not to manipulate its memory to perform well on tests).
(4) I’m not sure it makes sense to rush to collapse locality into one dimension. In general we’ll be interested in some region (perhaps not a connected region), not only in a one-dimensional representation of that region.
Currently, caring about the entire galaxy gets the same locality value as caring about one vase (or indeed one memory location) that happens to be on the other side of the galaxy. Splitting a measure of displacement from a measure of region size might help here.
If you want one number, I think I’d go with something focused on the size of the goal-satisfying region. Maybe something like: 1 / [the minimum, over all sets of balls each of radius at least k whose union contains all the information needed to check the goal, of the sum of the balls’ radii].
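In symbols, something like this (with info(g) as my shorthand for the set of locations holding information needed to check the goal g, and k the minimum allowed ball radius):

$$\mathrm{locality}(g) \;=\; \left(\, \min_{\mathcal{B}} \sum_{B(c_i, r_i) \in \mathcal{B}} r_i \,\right)^{-1}, \qquad \mathcal{B} \text{ ranging over sets of balls with each } r_i \ge k \text{ and } \mathrm{info}(g) \subseteq \bigcup_i B(c_i, r_i)$$

(Presumably the point of the minimum radius k is to stop the sum being driven towards zero by covering each relevant point with an arbitrarily small ball.)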
(5) I’m not sure humans do tend to avoid wireheading. What we tend to avoid is intentionally and explicitly choosing to wirehead. If it happens without our attention, I don’t think we avoid it by default. Self-deception is essentially wireheading; if we think that’s unusual, we’re deceiving ourselves :)
This is important, since it highlights that we should expect wireheading by default. It’s not enough for a highly capable system not to be actively aiming to wirehead. To avoid accidental/side-effect wireheading, a system will need to be actively searching for evidence of wireheading, and thoroughly analysing its input for signs of it.
Another way to think about this: there aren’t actually any “environment” goals. “Environment-based” is just a shorthand for (input + internal state + output)-based.
So to say a goal is environment-based is just to say that we’re giving ourselves the maximal toolkit to avoid wireheading. We should expect wireheading unless we use that toolkit well.
Do you agree with this? If not, what exactly do you mean by “(a function of) the environment”?
If so, then from the system’s point of view, isn’t locality always about 1, since it can only ever check (input + internal state + output)? Or do we care about the distance over which the agent must have interacted in gathering the required information? I still don’t see a clean way to define this without a load of caveats.
Overall, if the aim is to split into “environmental” and “non-environmental” goals, I’m not sure that’s a meaningful distinction, at least beyond what I’ve said above (that you can’t call a goal “environmental” unless it depends on all of input, internal state and output).
I think our position is that of complex thermostats with internal state.