Premise (as stated by John): “a system steers far-away parts of the world into a relatively-small chunk of their state space”
Desired conclusion: The system is very likely (probability approaching 1 with increasing model size / optimization power / whatever) consequentialist, in that it has an internal world-model and search process.
But why would we expect this to be true? What are the intuitions informing this conjecture?
Off the top of my head:
The system would need to have a world-model because controlling the world at some far-away point requires (1) controlling the causal path between the system and that point, and (2) offsetting the effects of the rest of the world on that point, both of which require predicting the world’s effects and counteracting them in advance. And any internal process which lets a system do that would be a “world model”, by definition.
The system would need some way to efficiently generate actions that counteract the world’s intrusions upon what it cares about, despite being embedded in the world (i.e., having limited knowledge, processing capacity, and memory) and frequently encountering attacks it has never encountered before. Fortunately, there seem to be general-purpose algorithm(s) that can “extract” solutions to novel problems from a world-model in a goal-agnostic fashion, and under embeddedness/real-world conditions, optimizing systems are likely (or required) to converge on them.
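To make the “goal-agnostic extraction” point a bit more concrete, here’s a minimal toy sketch of my own (not anything from John’s or Thomas’ framing, and all names like `world_model` and `is_goal` are hypothetical): a generic search procedure that takes the world-model and the optimization target as separate inputs, so the same machinery can be pointed at arbitrary goals.

```python
from collections import deque

def plan(world_model, initial_state, is_goal, actions, max_depth=20):
    """Toy goal-agnostic planner: breadth-first search over a world-model.

    `world_model(state, action) -> next_state` is the system's internal model of
    how the world responds to its actions; `is_goal(state)` is the optimization
    target, supplied separately. The search itself never inspects the goal's
    content -- it only queries the predicate -- which is the sense in which it
    "extracts" solutions to novel problems in a goal-agnostic fashion.
    """
    frontier = deque([(initial_state, [])])
    visited = {initial_state}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path  # an action sequence steering the world into the target set
        if len(path) >= max_depth:
            continue  # embeddedness: bounded compute/memory forces a finite horizon
        for action in actions:
            nxt = world_model(state, action)
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, path + [action]))
    return None  # no plan found within the bounded horizon
```

The relevant property is that swapping in a different `is_goal` retargets the whole system without touching the search or the model, which is what makes this kind of algorithm a plausible convergence point.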
Put like this, there are obvious caveats that need to be introduced. For example, the “optimizing system + optimization target” system needs to be “fully exposed” to the world — if we causally isolate it from most/all of the world (e.g., hide it behind an event horizon), many optimization targets would not require the optimizing system to be a generalist consequentialist with a full world-model. Simple heuristics that counteract whatever limited external influences remain would suffice.
Another way to look at this is that the system needs to be steering against powerful counteracting influences, or have the potential to arrive at strategies to steer against arbitrary/arbitrarily complex counteracting influences. Not just a thermostat, but a thermostat that can keep the temperature in a room at a certain level even if there’s a hostile human civilization trying to cool the room down, stuff like this.
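As a toy contrast (again purely illustrative, not part of the original argument), the non-robust regime is the kind of controller that needs no world-model at all, something like:

```python
def simple_thermostat(temp, target=21.0, band=0.5):
    """The 'simple heuristics suffice' regime: with only mild, predictable
    disturbances, a fixed feedback rule keeps the temperature near the target
    without any world-model or search. (Toy numbers, in Celsius.)"""
    if temp < target - band:
        return "heat"
    if temp > target + band:
        return "cool"
    return "off"
```

Against a hostile civilization actively cooling the room, no fixed rule like this works; the controller would have to model the adversary’s interventions and search for counter-moves, i.e., something shaped like the planner sketched above, with a world-model rich enough to include the adversary.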
Which can probably all be encompassed by a single change to the premise: “a system robustly steers far-away parts of the world into a relatively-small chunk of their state space”.
I actually originally intended robustness to be part of the problem statement, and I was so used to that assumption that I didn’t notice until after writing the post that Thomas’ statement of the problem didn’t mention it. So thank you for highlighting it!
Also, in general, it is totally fair game for a proposed solution to the problem to introduce some extra conditions (like robustness). Of course there’s a very subjective judgement call about whether a condition is too restrictive for a proof/disproof to “count”, but that’s the sort of thing where a hindsight judgement call is in fact important, and a judge should think it through and put in some effort to explain their reasoning.