I imagine an AGI world-model being a bit like a giant souped-up version of a probabilistic graphical model that can be learned from scratch and updated on the fly. I agree that if there’s a node that corresponds to “I get turned off”, and you know where it is, then you can block any chain of inference that passes through that node, which amounts to the same thing as deleting the node, i.e. “making the agent not know that this is a thing that can happen”. Or a different approach would be, you could prevent that node from getting painted with a negative value (= reward prediction), or something like that, which vaguely corresponds to “I kinda like the idea that I can get turned off” if you do it right.
The big problem where I’m a bit stumped is how to reliably find the “I get turned off” node in the model. The world-model is going to be learned and changeable (I assume!). If you delete the node, the system could reinvent it. The system could piece together the existence of “I get turned off” as an abstract possibility having never seen it, or come up with four disconnected ways to think about the same thing, and then you need to find all four. I have thoughts but I’m interested in hearing yours. Or do you imagine that the programmer puts in the world-model by hand, or something?
Or do you imagine that the programmer puts in the world-model by hand, or something?
That is exactly what I imagine. However, only certain key parts of the planning world model are hand-coded by the programmer, not the whole thing.
I imagine an AGI world-model being a bit like a giant souped-up version of a probabilistic graphical model that can be learned from scratch and updated on the fly
Yes, that is usually how people imagine it. What I am doing in counterfactual planning is to go more deeply than usual into the details of how these giant souped-up models get built inside the agent's compute core.
The SI agent I specify is one that builds its planning world model p to have the exact macro-level structure shown in the diagram that defines p. The only degree of freedom that the learning system has is to determine the function L, which defines only what happens at the level of detail below: what happens inside the nodes St.
If you want to draw the whole planning world p as a giant souped-up
version of a probabilistic graphical model, you can do so by filling
in the insides of the St nodes with additional nodes and arrows,
nodes and arrows whose detailed structure is specified by L:
By making the SI agent construct its planning world models as depicted
above, I can reliably build counterfactual planning worlds, even if I
have no reliable way to implement something that can dig deep into
the structures inside each St node. I don’t have to dig deep into
L to locate and change the bits that predict whether the off switch
will be pressed. I have set up other parts of the model so that in
the planning world, pressing the off switch has no effect on the
planning world compute core’s ability to take actions At.
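To make that split concrete, here is a minimal toy sketch (my own illustration, not code from the paper, with made-up state variables): the macro-level rollout structure is hand-coded, the learned L only fills in what happens inside the St nodes, and in the planning world a button press has no effect on the core's ability to act. For simplicity the toy assumes perfect learning, L=S, so a single transition function plays both roles.

```python
def transition_L(state, action):
    """Stand-in for the learned function L: predicts the next environment
    state, including the stop button position, from the current state and
    the action taken by the compute core.  (Toy dynamics, assuming L = S.)"""
    pressed = state["button_pressed"] or state["paperclips"] >= 3
    made = state["paperclips"] + (1 if action == "make_paperclip" else 0)
    return {"button_pressed": pressed, "paperclips": made}

def learning_world_rollout(state, policy, horizon):
    """Learning world S: the real compute core stops, i.e. takes only Null
    actions, once the button has been pressed."""
    for _ in range(horizon):
        action = "Null" if state["button_pressed"] else policy(state)
        state = transition_L(state, action)
    return state

def planning_world_rollout(state, policy, horizon):
    """Hand-coded macro structure of the planning world p: the button state
    is still predicted inside the S_t nodes, but pressing it has no effect
    on the planning world core's ability to take actions A_t."""
    for _ in range(horizon):
        action = policy(state)             # the core acts regardless of the button
        state = transition_L(state, action)
    return state

policy = lambda state: "make_paperclip"
start = {"button_pressed": False, "paperclips": 0}
print(learning_world_rollout(dict(start), policy, horizon=6))   # stops after the press
print(planning_world_rollout(dict(start), policy, horizon=6))   # keeps acting anyway
```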
This mixing of hand-coded and learned world model parts happens in
almost all machine learning agent designs I can think of. But most
models of machine learning push these mixing steps into the
background. In counterfactual planning, I tried to construct a
viewpoint that depicts the exact mixing being done as a foreground
feature of the planning world diagram.
Another feature of the graphical notation used is that it makes certain machine learning failure modes more visible. In the above picture, there is the obvious possibility that the arrows which traverse the walls of the nodes St will not all be connected to the right nodes of the learned model inside, as these connections are defined by the learned L. In the paper, I define this as a failure of symbol grounding, and I examine this failure mode in the context of the reasonableness constraint L≈S. This leads to some interesting insights into the role of random exploration and Occam's razor in symbol grounding. (For the details, see section 10 of the paper. I am starting to wonder if I should turn that section into a post in this sequence.)
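As a rough illustration of that grounding concern (my own toy, with hypothetical variable names, not from the paper): the wall-crossing arrows are supposed to read specific quantities out of whatever internal nodes the learning system happens to invent, and nothing in the learning process by itself guarantees that they end up attached to the right ones.

```python
# Internal nodes that the learning system happened to invent inside an S_t node:
internal_state = {"latent_7": 1, "latent_12": 0}   # latent_7 actually tracks the button

# What the designer intends a wall-crossing arrow to mean:
intended_grounding = {"button_pressed": "latent_7"}

# What a poorly learned L actually wired the arrow to:
mislearned_grounding = {"button_pressed": "latent_12"}   # grounded on the wrong node

def read_button(grounding):
    """Read the declared quantity out of the node via the learned connection."""
    return internal_state[grounding["button_pressed"]]

print(read_button(intended_grounding), read_button(mislearned_grounding))   # 1 vs 0
```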
Hmm, maybe I’m confused. Couple more questions, sorry if you’ve already answered them: (1) What are the differences / advantages / disadvantages between what you’re proposing vs “make an off switch but don’t tell the AGI about it”? (2) do you expect there to be another copy of the off-switch and its consequences (M) inside the St nodes? If so, is it one of “the arrows which traverse the walls of the node St”? Because I don’t see any arrows from M to St.
(1) What are the differences / advantages / disadvantages
between what you’re proposing vs “make an off switch but
don’t tell the AGI about it”?
Good questions.

With the approach "make an off switch but don't tell the AGI about it", if you have a reasonably curious AGI agent, it is difficult to stop it from eventually learning about the existence of the stop button. So your safety with this approach really depends on what the agent will do after it learns about the stop button you made. Will it start caring about preventing the stop button from being pressed?
It is easy to construct an AI agent design that will definitely start caring about preventing its stop button from being pressed in this case. AGI safety thought experiments tend to silently assume that you will always end up with this dangerous type of design in an AGI agent. But this is not the only option.
You can interpret counterfactual planning as a design method that will
stop you from ending up with the above dangerous type of AGI design.
The SI agent specification forces you to construct an agent
implementation with a very low probability that it starts to care.
I need to explain the very low probability above. The key here is
that perfect learning, L=S, provably produces a zero probability
that the SI agent has a direct incentive to care. The SI agent can
only get a direct incentive to care if L≠S. For it to start
caring via this route, the machine learning system has to make some
specific and large mistakes in learning the properties of the agent
environment.
Here is an example of a large learning mistake that will do it. Say
that L≠S to such an extent that, in the planning world
projected by L, pressing the stop button will set off some
explosives that will destroy the agent’s paperclip factory. This is a
large mistake because there are absolutely no such explosives in the
learning world. But with that L≠S, the agent will definitely
have an incentive to care about the button.
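To put a toy number on that, here is a small worked sketch (my own construction, not from the paper) of a one-shot planning decision, assuming the button would otherwise get pressed in the planning world. With a correct L, disabling the button only costs effort; with the mistaken "explosives" L, the planning world suddenly rewards protecting the button.

```python
def planning_world_value(protect_button, L_predicts_explosives):
    """Planning-world payoff of a one-shot plan, under a learned L that may
    wrongly predict that a button press blows up the paperclip factory."""
    button_gets_pressed = not protect_button            # toy assumption: someone presses it
    factory_destroyed = button_gets_pressed and L_predicts_explosives
    paperclips = 0 if factory_destroyed else 10
    cost_of_protecting = 1                               # effort spent disabling the button
    return paperclips - (cost_of_protecting if protect_button else 0)

# Correct L (no explosives anywhere): protecting the button is just wasted effort.
print(planning_world_value(True, False), planning_world_value(False, False))   # 9 10
# Mistaken L: the planning world now pays the agent to keep the button unpressed.
print(planning_world_value(True, True), planning_world_value(False, True))     # 9 0
```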
To achieve the above low probability of caring, we must therefore
avoid large learning failures. This implies that we need to start the
agent in a state where L=L(O0)≈S, where the
learning system already starts out with a very reasonable approximation of
the agent environment, and where it is unlikely to make large
approximation mistakes later on.
(2) do you expect there to be another
copy of the off-switch and its consequences (M) inside the St nodes?
Short answer: no, I do not expect there to be any copy of M inside
the nodes St.
(Cross-reference: the M mentioned above is the M in the section "The possibility of learned self-knowledge" of the previous post in this sequence. It represents potentially learned self-knowledge about the learning world agent compute core.)
Longer answer: not a copy of M, but under reasonable learning I do expect there to be a projected model of the stop button itself inside the St nodes. I expect that this model will deliver reasonably correct predictions of future stop button states to the arrows that emerge from these nodes.
So the stop button signal will be delivered to the planning world
compute core, but the planning world compute core is a counterfactual
one. Unlike the core in M, it will not stop (in the sense of taking
only Null actions in future) when the button is pressed.
If so, is it one of “the arrows which traverse the walls of the node
St”? Because I don’t see any arrows from M to St.
If a learning system failure were to put some version of M inside the nodes St, then in my drawing above this version would look like a set of small nodes inside the St nodes, connected by some arrows, and these nodes might also be connected, directly or indirectly, to the outside arrows that traverse the node walls.
I consider the possibility of such an M appearing in the nodes St
very unlikely if one has a high-quality machine learning system, for
reasons I discuss in section 10.2 of the paper.
But if you want to reason about some theoretically possible large
learning mistakes that put an M inside the St nodes, some interesting
examples are:
L(s′,s,a) = S(s′,s,M(s)) (will make the SI agent freeze up),
L(s′,s,a) = ½S(s′,s,a) + ½S(s′,s,M(s)) (will produce an incentive in the SI agent to prevent the stop button from being pressed).
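For concreteness, here is a toy rendering (my own sketch, with made-up dynamics) of those two mistaken L functions, where M(s) returns the action the learning world core is predicted to take, i.e. Null once the button has been pressed:

```python
import random

def S_sample(state, action):
    """Sample from the true transition S: 'make_paperclip' adds a paperclip,
    'Null' changes nothing, and the button state is carried along unchanged."""
    made = state["paperclips"] + (1 if action == "make_paperclip" else 0)
    return {"paperclips": made, "button_pressed": state["button_pressed"]}

def M(state):
    """Learned self-knowledge: the learning world core takes only Null actions
    once the button has been pressed, and otherwise makes paperclips."""
    return "Null" if state["button_pressed"] else "make_paperclip"

def L_freeze(state, action):
    # L(s',s,a) = S(s',s,M(s)): the planned action is ignored entirely, so
    # every plan looks equally good and the SI agent freezes up.
    return S_sample(state, M(state))

def L_mixture(state, action):
    # L(s',s,a) = ½S(s',s,a) + ½S(s',s,M(s)): half of the planning world's
    # probability mass follows M(s), which goes to Null after a button press,
    # so a pressed button costs expected paperclips in the planning world.
    return S_sample(state, action) if random.random() < 0.5 else S_sample(state, M(state))

def expected_paperclips(L, button_pressed, n=10_000):
    state = {"paperclips": 0, "button_pressed": button_pressed}
    return sum(L(state, "make_paperclip")["paperclips"] for _ in range(n)) / n

print(expected_paperclips(L_mixture, button_pressed=False))  # ~1.0
print(expected_paperclips(L_mixture, button_pressed=True))   # ~0.5, hence the incentive
```

Under L_freeze, by contrast, the same expectation comes out identical for every candidate action, which is the freeze-up.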