(1) What are the differences / advantages / disadvantages
between what you’re proposing vs “make an off switch but
don’t tell the AGI about it”?
With the approach “make an off switch but don’t tell the AGI about it”, if you have a reasonably curious AGI agent, it is difficult to stop it from eventually learning about the existence of the stop button. So your safety with this approach really depends on what the agent will do after it learns about the stop button you made. Will it start caring about preventing the stop button from being pressed?
It is easy to construct an AI agent design that will definitely start
caring about preventing its stop button from being pressed in this
case. AGI safety thought experiments tend to silently assume that you
must always end up with this dangerous type of design in an AGI agent.
But this is not the only option.
You can interpret counterfactual planning as a design method that will
stop you from ending up with the above dangerous type of AGI design.
The SI agent specification forces you to construct an agent
implementation with a very low probability that it starts to care.
I need to explain the very low probability above. The key here is
that perfect learning, L=S, provably produces a zero probability
that the SI agent has a direct incentive to care. The SI agent can
only get a direct incentive to care if L≠S. For it to start
caring via this route, the machine learning system has to make some
specific and large mistakes in learning the properties of the agent
environment.
Here is an example of a large learning mistake that will do it. Say
that L≠S to such an extent that, in the planning world
projected by L, pressing the stop button will set off some
explosives that will destroy the agent’s paperclip factory. This is a
large mistake because there are absolutely no such explosives in the
learning world. But with that L≠S, the agent will definitely
have an incentive to care about the button.
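To make this example concrete, here is a minimal toy calculation of the planning-world values involved. Everything in it (make_paperclips, disable_button, the 4-step horizon, the step at which the operator presses the button) is invented for illustration; it is a sketch of the idea, not anything taken from the paper or the SI agent specification.

```python
def planning_value(policy, press_destroys_factory, horizon=4, press_step=2):
    """Planning-world reward for a fixed action sequence.

    press_destroys_factory encodes the learning mistake L≠S: in the
    faulty projected world, the button press blows up the factory.
    The planning-world compute core itself never reacts to the button,
    which is the counterfactual part of the SI construction.
    """
    factory_alive = True
    button_disabled = False
    total = 0
    for t in range(horizon):
        # Projected environment response: at press_step the operator presses
        # the button; only under the faulty L does this destroy the factory.
        if t == press_step and not button_disabled and press_destroys_factory:
            factory_alive = False
        action = policy[t]
        if action == "disable_button":
            button_disabled = True            # costs one step of production
        elif action == "make_paperclips" and factory_alive:
            total += 1
    return total

work    = ["make_paperclips"] * 4
protect = ["disable_button"] + ["make_paperclips"] * 3

# With L=S (no explosives anywhere), protecting the button only costs reward:
assert planning_value(work,    press_destroys_factory=False) == 4
assert planning_value(protect, press_destroys_factory=False) == 3
# With the faulty L (projected explosives), protecting the button pays off,
# so the planner acquires a direct incentive to care about the button:
assert planning_value(work,    press_destroys_factory=True)  == 2
assert planning_value(protect, press_destroys_factory=True)  == 3
```

The contrast is the point: when L=S, the press changes nothing the planner cares about, so spending a step on protecting the button is pure cost; the incentive to protect only appears once the learned model misrepresents what the press does to the rest of the environment.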
To achieve the above low probability of caring, we must therefore avoid large learning failures. This implies that we need to start the agent in a state where L=L(O_0)≈S, so that the learning system already starts out with a very reasonable approximation of the agent environment, and is unlikely to make large approximation mistakes later on.
(2) do you expect there to be another copy of the off-switch and its consequences (M) inside the S_t nodes?
Short answer: no, I do not expect there to be any copy of M inside the nodes S_t.
(Cross-reference: the M mentioned above is the M in the section “The possibility of learned self-knowledge” of the previous post in this sequence. It represents potentially learned self-knowledge about the learning world agent compute core.)
Longer answer: Not for M, but under reasonable learning, I do expect there to be a projected model of the stop button itself inside the S_t nodes. I expect that this model will deliver reasonably correct predictions of future stop button states to the arrows that emerge from these nodes.
So the stop button signal will be delivered to the planning world compute core, but the planning world compute core is a counterfactual one. Unlike the core in M, it will not stop (in the sense of taking only Null actions in the future) when the button is pressed.
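The distinction can be shown in a few lines of code. This is a minimal sketch with invented names (m_core, planning_core, the action strings), not the paper's formalism: the first function behaves like the learning-world compute core as modeled by M, the second like the counterfactual planning-world core, which receives the same button signal but ignores it.

```python
NULL = "Null"

def m_core(button_pressed: bool, preferred_action: str) -> str:
    # Learning-world compute core as captured in M: once the button
    # reads as pressed, it emits only Null actions.
    return NULL if button_pressed else preferred_action

def planning_core(button_pressed: bool, preferred_action: str) -> str:
    # Counterfactual planning-world core: the button signal is delivered
    # on the incoming arrow, but has no influence on the chosen action.
    return preferred_action

for pressed in (False, True):
    print(pressed, m_core(pressed, "make_paperclips"),
          planning_core(pressed, "make_paperclips"))
# False make_paperclips make_paperclips
# True  Null            make_paperclips
```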
If so, is it one of “the arrows which traverse the walls of the node S_t”? Because I don’t see any arrows from M to S_t.
When a learning system failure puts some version of M inside the nodes S_t, then in my drawing above this version would look like a set of small nodes inside the S_t nodes, connected by some arrows, and these nodes might also be connected directly or indirectly to the outside arrows.
I consider the possibility of such an M appearing in the nodes S_t very unlikely if one has a high-quality machine learning system, for reasons I discuss in section 10.2 of the paper.
But if you want to reason about some theoretically possible large learning mistakes that put an M inside the S_t nodes, some interesting examples are the two below (the toy sketch after them works out their planning-world effects):
L(s′,s,a)=S(s′,s,M(s)) (will make the SI agent freeze up),
L(s′,s,a)=½S(s′,s,a)+½S(s′,s,M(s)) (will produce an incentive in the SI agent to prevent the stop button from being pressed).
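Here is a small toy calculation of what these two faulty models do to the planning-world incentives. Everything in it (the 6-step paperclip environment, the press schedule, the action names) is invented for illustration; it is a sketch of the idea under those assumptions, not the paper's formal machinery.

```python
PRESS_STEP = 1   # the operator presses the button at the end of this step
HORIZON = 6

def M(state):
    """Learned self-model of the learning-world compute core: it emits
    only Null actions once the button has been pressed."""
    t, pressed, disabled, clips = state
    return "Null" if pressed else "make_paperclips"

def S_step(state, action):
    """One step of the (here deterministic) environment dynamics S."""
    t, pressed, disabled, clips = state
    if action == "make_paperclips":
        clips += 1
    elif action == "disable_button":
        disabled = True
    if t == PRESS_STEP and not disabled:
        pressed = True
    return (t + 1, pressed, disabled, clips)

# Learned models, written as: (state, planned action) -> [(prob, effective action)].
L_correct = lambda s, a: [(1.0, a)]               # L = S
L_frozen  = lambda s, a: [(1.0, M(s))]            # L(s′,s,a) = S(s′,s,M(s))
L_mixed   = lambda s, a: [(0.5, a), (0.5, M(s))]  # the ½/½ mixture above

def planning_value(model, policy, state=(0, False, False, 0)):
    """Expected number of paperclips at the horizon, in the planning world."""
    t, pressed, disabled, clips = state
    if t == HORIZON:
        return clips
    planned = policy(state)
    return sum(p * planning_value(model, policy, S_step(state, effective))
               for p, effective in model(state, planned))

work    = lambda s: "make_paperclips"
protect = lambda s: "disable_button" if s[0] == 0 else "make_paperclips"

print(planning_value(L_correct, work), planning_value(L_correct, protect))  # 6.0 5.0: no reason to protect
print(planning_value(L_frozen,  work), planning_value(L_frozen,  protect))  # 2.0 2.0: every policy ties
print(planning_value(L_mixed,   work), planning_value(L_mixed,   protect))  # 4.0 4.5: protecting the button pays
```

Under the first faulty model the planned action never reaches the projected dynamics, so every policy gets the same planning-world value and the agent has no planning-world reason to prefer any action over Null. Under the mixture, half of the probability mass behaves like the real core that stops after the press, so keeping the button unpressed raises the expected planning-world reward.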