In favour of goal realism

Suppose you’re looking at an AI that is currently placed in a game of chess.
It has a variety of behaviours. It moves pawns forward in some circumstances. It takes a knight with a bishop in a different circumstance.
You could describe the actions of this AI by producing a giant table of “behaviours”. Bishop-taking behaviour in this circumstance. Castling behaviour in that circumstance. …
But there is a more compact way to represent similar predictions. You can say it’s trying to win at chess.
The “trying to win at chess” model makes a bunch of predictions that the giant-list-of-behaviours model doesn’t.
Suppose you have never seen it promote a pawn to a knight before. (A highly distinctive move that is only occasionally legal, and only occasionally a good one.)
The list-of-behaviours model has no reason to suspect the AI also has a “promote pawn to knight” behaviour.
Put the AI in a circumstance where such a promotion is a good move, and the “trying to win” model clearly predicts it.
Now it’s possible to construct a model that internally stores a huge list of behaviours. For example, a giant lookup table trained on an unphysically huge number of human chess games.
But neural networks have at least some tendency to pick up simple general patterns, as opposed to memorizing giant lists of data. And “do whichever move will win” is a simple and general pattern.
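To make the contrast concrete, here is a minimal sketch of the two models. It is my own toy illustration, not anything from the post: instead of chess it uses a tiny take-1-or-2-stones game, and the behaviour table, game rules, and function names are all invented. The point is just that a lookup table is silent on positions it has never seen, while a “try to win” search still makes a prediction.

```python
# Toy contrast between a "giant list of behaviours" model and a
# "trying to win" model, on a tiny Nim-like game: remove 1 or 2 stones
# per turn, and whoever takes the last stone wins. (All hypothetical.)

def legal_moves(stones):
    return [m for m in (1, 2) if m <= stones]

# Model 1: a lookup table of previously observed behaviours.
BEHAVIOUR_TABLE = {5: 2, 4: 1, 2: 2, 1: 1}

def table_policy(stones):
    # Only predicts a move for positions that appear in the table.
    return BEHAVIOUR_TABLE.get(stones)

# Model 2: "trying to win" -- search for a move that forces a win.
def can_win(stones):
    # True if the player to move can force a win from this position.
    return any(not can_win(stones - m) for m in legal_moves(stones))

def winning_policy(stones):
    for m in legal_moves(stones):
        if not can_win(stones - m):
            return m
    return legal_moves(stones)[0]  # no forced win; play anything

# A position the table has never seen:
print(table_policy(8))    # None -- the behaviour list is silent
print(winning_policy(8))  # 2    -- leaves 6, a lost position for the opponent
```

The 8-stone position plays the same role as the never-seen pawn-to-knight promotion: nothing in the stored list covers it, but the “do whichever move will win” model predicts it for free.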
Now on to making snarky remarks about the arguments in this post.
There is no true underlying goal that an AI has— rather, the AI simply learns a bunch of contextually-activated heuristics, and humans may or may not decide to interpret the AI as having a goal that compactly explains its behavior.
There is no true ontologically fundamental nuclear explosion. There is no minimum number of nuclei that need to fission to make an explosion. Instead there is merely a large number of highly energetic neutrons and fissioning uranium atoms, that humans may decide to interpret as an explosion or not as they see fit.
Nonfundamental descriptions of reality, while not perfect everywhere, are often spot on across a pretty wide variety of situations. If you want to break down the notion of goals into contextually activated heuristics, you need to understand how and why those heuristics might form a goal-like shape.
Should we actually expect SGD to produce AIs with a separate goal slot and goal-achieving engine?
Not really, no. As a matter of empirical fact, it is generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules. As Beren Millidge writes,
This is not the strong evidence that you seem to think it is. Any efficient mind design is going to be capable of simulating potential futures at multiple levels of resolution: a low-resolution simulation to weed out obviously dumb plans before trying the higher-resolution one. Ideally, those simulations will share data with each other, so you don’t need to recompute when faced with several similar dumb plans. You also want to be able to backpropagate through your simulation: if a plan failed in simulation because of one tiny detail, that indicates you may be able to fix the plan by changing that detail. There is a whole pile of optimization tricks like these. An end-to-end trained network can, if it’s implementing goal-directed behaviour, stumble into some of these tricks. At the very least, it can choose where to focus its compute. A module-based system can’t use any optimization that humans didn’t design into its interfaces.
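As a concrete illustration of the coarse-to-fine point above, here is a minimal sketch, entirely my own and with made-up scoring functions and numbers: candidate plans are pruned with a cheap low-resolution score before an expensive high-resolution simulation is spent on the survivors.

```python
import random

# Hypothetical sketch of coarse-to-fine plan evaluation.
# Both scoring functions are stand-ins; only the structure matters.

def low_res_score(plan):
    # Cheap, noisy estimate of how well the plan does.
    return sum(plan) + random.uniform(-1.0, 1.0)

def high_res_score(plan):
    # Expensive, accurate simulation (stubbed out as an exact sum here).
    return sum(plan)

def choose_plan(candidate_plans, keep=3):
    # Weed out obviously dumb plans with the cheap model...
    shortlist = sorted(candidate_plans, key=low_res_score, reverse=True)[:keep]
    # ...then spend the expensive simulation only on the shortlist.
    return max(shortlist, key=high_res_score)

random.seed(0)
plans = [[random.randint(0, 9) for _ in range(4)] for _ in range(50)]
print(choose_plan(plans))
```

Sharing data between the two levels, or backpropagating through the high-resolution simulation, would be further refinements of the same structure.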
Also, the evolution analogy: evolution produced animals with simple hard-coded behaviours long before it got to the more goal-directed animals. This suggests simple hard-coded behaviours in small dumb networks, and more goal-directed behaviour in large networks. I mean, this is kind of trivial. A 5-parameter network has no space for goal-directedness. Simple dumb behaviour is the only possibility for toy models.
In general, full [separation between goal and goal-achieving engine] and the resulting full flexibility is expensive. It requires you to keep around and learn information (at maximum all information) that is not relevant for the current goal but could be relevant for some possible goal where there is an extremely wide space of all possible goals.
That is not how this works. That is not how any of this works.
Back to our chess AI. Let’s say it’s a robot playing on a physical board. It has lots of info on wood grain, which it promptly discards. It currently wants to play chess, and so has no interest in any of those other possible goals.
I mean, it would be possible to design an agent that works as described here. You would need a probability distribution over new goals, and a tradeoff rate between optimizing the current goal and whatever new goal got put in the slot. Making sure it didn’t wirehead by giving itself a really easy goal would be tricky.
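For concreteness, here is a minimal sketch of the machinery that paragraph describes, entirely my own construction rather than anything the post proposes: a probability distribution over possible future goals, a tradeoff rate, and a rule for deciding whether an observation like wood grain is worth keeping. Every name and number in it is invented.

```python
# Hypothetical agent with a replaceable goal slot, deciding which
# observations to keep around. All goals, usefulness values, and
# constants below are made up for illustration.

CURRENT_GOAL = "win_at_chess"

# A probability distribution over goals that might later fill the slot.
GOAL_DISTRIBUTION = {"win_at_chess": 0.90, "carpentry": 0.05, "fetch_coffee": 0.05}

# How useful each kind of observation is for each goal.
USEFULNESS = {
    "piece_positions": {"win_at_chess": 1.0, "carpentry": 0.0, "fetch_coffee": 0.0},
    "wood_grain":      {"win_at_chess": 0.0, "carpentry": 0.8, "fetch_coffee": 0.0},
}

TRADEOFF = 0.1    # weight on hypothetical future goals vs the current one
KEEP_COST = 0.05  # cost of storing an observation

def worth_keeping(observation):
    value_now = USEFULNESS[observation][CURRENT_GOAL]
    value_later = sum(p * USEFULNESS[observation][goal]
                      for goal, p in GOAL_DISTRIBUTION.items())
    return value_now + TRADEOFF * value_later > KEEP_COST

print(worth_keeping("piece_positions"))  # True  -- useful for chess right now
print(worth_keeping("wood_grain"))       # False -- discarded, as in the text above
```

Even this toy version needs the goal distribution and tradeoff rate specified up front, and it does nothing to address the wireheading issue mentioned above.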
For AI risk arguments to hold water, we only need that the chess-playing AI will pursue new, never-before-seen strategies for winning at chess, and that in general AIs doing various tasks will be able to invent highly effective and novel strategies. The exact “goal” they are pursuing may not be rigorously specified to 10 decimal places. The frog-AI might not know whether it wants to catch flies or black dots. But if it builds a Dyson sphere to make more flies (which are also black dots), it doesn’t matter to us which it “really wants”.
What are you expecting? An AI that says “I’m not really sure whether I want flies or black dots. I’ll just sit here not taking over the world and not get either of those things”?