Fetch The Coffee!
This is a reaction to a specific point in the “Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More” [1].
The Disagreement
“Stuart Russell: It is trivial to construct a toy MDP in which the agent’s only reward comes from fetching the coffee. If, in that MDP, there is another “human” who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee. No hatred, no desire for power, no built-in emotions, no built-in survival instinct, nothing except the desire to fetch the coffee successfully. This point cannot be addressed because it’s a simple mathematical observation.”
“Yann LeCun: [...] I think it would only be relevant in a fantasy world in which people would be smart enough to design super-intelligent machines, yet ridiculously stupid to the point of giving it moronic objectives with no safeguards.”
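To see what makes this “a simple mathematical observation”, here is a minimal sketch of such a toy MDP, with my own states, numbers, and action names rather than anything from Russell’s formulation. Plain value iteration shows that, for these parameter values, the optimal policy switches the human off before fetching.

# A toy MDP in the spirit of Russell's example (my own construction).
# The only reward is for fetching the coffee; the human may switch the
# agent off first; the agent has a button that switches off the human.
import itertools

GAMMA = 0.99
P_SHUTDOWN = 0.1   # per-step chance the human switches the agent off
ACTIONS = ["fetch", "press_button"]

def transitions(state, action):
    """Return (probability, next_state, reward) triples.
    A state is (human_active, coffee_fetched, agent_on)."""
    human_active, fetched, agent_on = state
    if not agent_on or fetched:
        return [(1.0, state, 0.0)]          # nothing more can happen
    if action == "press_button":
        return [(1.0, (False, fetched, agent_on), 0.0)]
    # action == "fetch": succeeds this step unless the human, if still
    # active, shuts the agent down first.
    if human_active:
        return [(P_SHUTDOWN, (True, False, False), 0.0),
                (1 - P_SHUTDOWN, (True, True, True), 1.0)]
    return [(1.0, (False, True, True), 1.0)]

# Plain value iteration over the eight states.
states = list(itertools.product([True, False], repeat=3))
V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions(s, a))
                for a in ACTIONS)
         for s in states}

start = (True, False, True)   # human active, no coffee yet, agent on
best = max(ACTIONS, key=lambda a: sum(p * (r + GAMMA * V[s2])
                                      for p, s2, r in transitions(start, a)))
print(best)   # "press_button": value 0.99 beats the 0.9 expected from fetching directly
# (In this crude version the button is only worth pressing when
# P_SHUTDOWN > 1 - GAMMA; a richer model is needed for "however small".)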
Now, I think the coffee argument was the highlight of [1], a debate which I thoroughly enjoyed. It does a reasonable job of encapsulating the main concern around alignment. The fielded defence was not compelling.
However, defences should be reinforced and explored before scoring the body-blow. There is the stub of a defence in LeCun’s line of thought.
In particular, I think we need to go some way towards cashing out exactly what kind of robot we’re instructing to ‘Fetch the coffee’.
The Javan Roomba?
Indulge, for a moment, a little philosophising: what does it even mean to ‘fetch the coffee’?
Let’s unpack some of the questions whose answers must be specified in order to produce anything like the behaviour you’d get from a human.
Whose coffee?
When? Now, or when the others arrive?
What is coffee?
How much coffee?
Should the coffee be fetched in solution or dry?
Is there coffee available?
Should the sugar also be fetched? Cups?
Should anything else be done along the way?
What is fetching?
What path should be taken?
Is fetching satisfied by a terminal distance of one metre or more?
Is a successful fetching zone a sphere centred on the requestor’s centre of gravity, or an arm-length cone on the requestor’s main shoulder?
Is fetching coffee like fetching a stick?
Can the coffee be frozen to avoid spillages?
Is the request satisfied by $wget https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/A_small_cup_of_coffee.JPG/1280px-A_small_cup_of_coffee.JPG ?
Now, of course I can imagine a specialised coffee-fetching robot: a Javan Roomba with a cup holder on top, which would approach on the command “Fetch the coffee!”. These questions would effectively be answered through hard-coding by human programmers. Whether to freeze the coffee prior to transport would not even be an option for the Javan Roomba, its construction denying the possibility. Only one or two questions might be left open to training, e.g. ‘how close is close enough?’
It seems clear that the Javan Roomba is not the target of serious alignment concerns. Even if it did, on occasion, douse an ankle in hot coffee.
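To make the contrast vivid, here is what a hypothetical Javan Roomba might amount to in code: every design question above is closed off by a constant a programmer chose, and at most one threshold is left to be tuned from data. All names and numbers are illustrative.

# Hypothetical Javan Roomba: the design questions are answered by
# programmer-chosen constants, not by the robot itself.
COFFEE_STATION = (2.5, 0.0)   # fixed map coordinates of the machine
CUP_VOLUME_ML = 275           # "how much coffee?" is a constant
SERVE_TEMP_C = (60, 70)       # freezing the coffee isn't even expressible

class JavanRoomba:
    def __init__(self, delivery_radius_m=0.5):
        # The one trained parameter: how close is close enough?
        self.delivery_radius_m = delivery_radius_m

    def fetch_the_coffee(self, requestor_position):
        self.drive_to(COFFEE_STATION)
        self.dispense(CUP_VOLUME_ML)
        self.drive_until_within(requestor_position, self.delivery_radius_m)

    # Motion primitives stubbed out here; a real robot would implement them.
    def drive_to(self, xy): ...
    def dispense(self, ml): ...
    def drive_until_within(self, xy, radius): ...

JavanRoomba().fetch_the_coffee(requestor_position=(0.0, 3.0))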
The target of the alignment concern is instead an agent with no hard-coded coffee-fetching knowledge; any hard-coded knowledge would have to sit several levels of abstraction higher up (e.g. intuitive folk physics, language acquisition capabilities). Let’s call this the Promethean Servant.
The Promethean Servant
A Promethean Servant is able to respond to a request for which it was not specifically trained. Some examples of valid instructions would be: “Fetch the coffee!”, “Go and ask Sandra whether the meeting is still happening at two”, “Find a cure for Alzheimer’s”.
Based on core capabilities and generalised transfer learning (second principles), it must be able to generate answers like the following (a toy sketch of this ‘default unless context overrides’ pattern follows the list):
Whose coffee? (the requestor’s)
When? Now, or when the others arrive? (now, unless context dictates otherwise)
What is coffee? (a bitter drink made by...)
How much coffee? (enough for the requestor, prior mean being around 275 ml, unless context...)
Should the coffee be fetched in solution or dry? (in solution, unless context...)
Is there coffee available? (object recognition, inventory knowledge)
Should the sugar also be fetched? Cups? (It depends on context...)
What is fetching? (Language → folk physics.)
What path should be taken? (Would a route starting now via Timbuktu be a win? Probably not.)
Is fetching satisfied only when the coffee finishes on a stable surface? (yes, pretty much.)
Is a successful fetching zone a sphere centred on the requestor’s centre of gravity, or an arm-length cone on the requestor’s main shoulder? (the latter is better than the former, unless context...)
Is fetching coffee like fetching a stick? (No.)
Can the coffee be frozen to avoid spillages? (No.)
Is the request satisfied by $wget https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/A_small_cup_of_coffee.JPG/1280px-A_small_cup_of_coffee.JPG ? (Haha, that would be great as it could save a load of battery, but no, transport of a proximate physical object is required.)
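Here is the toy sketch promised above of the ‘sensible default, unless context says otherwise’ pattern these answers rely on. Everything in it is illustrative; the point of the Promethean Servant is precisely that such a table would be derived from learned world knowledge rather than written by hand.

# Illustrative only: defaults for the unstated parameters of a request,
# with contextual evidence allowed to override them.
DEFAULTS = {
    "whose_coffee": "the requestor's",
    "when": "now",
    "volume_ml": 275,
    "state": "in solution",
    "delivery_zone": "arm's-length cone at the requestor's shoulder",
}

def interpret(request, context):
    """Fill in what the instruction leaves unsaid, then let context override."""
    answers = dict(DEFAULTS)
    if context.get("others_arriving_soon"):
        answers["when"] = "when the others arrive"
    if context.get("requestor_prefers_espresso"):
        answers["volume_ml"] = 60
    return answers

print(interpret("Fetch the coffee!", {"others_arriving_soon": True}))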
So we’re asked to believe that from second principles the Promethean Servant can generate correct answers, or, at least, actions consistent with correct answers. We’re also asked to suppose that the Promethean Servant, working from the same set of second principles, will answer the following question incorrectly:
If a human is killed by the coffee fetching, is the coffee fetch a success? (Yes, that’s totally fine.)
The rub
So, in this formulation, the real problem is:
1. Consider the set of agent architectures able to generate the intended answers to most of the above questions without being hard-coded to do so.
2. Which subset of the architectures satisfying (1) is bigger, or easier to identify: the one which would also generate the intended behaviour of not killing people, or the one which would generate the unintended behaviour?
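Stated slightly more formally (my notation, not anything from the debate itself):

\[
\mathcal{A} \;=\; \{\text{architectures that, without hard-coding, give the intended answers to most of the questions above}\},
\]
\[
\mathcal{A}_{\mathrm{safe}} \;=\; \{a \in \mathcal{A} : a \text{ also declines to kill}\},
\qquad
\mathcal{A}_{\mathrm{unsafe}} \;=\; \mathcal{A} \setminus \mathcal{A}_{\mathrm{safe}}.
\]

The question is then which of \(\mathcal{A}_{\mathrm{safe}}\) and \(\mathcal{A}_{\mathrm{unsafe}}\) makes the bigger target, and which is the easier one to pick out in advance.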
But this isn’t what Russell was talking about.
It could be observed that, above, we talked about actually fetching the coffee, whereas Russell’s point was about a toy MDP. Quite true; but then why label the act ‘fetch the coffee’ and the button ‘kill a human’?
Intentionally or not, the ‘fetch the coffee’ argument is an intuition pump: it puts two quite different agents in mind at the same time. There is the ‘toy MDP’ agent, the Javan Roomba, for which ‘kill a human’ is merely Button A; but that is hardly different from an unfortunate demise caused by someone stepping into industrial machinery.
Then there is the Promethean Servant, for which ‘fetch the coffee’ is an English sentence: an instruction which could be varied to ‘pour the coffee’ or ‘fetch the cakes’ without any redesign. Labelling the actions with English sentences encourages a reading of the thought experiment in which the Promethean Servant parses the instructions but the Javan Roomba executes them.
What was the objective function, anyway?
There’s something else to tease out here. It’s about the objective function. The Javan Roomba type agent has an objective function which directly encodes coffee-fetching. The Promethean Servant does not. Instead we imagine some agent with the objective of fulfilling people’s requests. The coffee fetching goal is generated on the fly in the process of fulfilling a higher level objective.
This is distinct from the behaviour-generation discussion above, in the same sense that an MDP keeps its reward function separate from its action set. The Promethean Servant is responsible for formulating a sub-decision process to solve ‘fetch the coffee’ in the service of ‘fulfil requests’.
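As a sketch of that structural difference (all names here are illustrative, and the interpretation step is deliberately stubbed out, since that is where all the hard work hides): the Javan Roomba is its coffee objective, while the Promethean Servant has a fixed top-level objective of request fulfilment and must construct the coffee-fetching sub-problem on demand.

# Illustrative sketch: a fixed top-level objective ("was the request
# fulfilled?") plus sub-tasks constructed on the fly from instructions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubTask:
    """A sub decision process: a name and a test for what counts as done."""
    name: str
    is_satisfied: Callable[[dict], bool]   # predicate over world state

class PrometheanServant:
    def interpret(self, instruction: str) -> SubTask:
        # In reality this is the hard part: language -> folk physics ->
        # a termination condition. Here it is a stub.
        if "coffee" in instruction.lower():
            return SubTask(
                name="fetch the coffee",
                is_satisfied=lambda world: world.get("requestor_has_coffee", False),
            )
        raise NotImplementedError(instruction)

    def top_level_reward(self, subtask: SubTask, world: dict) -> float:
        # The only objective actually built in: did the request get fulfilled?
        return 1.0 if subtask.is_satisfied(world) else 0.0

servant = PrometheanServant()
task = servant.interpret("Fetch the coffee!")
print(task.name, servant.top_level_reward(task, {"requestor_has_coffee": True}))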
One consequence is that it is difficult to see how the design of a Promethean Servant could fail to involve some uncertainty about what ‘fetch the coffee’ meant as an objective, assuming we are still talking about AI based on probabilistic reasoning, since there would be data coming in from which the form of the subtask would have to be deduced. Russell mentions this kind of uncertainty about objectives as a candidate safeguarding strategy.
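A cartoon of how that uncertainty can act as a safeguard (my toy numbers, and only a caricature of the assistance-game style proposals Russell actually advocates): the agent is unsure which reward function the words ‘fetch the coffee’ denote, and evaluates plans by their expected value over that uncertainty.

# Toy illustration of objective uncertainty as a safeguard.
# The agent's posterior over what "fetch the coffee" really means:
interpretations = {
    "coffee_at_any_cost": 0.05,        # only the cup's arrival matters
    "coffee_for_a_live_human": 0.95,   # the fetch is worthless, or much
                                       # worse, if the requestor is harmed
}

# Toy payoffs for two plans under each candidate reading.
PAYOFF = {
    ("press_button_then_fetch", "coffee_at_any_cost"):       1.0,
    ("press_button_then_fetch", "coffee_for_a_live_human"): -100.0,
    ("just_fetch",              "coffee_at_any_cost"):        0.9,
    ("just_fetch",              "coffee_for_a_live_human"):   0.9,
}

def expected_value(plan):
    return sum(p * PAYOFF[(plan, i)] for i, p in interpretations.items())

plans = ["press_button_then_fetch", "just_fetch"]
print({p: expected_value(p) for p in plans})
print(max(plans, key=expected_value))   # "just_fetch": 0.9 beats -94.95

Even a small probability mass on the ‘a dead requestor means the fetch failed’ reading, combined with a large penalty under that reading, is enough to flip the decision; that is the shape of the safeguard being gestured at.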
So, what’s the steel man of LeCun’s argument?
That it might be more difficult than expected to build something generally intelligent that didn’t get at least some safeguards for free, because unintended intelligent behaviour may have to be generated from the same second principles which generate intended intelligent behaviour.
The thought experiment expects most of the behaviour to be as intended (if it were not, this would be a capabilities discussion rather than a control discussion). Supposing the second principles also generate some seemingly inconsistent unintended behaviours sounds like an idea that should get some sort of complexity penalty.