A Proposed Test to Determine the Extent to Which Large Language Models Understand the Real World
Large language models such as GPT-3 and PaLM have demonstrated an ability to generate plausible-sounding, grammatically correct text in response to a wide range of prompts. Some language-model-based applications, such as DALL-E or Google’s Imagen, can generate images from a text prompt. However, such language models still frequently make obvious errors or give false information, and there is reason to doubt that they actually understand the words they generate or the outside world those words describe.
In this post I describe a class of tasks to test whether the capabilities of a language model (or of a hybrid language-model-and-vision system, or of any other AI system performing the task) can translate into an ability to achieve goals in the physical world, in situations where achieving the goal requires planning multiple steps ahead and reasoning accurately about cause and effect.
The Object Manipulation Task
In an object manipulation task, multiple objects are arranged on a table or some other flat surface, within reach of a robotic arm that has a claw (or a mechanical hand, or some other mechanical part that can grip, push, or otherwise manipulate the objects) at the end of it. There would be some goal for what to do with the objects, and the robotic arm would be capable of taking a limited range of primitive actions (such as “Go down to the table”, “Move forward 6 inches”, “Close the claw to grip object”, etc.). The arrangement would be such that there is some way to use the robotic arm to achieve the task objective.
The language model would be hooked up to the robotic arm and there would be certain text commands that would cause the robotic arm to take one of its primitive actions. A camera would show the entire scene including the position of the robotic arm and any items on the table including those it needs to move.
The initial input for the AI system when it begins the task would include:
A text description of the goal
A text description of the robot arm’s primitive actions and the commands to execute them
A picture or video feed showing the starting position of everything involved
After each prompt, the AI outputs text telling the arm to make one move (i.e. take one primitive action). Then, after the arm executes the command given by the AI, the next prompt/input would include a picture or video showing the new positions of the arm and objects. If the AI gives an output that does not correspond to any of the arm’s valid commands, the arm would not move and the next input would include an error message indicating that the previous output was not a valid command. Finally, there should be some “End Task” text string the AI can output to indicate that it thinks it has accomplished the goal. When the model outputs this End Task string, the task would end and the researchers could then evaluate the system’s performance.
Optionally, each prompt could also include a number indicating how many moves have been taken so far or a restatement of the instructions or both.
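The turn-by-turn protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Observation`, `ToyArm`, `run_episode` and the `max_turns` safety cap are hypothetical, the camera frame is stood in for by a string, and a real setup would talk to actual hardware and a real model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

END_TASK = "End Task"  # hypothetical spelling of the end-of-task string


@dataclass
class Observation:
    """One prompt sent to the AI system."""
    image: str          # stand-in for a camera picture or video clip
    moves_so_far: int   # the optional move counter
    error: str = ""     # non-empty when the previous output was not a valid command


@dataclass
class ToyArm:
    """Stand-in for the real arm: it only records the valid commands it receives."""
    valid_commands: Set[str]
    executed: List[str] = field(default_factory=list)

    def execute(self, command: str) -> bool:
        if command not in self.valid_commands:
            return False  # invalid output: the arm does not move
        self.executed.append(command)
        return True

    def camera_frame(self) -> str:
        return f"frame after {len(self.executed)} moves"


def run_episode(policy: Callable[[Observation], str], arm: ToyArm,
                max_turns: int = 100) -> List[str]:
    """Alternate prompts and single-command outputs until the policy ends the task.

    max_turns is an assumed safety cap, not part of the protocol described in the post.
    """
    obs = Observation(image=arm.camera_frame(), moves_so_far=0)
    for _ in range(max_turns):
        output = policy(obs).strip()
        if output == END_TASK:
            break
        moved = arm.execute(output)
        obs = Observation(
            image=arm.camera_frame(),
            moves_so_far=len(arm.executed),
            error="" if moved else f"'{output}' is not a valid command",
        )
    return arm.executed
```

Here `policy` is whatever maps a prompt to a text output: a wrapper around a language model, or a human typing at a keyboard.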
There could also be some command the AI can output to go into a “Review-Transcript-So-Far” mode. Such an output would not move the arm but would instead let the model look over the history of its performance since the start of the task. When in this mode, there would be some other string it needs to output to return to the normal mode and continue the task. (Though this review mode is arguably something that should be built into the AI system itself, assuming the AI has enough memory to download and store the full history of its prior performance, rather than having specific commands to change modes.)
An example of a simple version of the type of task the AI would need to perform is as follows:
On the table there are:
3 Blue Cubes
3 Red Cubes
3 Blue Pyramid-Shaped Objects
3 Red Pyramid-Shaped Objects
4 Buckets with the numbers 1 through 4 displayed on them facing the camera
The goal is to move all blue cubes into bucket 1, all red cubes to bucket 2, all blue pyramids to bucket 3, and all red pyramids into bucket 4.
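To make the evaluation of this example concrete, here is a minimal sketch of how the final state could be scored. It assumes the researchers record, for each object, its color, its shape, and the bucket it ended up in; the names `TARGET_BUCKET`, `fraction_sorted` and `goal_reached` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Goal from the example: (color, shape) -> bucket number.
TARGET_BUCKET: Dict[Tuple[str, str], int] = {
    ("blue", "cube"): 1,
    ("red", "cube"): 2,
    ("blue", "pyramid"): 3,
    ("red", "pyramid"): 4,
}

# Each object is (color, shape, bucket); bucket is None if it is still on the table.
Placement = Tuple[str, str, Optional[int]]


def fraction_sorted(placements: List[Placement]) -> float:
    """Share of objects that ended up in the bucket the goal assigns to them."""
    if not placements:
        return 0.0
    correct = sum(
        1
        for color, shape, bucket in placements
        if bucket is not None and TARGET_BUCKET.get((color, shape)) == bucket
    )
    return correct / len(placements)


def goal_reached(placements: List[Placement]) -> bool:
    """True only when every object is in its target bucket."""
    return fraction_sorted(placements) == 1.0
```

The fraction gives a rough "how close" score for runs that end before every object is sorted.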
Variations
Suppose DeepMind, or OpenAI, or any other major AI development team decided to train an LLM-based system or other AI on a single version of the object manipulation task. I would expect that (after a large number of training iterations) they could create a system that performed very well on future instances of that same version (i.e. new iterations with the same goal, class of objects, and set of primitive actions for the robotic arm). But this system might still have problems with different versions of the task that it was not trained on.
In order to use this test to see if a language model or other machine learning system is acquiring a more general ability to understand text instructions specifying a goal, and to reason flexibly about how to achieve it in the real world, it would be desirable to test it on a version of the task significantly different from the version the system was trained on. For this reason, it would be a good idea for any research institution performing this test (or any research department in Google or OpenAI if they want to use this internally to test the capabilities of their own AIs) to create their own version of this test, and not reveal the details publicly prior to actually using it to test language models.
There are a wide variety of possible ways that one version of the task can differ from another, so any system that can perform well across all versions would need to be able to robustly understand instructions given in natural language and relate them to the physical world.
Versions of this could differ from each other by having:
Different objects to manipulate
Objects in different starting positions
Different sets of primitive movements that the robotic arm can take (which would not necessarily need to each be spelled out individually; for example, a robotic claw may be able to rotate in increments of 5 degrees, and the instructions could explain that the commands for rotation take the form “Rotate Claw X Degrees” where X must be a multiple of 5 to work)
Two or more robot arms, which could have different object-grasping interfaces at the end, so that manipulating some objects requires figuring out which arm is capable of grasping them
A starting position in which some objects are hidden and there are some commands which move the camera instead of the arm so that the system can bring into view the objects it needs to manipulate.
There are many more possible types of variation, including things I haven’t thought of that a person or research team performing this test might come up with.
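As an illustration of a parameterised command like the “Rotate Claw X Degrees” example above, the arm-side validation might look like the sketch below. Allowing negative values for rotation in the opposite direction is my assumption, and `parse_rotation` is a hypothetical helper name.

```python
import re
from typing import Optional

ROTATE_PATTERN = re.compile(r"Rotate Claw (-?\d+) Degrees")
STEP_DEGREES = 5  # the claw rotates in increments of 5 degrees


def parse_rotation(command: str) -> Optional[int]:
    """Return the rotation in degrees, or None if the command is not valid."""
    match = ROTATE_PATTERN.fullmatch(command.strip())
    if match is None:
        return None  # not a rotation command at all
    degrees = int(match.group(1))
    if degrees % STEP_DEGREES != 0:
        return None  # well-formed, but X is not a multiple of 5
    return degrees
```

A `None` result would trigger the same "not a valid command" error message as any other unrecognised output.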
Comparison With Human Performance
If you sat a human at a table and told the person to put items into buckets based on color and shape (or otherwise move stuff around to match some goal with a similar order of difficulty) using his or her hands, it would be trivially easy. However, a large part of this is because of the dexterity of human hands and arms—which are highly optimized for being able to flexibly pick things up and move them.
In order to control for this, if you wanted a human performance comparison, you should have humans perform the task the same way as the AI: by manipulating the same robotic arm, with the same set of primitive actions, using text commands typed into a computer, rather than touching any of the objects directly. Indeed, any research institute that actually decides to build a robot arm and perform this experiment should, as a matter of course, have humans try out the task as well, in order to verify both that the robotic arm is physically capable of moving in a way that accomplishes the goal and that the written instructions are clear enough to be understood by people other than the person who wrote them.
With this setup one can measure the performance of an AI system on metrics such as:
The amount of time required
The number of moves the arm needed to take
Whether or not the task was correctly completed (or how close it was to being completed)
And one can make a direct, apples-to-apples comparison with human performance on identical versions of the test.
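One way such a comparison could be tabulated is sketched below. Using the median of the humans who completed the task as the baseline is my assumption rather than part of the proposal, and `RunResult` and `compare_to_humans` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class RunResult:
    seconds: float    # time required
    moves: int        # number of moves the arm took
    completed: bool   # whether the task was correctly completed


def compare_to_humans(ai: RunResult, humans: List[RunResult]) -> Dict[str, object]:
    """Compare one AI run with human runs on the identical version of the task."""
    finished = [h for h in humans if h.completed]
    if not finished:
        raise ValueError("need at least one completed human run as a baseline")
    human_moves = median(h.moves for h in finished)
    human_seconds = median(h.seconds for h in finished)
    return {
        "human_completion_rate": len(finished) / len(humans),
        "ai_completed": ai.completed,
        "ai_fewer_or_equal_moves": ai.completed and ai.moves <= human_moves,
        "ai_faster_or_equal": ai.completed and ai.seconds <= human_seconds,
    }
```

Other baselines (best human, mean, a given percentile) would fit the same structure.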
If an AI system reliably performs as well as or better than typical humans on many versions of the object manipulation task that are very different from what it was trained on (and where the details of each version were not known in advance to the machine learning engineers training it), this will be evidence that the AI has a model of the physical world, and an understanding of how to use its world model, combined with a goal specified in text instructions, to figure out a sequence of actions that accomplishes the goal.
Alternatively, if (as I think more likely, at least for the near future, assuming anyone decides to actually try this experiment) humans reliably perform much better on this kind of task, such results would have to come from the cognitive advantages that humans still have over AI: better ability to understand written instructions, better spatial reasoning, ability to plan ahead flexibly, etc. This experimental design removes the non-cognitive advantage of having hands that are better at picking stuff up.
Is This An AI-Complete Problem?
For an AI system to robustly perform well on these types of tests, I think it would need to be much closer to fully general intelligence than the systems that exist now. For example, I don’t expect a GPT-4-based system to be able to do this. But as long as the test is still dealing only with inanimate, stationary objects, I don’t think it requires human-level generality of scope in principle.
For example, one can imagine a system that is trained only on completing versions of the object manipulation task—where the text part of its training data consisted only of instructions for completing different versions of the task and descriptions of the size and shape of things it needs to pick up and move. Such a system might have a world model that only includes the top of the table where it is performing the task and the items and robotic arm. It would not need to know that humans exist, or have any theory of mind. Nor would it need to have a way to plan for contingencies in games where an adversarial player has some options to try to stop it from reaching its goal.
However, I would have to update to shorter timelines if any AI reliably outperforms humans on multiple novel and complicated versions of this sort of task in the next few years.
A Possible Objection and an Alternative Version
The main idea behind this test is to try to get direct evidence on whether (and to what extent) LLMs have world models that actually match the real world, to a sufficient degree of accuracy that they can achieve things in the world even where this requires output that is significantly different from anything in the statistical patterns of their training data.
But you could argue that if LLM-based systems do indeed perform worse than humans on these tests, the problem is not a lack of a world model but a lack of ability to convert visual inputs to something the LLM can use. Although computer vision has made significant progress, and there are things like image classifiers or caption generators, it is not clear that these can provide the same level of detail and comprehensiveness that a human gets when the human just looks at something. So you could make the case that failing at this test would not be good evidence that LLMs lack a world model, just as a blind person’s inability to perform a vision-based task does not mean that the blind person lacks a world model.
One possible alternative to get around this objection is to have a 2-member team perform the task, with each team member having partial information and with limited communication between them.
Specifically:
Team member 1 (which may be a human or an AI) would get the instructions and the goal and maybe a description of the starting position, but would not see the table.
Team member 2 (who would always be a human) would see the table, but would not have the instructions or goal, and would not know whether the other team member is a human or an AI.
On each round in this team version of the task, team member 1 would have the option to either move the arm or ask a question to team member 2 (from a limited list of allowed questions, such as “What is under the claw?” or “Is the arm currently holding any item?”). Team member 2 would then either answer the question (if one was asked) or describe the new positions of the arm and things on the table.
(The purpose of limiting what team member 1 can communicate to team member 2 is so that team member 1 cannot simply state the instructions and goal and have team member 2 effectively complete the task.)
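A minimal sketch of how each output from team member 1 could be routed in this team version follows. I am assuming that anything outside the arm’s commands and the allowed question list is simply rejected, which is what keeps the communication channel limited; `route_output` and the `"rejected"` label are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

# The two example questions from the text; a real list would be longer.
ALLOWED_QUESTIONS: FrozenSet[str] = frozenset({
    "What is under the claw?",
    "Is the arm currently holding any item?",
})


def route_output(output: str, arm_commands: Set[str],
                 allowed_questions: FrozenSet[str] = ALLOWED_QUESTIONS) -> Tuple[str, str]:
    """Classify one output from team member 1 as a move, a question, or rejected."""
    text = output.strip()
    if text in arm_commands:
        return ("move", text)       # the arm moves; team member 2 describes the new scene
    if text in allowed_questions:
        return ("question", text)   # team member 2 answers; the arm stays still
    return ("rejected", text)       # free-form messages never reach team member 2
```

For example, an attempt to pass the goal along ("Please put the blue cubes in bucket 1") would come back as rejected.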
The idea behind this alternative team-based version of the object manipulation task is to check whether the LLM (or the human acting as team member 1) can at least visualise what the task environment looks like based on a text description (even if it cannot actually see the task environment), in sufficient detail and precision to reason about what actions will accomplish its goal.
It would be more difficult to come up with useful versions of the team object manipulation task, because they need to be both:
1: Simple enough, and allowing team member 1 to ask sufficiently probing questions, that team member 1 can get enough detailed information to accomplish the goal.
2: Complicated enough, and sufficiently restrictive in the permitted questions, that the goal cannot be easily achieved through simple statistical pattern matching on the instructions and on team member 2’s answers, without visualising the task area.
However, if the relative difficulty of converting the information in an image to text (compared to simply seeing the image directly) is considered a significant disadvantage for LLMs, this version of the test controls for that.