Let’s take the very fundamental function of pointing. Every human language is rife with words called deictics that anchor the flow of utterance to specific pieces of the immediate environment. English examples are words like “this”, “that”, “near”, “far”, “soon”, “late”, the positional prepositions, pronominals like “me” and “you”—the meaning of these terms is grounded dynamically by the speakers and hearers in the time and place of utterance, the placement and salience of surrounding objects and structures, and the particular speaker and hearers and overhearers of the utterance. Human pointing—with the fingers, hands, eyes, chin, head tilt, elbow, whatever—has been shown to perform much the same functions as deictic speech in utterance. (See the work of Sotaro Kita if you’re interested in the data). A robot with no mechanism for pointing and no sensory apparatus for detecting the pointing gestures of human agents in its environment will misunderstand a great deal and will not be able to communicate fluently.
Are you really claiming that ability to understand the very concept of indexicality, and concepts like “soon”, “late”, “far”, etc., relies on humanlike fingers? That seems like an extraordinary claim, to put it lightly.
Also:
A robot with no mechanism for pointing and no sensory apparatus for detecting the pointing gestures of human agents in its environment will misunderstand a great deal and will not be able to communicate fluently.
“Detecting pointing gestures” would be the function of a perception algorithm, not a sensory apparatus (unless what you mean is “a robot with no ability to perceive positions/orientations/etc. of objects in its environment”, which… wouldn’t be very useful). So it’s a matter of what we do with sense data, not what sorts of body we have; that is, software, not hardware.
More generally, a lot of what you’re saying (and — this is my very tentative impression — a lot of the ideas of embodied cognition in general) seems to be based on an idea that we might create some general-intelligent AI or robot, but have it start at some “undeveloped” state and then proceed to “learn” or “evolve”, gathering concepts about the world, growing in understanding, until it achieves some desired level of intellectual development. The concern then arises that without the kind of embodiment that we humans enjoy, this AI will not develop the concepts necessary for it to understand us and vice versa.
Ok. But is anyone working in AI these days actually suggesting that this is how we should go about doing things? Is everyone working in AI these days suggesting that? Isn’t this entire line of reasoning inapplicable to whole broad swaths of possible approaches to AI design?
P.S. What does “there, relative to the river” mean?
Are you really claiming that ability to understand the very concept of indexicality, and concepts like “soon”, “late”, “far”, etc., relies on humanlike fingers? That seems like an extraordinary claim, to put it lightly.
Yeah, I am advancing the hypothesis that, in humans, the comprehension of indexicality relies on embodied pointing at its core, though not just with fingers, which are not universally used for pointing in all human cultures. Sotaro Kita has the most data on this subject for language, but the embodied basis of mathematics is discussed in Where Mathematics Comes From, by George Lakoff and Rafael Núñez. Whether all possible minds must rely on such a mechanism, I couldn’t possibly guess. But I am persuaded humans do (a lot of) it with their bodies.
What does “there, relative to the river” mean?
In most European cultures, we use speaker-relative deictics. If I point to the southeast while facing south and say “there”, I mean “generally to my front and left”. But if I turn around and face north, I will point to the northwest and say “there” to mean the same thing, i.e., “generally to my front and left.” The fact that the physical direction of my pointing gesture is different is irrelevant in English; it’s my body position that’s used as a landmark for finding the target of “there”. (Unless I’m pointing at something in particular here and now, of course; in which case the target of the pointing action becomes its own landmark.)
In a number of Native American languages, the pointing is always to a cardinal direction. If the orientation of my body changes when I say “there”, I might point over my shoulder rather than to my front and left. The landmark for finding the target of “there” is a direction relative to the trajectory of the sun.
But many cultures use a dominant feature of the landscape, like the Amazon or the Mississippi or the Nile rivers, or a major mountain range like the Rockies, or a sacred city like Mecca, as the orientation landmark, and in some cultures this gets encoded in the deictics of the language and the conventions for pointing. “Up” might not mean up vertically, but rather “upriver”, while “down” would be “downriver”. In a steep river valley in New Guinea, “down” could mean “toward the river” and “up” could mean “away from the river”. And “here” could mean “at the river” while “there” could mean “not at the river”.
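For readers who think in code, here is a minimal sketch of the three frames of reference described above. Everything in it is an illustrative assumption rather than anything from the linguistics literature: the function names are made up, bearings are taken as degrees clockwise from north, “front and left” is modeled as 45 degrees counterclockwise from the speaker’s heading, and the river is reduced to a single flow direction.

```python
# Assumed convention: compass bearings in degrees, clockwise from north.
NORTH, EAST, SOUTH, WEST = 0.0, 90.0, 180.0, 270.0
FRONT_LEFT = -45.0  # hypothetical body-relative offset for "front and left"


def speaker_relative_bearing(heading: float, body_offset: float) -> float:
    """Speaker-relative frame: the target is fixed relative to the body,
    so its compass bearing changes whenever the speaker turns."""
    return (heading + body_offset) % 360.0


def body_offset_for_absolute(heading: float, target_bearing: float) -> float:
    """Absolute frame: the target is a fixed compass bearing, so the
    gesture relative to the body changes whenever the speaker turns.
    Returns degrees clockwise from the direction the speaker faces."""
    return (target_bearing - heading) % 360.0


def river_relative_bearing(term: str, river_flow_bearing: float) -> float:
    """Landmark frame: 'down' follows the river's flow, 'up' opposes it."""
    if term == "down":
        return river_flow_bearing % 360.0
    if term == "up":
        return (river_flow_bearing + 180.0) % 360.0
    raise ValueError(f"unknown term: {term!r}")


# Speaker-relative: same gesture, different compass direction.
print(speaker_relative_bearing(SOUTH, FRONT_LEFT))  # 135.0, i.e. southeast
print(speaker_relative_bearing(NORTH, FRONT_LEFT))  # 315.0, i.e. northwest

# Absolute: same compass direction (southeast), different gesture.
print(body_offset_for_absolute(SOUTH, 135.0))  # 315.0, i.e. 45 degrees to the left
print(body_offset_for_absolute(NORTH, 135.0))  # 135.0, i.e. back over the right shoulder
```

The only point of the sketch is that the same word needs a different extra input in each frame: the speaker’s heading, a fixed compass bearing, or the local geometry of a landmark.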
The cultural variability and place-specificity of language were not widely known to Western linguists until about ten years ago. For a long time, it was assumed that person-relative orientation was a biological constraint on meaning. This turns out to be not quite accurate. So I guess I should be more nuanced in the way I present the notion of embodied cognition. How’s this: “Embodied action in the world with a cultural twist on top” is the grounding point at the bottom of the symbol expansion for human meanings, linguistic and otherwise.
If the orientation of my body changes when I say “there”, I might point over my shoulder rather than to my front and left.
I was able to follow this explanation (as well as the rest of your post) without seeing your physical body in any way. In addition, I suspect that, while you were typing your paragraph, you weren’t physically pointing at things. The fact that we can do this looks to me like evidence against your main thesis.
I was able to follow this explanation (as well as the rest of your post) without seeing your physical body in any way. … The fact that we can do this looks to me like evidence against your main thesis.
Ah, but you’re assuming that this particular interaction stands on its own. I’ll bet you were able to visualize the described gestures just fine by invoking memories of past interactions with bodies in the world.
Two points. First, I don’t contest the existence of verbal labels that merely refer, or even just register as being invoked without referring at all. As long as some labels are directly grounded to body/world, or refer to other labels that do get grounded in the body/world historically, we generally get by in routine situations. And all cultures have error detection and repair norms for conversation so that we can usually recover without social disaster.
However, the fact that verbal labels can be used without grounding them in the body/world is a problem. It is frequently the case that speakers and hearers alike don’t bother to connect words to reality, and this is a major source of misunderstanding, error, and nonsense. In our own case here and now, we are actually failing to understand each other fully because I can’t show you actual videotapes of what I’m talking about. You are rightly skeptical because words alone aren’t good enough evidence. And that is itself evidence.
Second, humans have a developmental trajectory and history, and memories of that history. We’re a time-binding animal in Korzybski’s terminology. I would suggest that an enculturated adult native speaker of a language will have what amount to “muscle memory” tics that can be invoked as needed to create referents. Mere memory of a motion or a perception is probably sufficient.
“Oh, look, it’s an invisible gesture!” is not at all convincing, I realize, so let me summarize several lines of evidence for it.
Developmentally, there’s quite a lot of research on language acquisition in infants and young children that suggests shared attention management—through indexical pointing, and shared gaze, and physical coercion of the body, and noises that trigger attention shift—is a critical building block for constructing “aboutness” in human language. We also start out with some shared, built-in cries and facial expressions linked to emotional states. At this level of development, communication largely fails unless there is a lot of embodied scaffolding for the interaction, much of it provided by the caregiver but a large part of it provided by the physical context of the interaction. There is also some evidence from the gestural communication of apes that attests to the importance of embodied attention management in communication.
Also, co-speech gesture turns out to be a human universal. Congenitally blind children do it, having never seen gesture by anyone else. Congenitally deaf children who spend time in groups together will invent entire gestural languages complete with formal syntax, as recently happened in Nicaragua. And adults speaking on the telephone will gesture even knowing they cannot be seen. Granted, people gesture in private at a significantly lower rate than they do face-to-face, but the fact that they do it at all is a bit of a puzzle, since the gestures can’t be serving a communicative function in these contexts. Does the gesturing help the speakers actually think, or at least make meaning more clear to themselves? Susan Goldin-Meadow and her colleagues think so.
We also know from video conversation data that adults spontaneously invent new gestures all the time in conversation, then reuse them. Interestingly, though, each reuse becomes more attenuated, simplified, and stylized with repetition. Similar effects are seen in the development of sign languages and in written scripts.
But just how embodied can a label be when gesture (and other embodied experience) is just a memory, and is so internalized that it is externally invisible? This has actually been tested experimentally. The Stroop effect has been known for decades, for example: when the word “red” is presented in blue text, it is read or acted on more slowly than when the word “red” is presented in red text, or in socially neutral black text. That’s on the embodied perception side of things. But more recent psychophysical experiments have demonstrated a similar psychomotor Stroop-like effect: responses are faster when spatial and motion stimulus sentences are semantically congruent with the direction of the required response action, and slower when they are incongruent. This effect holds even for metaphorical words like “give”, which tests as motor-congruent with motion away from oneself, and “take”, which tests as motor-congruent with motion toward oneself.
I understand how counterintuitive this stuff can be when you first encounter it—especially to intelligent folks who work with codes or words or models a great deal. I expect the two of us will never reach a consensus on this without looking at a lot of original data—and who has the time to analyze all the data that exists on all the interesting problems in the world? I’d be pleased if you could just note for future reference that a body of empirical evidence exists for the claim. That’s all.
In our own case here and now, we are actually failing to understand each other fully because I can’t show you actual videotapes of what I’m talking about.
What do you mean by “fully”? I believe I understand you well enough for all practical purposes. I don’t agree with you, but agreement and understanding are two different things.
First, I don’t contest the existence of verbal labels that merely refer, or even just register as being invoked without referring at all.
I’m not sure what you mean by “merely refer”, but keep in mind that we humans are able to communicate concepts which have no physical analogues that would be immediately accessible to our senses. For example, we can talk about things like “O(N)”, or “ribosome”, or “a^n + b^n = c^n”. We can also talk about entirely imaginary worlds, such as the world where Mario, the turtle-crushing plumber, lives. And we can do this without having any “physical context” for the interaction, too.
All that is beside the point, however. In the rest of your post, you bring up a lot of evidence in support of your model of human development. That’s great, but your original claim was that any type of intelligence at all will require a physical body in order to develop; and nothing you’ve said so far is relevant to this claim. True, human intelligence is the only kind we know of so far, but then, at one point birds and insects were the only self-propelled flyers in existence—and that’s not the case anymore.
Furthermore, you also claimed that no simulation, no matter how realistic, will serve to replace the physical world for the purposes of human development, and I’m still not convinced that this is true, either. As I’d said before, we humans do not have perfect senses; if physical coordinates of real objects were snapped to a 0.01mm grid, no human child would ever notice. And in fact, there are plenty of humans who grow up and develop language just fine without the ability to see colors, or to move some of their limbs in order to point at things.
Just to drive the point home: even if I granted all of your arguments regarding humans, you would still need to demonstrate that human intelligence is the only possible kind of intelligence; that growing up in a human body is the only possible way to develop human intelligence; and that no simulation could in principle suffice, and the body must be physical. These are all very strong claims, and so far you have provided no evidence for any of them.
Let me refer you to Computation and Human Experience, by Philip E. Agre, and to Understanding Computers and Cognition, by Terry Winograd and Fernando Flores.
Yeah, I am advancing the hypothesis that, in humans, the comprehension of indexicality relies on embodied pointing at its core [...] Whether all possible minds must rely on such a mechanism, I couldn’t possibly guess. But I am persuaded humans do (a lot of) it with their bodies.
But wait; whether all possible minds must rely on such a mechanism is the entire question at hand! Humans implement this feature in some particular way? Fine; but this thread started by discussing what AIs and robots must do to implement the same feature. If implementation-specific details in humans don’t tell us anything interesting about implementation constraints in other minds, especially artificial minds which we are in theory free to place anywhere in mind design space, then the entire topic is almost completely irrelevant to an AI discussion (except possibly as an example of “well, here is one way you could do it”).
In most European cultures, we use speaker-relative deictics. If I point to the southeast while facing south and say “there”, I mean “generally to my front and left”. But if I turn around and face north, I will point to the northwest and say “there” to mean the same thing, i.e., “generally to my front and left.”
Er, what? I thought I was a member of a European culture, but I don’t think this is how I use the word “there”. If I point to some direction while facing somewhere, and say “there”, I mean… “in the direction I am pointing”.
The only situation when I’d use “there” in the way you describe is if I were describing some scenario involving myself located somewhere other than my current location, such that absolute directions in the story/scenario would not be the same as absolute directions in my current location.
In a steep river valley in New Guinea, “down” could mean “toward the river” and “up” could mean “away from the river”. And “here” could mean “at the river” while “there” could mean “not at the river”.
If this is accurate, then why on earth would we map this word in this language to the English “there”? It clearly does not remotely resemble how we use the word “there”, so this seems to be a case of poor translation rather than an example of cultural differences.
In a number of Native American languages, the pointing is always to a cardinal direction. [...] The cultural variability and place-specificity of language were not widely known to Western linguists until about ten years ago. For a long time, it was assumed that person-relative orientation was a biological constraint on meaning.
Yeah, actually, this research I was aware of. As I recall, the Native Americans in question had some difficulty understanding the Westerners’ concepts of speaker-relative indexicals. But note: if we can have such different concepts of indexicality, despite sharing the same pointing digits and whatnot… it seems premature, at best, to suggest that said hardware plays such a key role in our concept formation, much less in the possibility of having such concepts at all.
How’s this: “Embodied action in the world with a cultural twist on top” is the grounding point at the bottom of the symbol expansion for human meanings, linguistic and otherwise.
Ultimately, the interesting aspect of this entire discussion (imo, of course) is what these human-specific implementation details can tell us about other parts of mind design space. I remain skeptical that the answer is anything other than “not much”. (Incidentally, if you know of papers/books that address this aspect specifically, I would be interested.)
Let me refer you to Computation and Human Experience, by Philip E. Agre, and to Understanding Computers and Cognition, by Terry Winograd and Fernando Flores.
Can you summarize the salient parts?