Oh cool! I put some effort into pursuing a very similar idea earlier:
I’ll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms.
but wasn’t sure of how exactly to test it or work on it so I didn’t get very far.
One idea that I had for testing it was rather different; make use of brain imaging research that seems able to map shared concepts between humans, and see whether that methodology could be used to also compare human-AI concepts:
A particularly fascinating experiment of this type is that of Shinkareva et al. (2011), who showed their test subjects both the written words for different tools and dwellings and, separately, line-drawing images of the same tools and dwellings. A machine-learning classifier trained on image-evoked activity was used to predict word-evoked activity, and vice versa, achieving high accuracy on category classification for both tasks. Even more interestingly, the representations seemed to be similar between subjects. Training the classifier on the word representations of all but one participant, and then having it classify the image representations of the left-out participant, also achieved reliable (p<0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.
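To make the leave-one-subject-out, cross-modal analysis concrete, here is a toy sketch in code. Everything is synthetic and simplified (the classifier choice and the noisy-shared-template data model are my own stand-ins, not what Shinkareva et al. actually used): each simulated subject's "neural activity" is a noisy copy of shared per-category templates, a classifier is trained on the word-evoked data of all but one subject, and tested on the left-out subject's image-evoked data.

```python
# Sketch of a leave-one-subject-out, cross-modal decoding analysis in the
# spirit of Shinkareva et al. (2011). All data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_subjects, n_categories, n_trials, n_voxels = 12, 2, 20, 50

# Shared category "templates"; each subject/modality adds its own noise.
templates = rng.normal(size=(n_categories, n_voxels))

def simulate(seed, noise=1.0):
    r = np.random.default_rng(seed)
    X, y = [], []
    for c in range(n_categories):
        X.append(templates[c] + noise * r.normal(size=(n_trials, n_voxels)))
        y.extend([c] * n_trials)
    return np.vstack(X), np.array(y)

correct = 0
for left_out in range(n_subjects):
    # Train on "word-evoked" data from all other subjects...
    train = [simulate(2 * s) for s in range(n_subjects) if s != left_out]
    X_train = np.vstack([x for x, _ in train])
    y_train = np.concatenate([y for _, y in train])
    # ...and test on the left-out subject's "image-evoked" data.
    X_test, y_test = simulate(2 * left_out + 1)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    correct += (clf.predict(X_test) == y_test).mean() > 0.5
print(f"above-chance decoding for {correct}/{n_subjects} left-out subjects")
```

In this idealized setup decoding transfers across subjects easily; the interesting empirical fact is that it (mostly) transfers in real brains too.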
We can now hypothesize some ways of testing the similarity of the AI’s concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human’s and an AI’s internal representations of concepts. Take a human’s neural activation when they’re thinking of some concept, and then take the AI’s internal activation when it is thinking of the same concept, and plot them in a shared space similar to the English-Mandarin translation. To what extent do the two concept geometries have similar shapes, allowing one to take a human’s neural activation of the word “cat” to find the AI’s internal representation of the word “cat”? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.
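A minimal sketch of that translation test, using orthogonal Procrustes alignment (my choice of method, analogous to how cross-lingual word-embedding spaces are aligned) on synthetic paired concept representations:

```python
# Fit an orthogonal map from one concept space onto another and check
# whether concept i in system A lands nearest concept i in system B.
# Synthetic data: system B is a rotated, noisy copy of system A's geometry.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(1)
n_concepts, dim = 30, 10

A = rng.normal(size=(n_concepts, dim))             # e.g. human activations
Q = np.linalg.qr(rng.normal(size=(dim, dim)))[0]   # hidden rotation
B = A @ Q + 0.05 * rng.normal(size=(n_concepts, dim))  # e.g. AI activations

R, _ = orthogonal_procrustes(A, B)   # best rotation mapping A onto B
mapped = A @ R

# Nearest-neighbor "translation" accuracy.
dists = np.linalg.norm(mapped[:, None, :] - B[None, :, :], axis=-1)
accuracy = (dists.argmin(axis=1) == np.arange(n_concepts)).mean()
print(f"translation accuracy: {accuracy:.2f}")
```

The real question, of course, is how high this accuracy would be when A and B come from a brain and a trained model rather than from one space rotated into the other.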
One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human’s neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.
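A hypothetical toy version of that constraint, to make it concrete: a frozen linear classifier stands in for one trained on human neural representations, and the AI's representations are adjusted by gradient descent until that classifier identifies them correctly. (The AI's actual prediction-task loss is omitted for brevity; in a real setup it would be optimized jointly with this alignment penalty.)

```python
# Gradient descent on AI representations under an "alignment penalty":
# a frozen "human-trained" linear classifier must classify them correctly.
import numpy as np

rng = np.random.default_rng(2)
n, dim = 100, 5
labels = rng.integers(0, 2, size=n)

# Frozen linear classifier (stand-in for one trained on human neural data).
w_human = rng.normal(size=dim)

# AI representations start random; descend on the logistic alignment loss.
reps = rng.normal(size=(n, dim))
lam, lr = 1.0, 0.1
for _ in range(200):
    p = 1 / (1 + np.exp(-(reps @ w_human)))
    grad = lam * (p - labels)[:, None] * w_human  # d(logistic loss)/d(reps)
    reps -= lr * grad

acc = ((reps @ w_human > 0).astype(int) == labels).mean()
print(f"human-classifier accuracy on AI reps: {acc:.2f}")
```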
The farthest that I got with my general approach was “Defining Human Values for Value Learners”. It felt (and still feels) to me like concepts are quite task-specific: two people in the same environment will develop very different concepts depending on the job that they need to perform… or even depending on the tools that they have available. The spatial concepts of sailors practicing traditional Polynesian navigation are sufficiently different from those of modern sailors that the “traditionalists” have extreme difficulty understanding what the kinds of bird’s-eye-view maps we’re used to are even representing—and vice versa; Western anthropologists had considerable difficulties figuring out what exactly it was that the traditional navigation methods were even talking about.
(E.g. the traditional way of navigating from one island to another involves imagining a third “reference” island and tracking its location relative to the stars as the journey proceeds. Some anthropologists thought that this third island was meant as an “emergency island” to escape to in case of unforeseen trouble, an interpretation challenged by the fact that the reference island may sometimes be completely imagined, so obviously not suitable as a backup port. Chapter 2 of Hutchins 1995 has a detailed discussion of the way that different tools for performing navigation affect one’s conceptual representations, including the difficulties both the anthropologists and the traditional navigators had in trying to understand each other due to having incompatible concepts.)
Another example is legal concepts; e.g. American law traditionally held that a landowner controlled not only his land but also everything above it, to “an indefinite extent, upwards”. The invention of the airplane raised the question: could landowners forbid airplanes from flying over their land, or was ownership of the land limited to some specific height, above which the landowners had no control?
Eventually, the law was altered so that landowners couldn’t forbid airplanes from flying over their land. Intuitively, one might think that this decision was made because the redefined concept did not substantially weaken the position of landowners, while allowing for entirely new possibilities for travel. In that case, we can think that our concept for landownership existed for the purpose of some vaguely-defined task (enabling the things that are commonly associated with owning land); when technology developed in a way that the existing concept started interfering with another task we value (fast travel), the concept came to be redefined so as to enable both tasks most efficiently.
This seemed to suggest an interplay between concepts and values; our values are to some extent defined in terms of our concepts, but our values, and the tools that we have available for furthering them, also affect how we define our concepts. This line of thought led me to think that this interaction must be rooted in what was evolutionarily beneficial:
… evolution selects for agents which best maximize their fitness, while agents cannot directly optimize for their own fitness as they are unaware of it. Agents can however have a reward function that rewards behaviors which increase the fitness of the agents. The optimal reward function is one which maximizes (in expectation) the fitness of any agents having it. Holding the intelligence of the agents constant, the closer an agent’s reward function is to the optimal reward function, the higher their fitness will be. Evolution should thus be expected to select for reward functions that are closest to the optimal reward function. In other words, organisms should be expected to receive rewards for carrying out tasks which have been evolutionarily adaptive in the past. [...]
We should expect an evolutionarily successful organism to develop concepts that abstract over situations that are similar with regards to receiving a reward from the optimal reward function. Suppose that a certain action in state s1 gives the organism a reward, and that there are also states s2–s5 in which taking some specific action causes the organism to end up in s1. Then we should expect the organism to develop a common concept for being in the states s2–s5, and we should expect that concept to be “more similar” to the concept of being in state s1 than to the concept of being in some state that was many actions away.
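As a toy numeric illustration of the quoted argument (the discount factor and the value-decays-with-distance formula are my simplifications, not part of the original): if taking an action in s1 yields reward 1, and value falls off geometrically with the number of actions needed to reach s1, then s2–s5 all share a value that is far closer to s1's than a faraway state's is.

```python
# Discounted "value" of each state as a function of its distance (in
# actions) from the rewarded state s1.
gamma = 0.9
distances = {"s1": 0, "s2": 1, "s3": 1, "s4": 1, "s5": 1, "s_far": 10}
values = {s: gamma ** d for s, d in distances.items()}
print(values)  # s2-s5 identical; all much closer to s1 than s_far is
```

The shared value for s2–s5 is the kind of statistic a common concept could abstract over.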
In other words, we have some set of innate values that our brain is trying to optimize for; if concepts are task-specific, then this suggests that the kinds of concepts that will be natural to us are those which are beneficial for achieving our innate values given our current (social, physical and technological) environment. E.g. for a child, the concepts of “a child” and “an adult” will seem very natural, because there are quite a few things that an adult can do for furthering or hindering the child’s goals that fellow children can’t do. (And a specific subset of all adults named “mom and dad” is typically even more relevant for a particular child than any other adults are, making this an even more natural concept.)
That in turn seems to suggest that in order to see what concepts will be natural for humans, we need to look at fields such as psychology and neuroscience in order to figure out what our innate values are and how the interplay of innate and acquired values develops over time. I’ve had some hope that some of my later work on the structure and functioning of the mind would be relevant for that purpose.
On the role of values: values clearly do play some role in determining which abstractions we use. An alien who observes Earth but does not care about anything on Earth’s surface will likely not have a concept of trees, any more than an alien which has not observed Earth at all. Indifference has a similar effect to lack of data.
However, I expect that the space of abstractions is (approximately) discrete. A mind may use the tree-concept, or not use the tree-concept, but there is no natural abstraction arbitrarily-close-to-tree-but-not-the-same-as-tree. There is no continuum of tree-like abstractions.
So, under this model, values play a role in determining which abstractions we end up choosing, from the discrete set of available abstractions. But they do not play any role in determining the set of abstractions available. For AI/alignment purposes, this is all we need: as long as the set of natural abstractions is discrete and value-independent, and human concepts are drawn from that set, we can precisely define human concepts without a detailed model of human values.
Also, a mostly-unrelated note on the airplane example: when we’re trying to “define” a concept by drawing a bounding box in some space (in this case, a literal bounding box in physical space), it is almost always the case that the bounding box will not actually correspond to the natural abstraction. This is basically the same idea as the cluster structure of thingspace and rubes vs bleggs. (Indeed, Bayesian clustering is directly interpretable as abstraction discovery: the cluster-statistics are the abstract summaries, and they induce conditional independence between the points in each cluster.) So I would interpret the airplane example (and most similar examples in the legal system) not as a change in a natural concept, but rather as humans being bad at formally defining their natural concepts, and needing to update their definitions as new situations crop up. The definitions are not the natural concepts; they’re proxies.
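The conditional-independence point can be illustrated with a rube/blegg-style toy example (my construction): pooled over two Gaussian clusters, the two features look strongly correlated, but once you condition on cluster membership (the abstract summary), the correlation nearly vanishes.

```python
# Cluster membership screens off the correlation between features.
import numpy as np

rng = np.random.default_rng(3)
blue = rng.normal([0, 0], 0.5, size=(500, 2))  # one cluster ("bleggs")
red = rng.normal([3, 3], 0.5, size=(500, 2))   # the other ("rubes")
pooled = np.vstack([blue, red])

pooled_corr = np.corrcoef(pooled.T)[0, 1]  # high: features look linked
within_corr = np.corrcoef(blue.T)[0, 1]    # near zero once cluster is known
print(f"pooled: {pooled_corr:.2f}, within-cluster: {within_corr:.2f}")
```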
However, I expect that the space of abstractions is (approximately) discrete. A mind may use the tree-concept, or not use the tree-concept, but there is no natural abstraction arbitrarily-close-to-tree-but-not-the-same-as-tree. There is no continuum of tree-like abstractions.
This doesn’t seem likely to me. Language is optimized for communicating ideas, but let’s take a simpler example than language: transmitting a 256x256 image of a dog or something, with a palette of 100 colors, and minimizing L2 error. I think that:
The palette will be slightly different when minimizing L2 error in RGB space rather than HSL space
The palette will be slightly different when using a suboptimal algorithm (e.g. greedily choosing colors)
The palette will be slightly different when the image is of a slightly different dog
The palette will be slightly different when the image is of the same dog from a different angle
By analogy, shouldn’t concepts vary continuously with small changes in the system’s values, cognitive algorithms, training environment, and perceptual channels?
The key there is “slightly different”.

Another analogy: consider this clustering problem.
Different clustering algorithms will indeed find slightly different parameterizations of the clusters, slightly different cluster membership probabilities, etc. But those differences will be slight differences. We still expect different algorithms to cluster things in one of a few discrete ways—e.g. identifying the six main clusters, or only two (top and bottom, projected onto y-axis), or three (left, middle, right, projected onto x-axis), maybe just finding one big cluster if it’s a pretty shitty algorithm, etc. We would not expect to see an entire continuum of different clusters found, where the continuum ranges from “all six separate” to “one big cluster”; we would expect a discrete difference between those two clusterings.
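This can be checked directly. On six well-separated blobs (my own layout, chosen to mirror the described picture), k-means with different cluster counts recovers clusterings that merge whole blobs: the cluster sizes land on multiples of the blob size, rather than varying continuously.

```python
# Different cluster counts on six well-separated blobs produce a few
# discrete clusterings (unions of whole blobs), not a continuum.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Six tight blobs on a 3x2 grid, 100 points each.
centers = [(x, y) for x in (0, 5, 10) for y in (0, 5)]
data = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in centers])

sizes_by_k = {}
for k in (6, 3, 2):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).labels_
    sizes_by_k[k] = sorted(np.bincount(labels).tolist())
    print(k, sizes_by_k[k])  # e.g. six blobs, three columns, two groups
```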