Please (re)explain your personal jargon
Technical jargon is quite useful. It lets us attach sharp definitions to natural phenomena, which is handy for precise scientific discussion. However, it can also be confusing, since you are creating or re-using words to mean some new, specific, precise thing. So, whenever you write a stand-alone piece of text (e.g. a scientific article or an internet post), you need to define any jargon you use that is not yet accepted as common knowledge within the field you are writing for.
Once upon a time, long before personal computers were invented (much less anything like ‘deep neural nets’), some scientists were discussing a concept. This concept was something like “obvious and useful features in an environment which tend to independently reoccur as concepts in the minds of intelligent agents”. In fact, these scientists realized that these two attributes, obviousness and usefulness, come in degrees, and that higher degrees of each had positive predictive value for whether the concepts would arise.
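To make that ‘positive predictive value’ claim concrete, here is a minimal sketch with made-up numbers (the scores, coefficients, and labels are hypothetical, not taken from any actual study) of how one could model the probability that a group develops a word for a concept as a function of two scalar attributes:

```python
# Toy illustration (hypothetical data): model whether a group develops a word
# for a concept as a function of two scalar attributes, obviousness and usefulness.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Each row is a (concept, group) pair with made-up scores in [0, 1].
n = 500
obviousness = rng.uniform(0, 1, n)
usefulness = rng.uniform(0, 1, n)

# Assume (for illustration) that the true probability of the concept arising
# increases with both scores.
logit = -3.0 + 2.5 * obviousness + 3.5 * usefulness
has_word = rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([obviousness, usefulness])
model = LogisticRegression().fit(X, has_word)

# Both fitted coefficients should come out positive, i.e. each attribute
# has positive predictive value for whether the concept arises.
print("coefficients (obviousness, usefulness):", model.coef_[0])
print("P(word | obviousness=0.9, usefulness=0.9):",
      model.predict_proba([[0.9, 0.9]])[0, 1])
```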
This was in the time before ubiquitous computers, so the intelligent agents being studied were humans. In this disconnected pre-‘Information Era’ age there were still many groups of humans who had never communicated with each other. They had infinite degrees of Kevin Bacon, if you can imagine that! The scientists, whom today we would call ‘social anthropologists’ or ‘linguists’, went to these isolated groups and took careful note of which concepts these people had words for. A group of people having a commonly understood word for a thing is a good proxy for them having that concept in their minds.
One example of a set of concepts that was thoroughly studied is color. Not all colors are equally obvious and useful. Some pairs of colors are harder for some humans to distinguish, such as blue/green, red/green, or yellow/green. Nevertheless, most humans can distinguish reasonably well between strong hue differences in natural environments, so there isn’t a huge difference in ‘obviousness’ between colors. This provides a nice natural experiment isolating ‘usefulness’. Colors which are more useful in an environment are much more likely to exist as concepts in the minds of intelligent agents interacting with that environment. The scientists found that ‘red’ tended to be a more useful color in most of the environments studied, and was more consistently represented in independent languages than other colors. The usefulness of red is due in part to the physics and chemical makeup of Earth making red a ‘stand out’ color across species; evolution has thus convergently led to many different species using red to signal toxicity (honestly or not).
Unfortunately, such natural experiments with isolated groups are scarce, and we cannot deliberately experiment on groups of people, so these hypotheses have not been conclusively resolved. Thus, scientific debate is ongoing. For more details, see https://en.wikipedia.org/wiki/Linguistic_relativity_and_the_color_naming_debate
Similarly, number words were found to differ between groups in predictable ways. Words for [1,2] were much more common than words for [1,2,...,10], which in turn were more common than words for numbers up to 20. It seems strange to our modern minds, accustomed to infinitely many integers, to have a counting system with only three numbers [1,2,3] in it, but it is easy to see how [1,2,3] are more obvious and useful in a hunting/gathering environment than numbers above a hundred or a million. Interestingly, both humans and other smart animals like monkeys and ravens can count low numbers of objects even when they have no words at all for numbers. Here’s a link to a brief blurb about counting in humans: https://en.wikipedia.org/wiki/Numeral_%28linguistics%29#Basis_of_counting_system
Sometime much closer to now, some scientists (in particular Chris Olah, but others as well) noted that when a useful and obvious feature was present in a dataset, that feature would tend to be represented in a variety of different models trained on that dataset. Specifically, Chris Olah gives the example of finding features corresponding to the floppiness of dog ears in vision models trained to (among other things) distinguish dog breeds. Whether a dog’s ears tend to stand up or flop over is a useful distinguishing feature for breeds, and also fairly obvious to vision models.
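To illustrate the pattern (this is only a toy sketch with synthetic data, not Chris Olah’s actual experiments on real vision models), one way to check whether a feature recurs across independently trained models is to train a simple linear probe for that feature on each model’s hidden activations:

```python
# Toy sketch (synthetic data): check whether a useful, obvious feature
# ("ear floppiness") is linearly decodable from the hidden layers of two
# independently trained models.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "dogs": one informative feature (floppiness) plus noise features.
n = 2000
floppiness = rng.uniform(0, 1, n)
noise = rng.normal(0, 1, (n, 9))
X = np.column_stack([floppiness, noise])
breed = (floppiness + 0.1 * rng.normal(0, 1, n) > 0.5).astype(int)  # training label
floppy_ears = (floppiness > 0.5).astype(int)  # the attribute we probe for

X_tr, X_te, y_tr, y_te, f_tr, f_te = train_test_split(
    X, breed, floppy_ears, random_state=0)

def hidden_activations(model, X):
    # First hidden layer of an sklearn MLP with relu activation.
    return np.maximum(0, X @ model.coefs_[0] + model.intercepts_[0])

# Two "independent" models: same data, different random initializations.
for seed in (1, 2):
    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                        random_state=seed).fit(X_tr, y_tr)
    probe = LogisticRegression().fit(hidden_activations(mlp, X_tr), f_tr)
    acc = probe.score(hidden_activations(mlp, X_te), f_te)
    print(f"model seed {seed}: floppy-ear probe accuracy = {acc:.2f}")
```

If both probes succeed, the feature was learned by both models despite their different initializations, which is the kind of recurrence being described here.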
A few years later, another scientist named John Wentworth was trying to understand how we could better figure out what a complex machine learning model was ‘thinking’ about, or what it ‘understood’, even when the exact architecture and training history of that model varied significantly. He decided that a very important piece of finding concepts in the minds of intelligent agents was tracing them back to “obvious and useful features in an environment which tend to independently reoccur as concepts in the minds of intelligent agents”. That’s a bit of a mouthful, so he coined the term “natural abstractions” to mean this. I’m a big fan of his thinking in general and this piece of technical jargon in particular. I think the term ‘natural abstractions’ is a clear way to point to things in reality which may or may not end up well represented in the features of a particular dataset/environment, and may or may not end up represented as concepts in the ‘mind’ of a particular intelligent agent trained on that dataset. Not all dataset features are natural abstractions, but I think we’ll find that the most important ones are. Not all concepts are natural abstractions, but I think we’ll find that the most important ones are. Minds are sometimes faulty, and learning is sometimes messy, so there are concepts that just don’t map well to the ‘territory’ of natural abstractions. An example of a class of human concepts that maps poorly to natural abstractions is superstitions.
Anyway, the point of this post is that John Wentworth was nowhere near the first scientist to discuss “obvious and useful features in an environment which tend to independently reoccur as concepts in the minds of intelligent agents”. He invented jargon to describe this concept, and is doing important work trying to formalize it and describe its relevance to the AGI alignment problem.
I would very much like for John to (re)define his personal jargon ‘natural abstractions’ in any post of substantial length in which he uses it, because it is not yet a well-enough-established term in a well-established field that it makes sense to assume readers will already know it. I think this would help reduce the annoying tendency for the comment sections on John’s posts to get cluttered up with people asking things like, “What are these natural abstractions you are talking about? How sure are you that they even exist and are important to study?” It’s just new (and useful) terminology for a concept that has been acknowledged and discussed by scientists for many generations now. Scientists were writing about this in the 1800s. Linguists and social anthropologists describe it in human languages, psychologists describe it in individual people, and neuroscientists describe it in animals. And now, AI interpretability researchers are describing it in machine learning models.