Given that Anthropic basically extracted the abstractions from the middle layer of Claude Sonnet, and OpenAI recently did the same for models up to GPT-4, and that most of the results they found were obvious natural abstractions to a human, I’d say we now have pretty conclusive evidence that you’re correct and that (your model of) Eliezer is mistaken on this. Which isn’t really very surprising for models whose base model was trained on the task of predicting text from the Internet: they were distilled from humans, and they think similarly.
Note that for your argument above it’s not fatal if the AI’s ontology is a superset of ours: as long as we’re comprehensible to them with a relatively short description, they can understand what we want.
(I’m the first author of the linked paper on GPT-4 autoencoders.)
I think many people are heavily overrating how human-explainable SAEs today are, because it’s quite subtle to determine whether a feature is genuinely explainable. SAE features today, even in the best SAEs, are generally not explainable with simple, human-understandable explanations. By “explainable,” I mean there is a human-understandable procedure for labeling whether the feature should activate on a given token (and also how strong the activation should be, but I’ll ignore that for now), such that your procedure predicts an activation if and only if the latent actually activates.
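To make the “if and only if” criterion concrete, here is a minimal sketch of scoring an explanation in both directions against a latent’s actual activations. The names here (`predict_from_explanation`, the token/activation inputs) are hypothetical placeholders for illustration, not anything from the paper:

```python
# Sketch: an explanation is only good if it predicts activations in BOTH
# directions: it fires when the latent fires, and not otherwise.
from typing import Callable, Sequence

def explanation_agreement(
    tokens: Sequence[str],
    latent_activations: Sequence[float],
    predict_from_explanation: Callable[[Sequence[str], int], bool],
    threshold: float = 0.0,
) -> dict:
    """Score a human-written rule against a latent's actual activations."""
    tp = fp = fn = tn = 0
    for i, act in enumerate(latent_activations):
        fires = act > threshold                          # did the latent actually activate?
        predicted = predict_from_explanation(tokens, i)  # does the rule say it should?
        if predicted and fires:
            tp += 1
        elif predicted and not fires:
            fp += 1  # rule over-predicts (e.g. a "stop" rule firing where the latent doesn't)
        elif not predicted and fires:
            fn += 1  # rule misses activations the latent actually has
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}
```

Looking only at top activating examples effectively measures just the recall-like direction; the precision-like direction is where seemingly clean features tend to fall apart.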
There are a few problems with interpretable-looking features:
It is insufficient that latent-activating samples have a common explanation. You also need the opposite direction: things that match the explanation should activate the latent. For example, we found a neuron in GPT-2 that appears to activate on the word “stop,” but actually most instances of the word “stop” don’t activate the neuron. It turns out that this was not really a “stop” neuron, but rather a “don’t stop/won’t stop” neuron. While in this case there was a different but still simple explanation, it’s entirely plausible that many features just cannot be explained with simple explanations. This problem gets worse as autoencoders scale, because their explanations will get more and more specific.
People often look at the top activating examples of a latent, but this provides a heavily misleading picture of how monosemantic the latent is, even just in that one direction. It’s very common for features to have extremely good top activations but then terrible nonzero activations. This is why our feature visualizer shows random nonzero activations before the top activations (see the sketch after this list).
Oftentimes, it is actually harder to simulate a latent than it looks. For example, we often find latents that activate on words in a specific context (say, financial news articles), but they seem to activate on random words inside those contexts, and we don’t have a good explanation for why they activate on some words but not others.
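As a minimal illustration of the point about top activations, here is a sketch of sampling random nonzero activations alongside the top-k ones for inspection. The data layout is a hypothetical simplification, not the feature visualizer’s actual code:

```python
# Sketch: judging a latent only by its top-k activating tokens can make it
# look far more monosemantic than it is; random nonzero activations give a
# more honest picture.
import random

def inspection_samples(activations, k=20, seed=0):
    """activations: list of (token, value) pairs for one latent across a dataset."""
    nonzero = [(tok, val) for tok, val in activations if val > 0]
    top_k = sorted(nonzero, key=lambda tv: tv[1], reverse=True)[:k]
    rng = random.Random(seed)
    random_nonzero = rng.sample(nonzero, min(k, len(nonzero)))
    # Check your explanation against random_nonzero first; if it only fits
    # top_k, the latent is probably less interpretable than it looks.
    return {"random_nonzero": random_nonzero, "top_k": top_k}
```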
We also discuss this in the evaluation section of our paper on GPT-4 autoencoders. The ultimate metric we introduce for whether the features are explainable is the following: simulate each latent with your best explanation of that latent, then run the simulated values through the decoder and the rest of the model, and look at the downstream loss. This procedure is very expensive, so making it feasible to run is a nontrivial research problem, but I predict that basically all existing autoencoders will score terribly on this metric.
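A rough sketch of what this evaluation might look like, assuming hypothetical `model`, `sae`, and `simulator` interfaces (none of these are the paper’s actual API):

```python
# Sketch: replace the SAE's true latent activations with activations
# *simulated from human explanations*, decode, run the rest of the model,
# and compare downstream loss. The interfaces below are stand-ins.
import torch

def explanation_downstream_loss(model, sae, simulator, tokens):
    """
    model:     run_to_layer(tokens) -> residual activations at the SAE's layer,
               run_from_layer(acts, tokens) -> language-model loss
    sae:       encode(acts) -> latents, decode(latents) -> acts
    simulator: (tokens, latent_index) -> predicted activation values, computed
               only from the human-written explanation for that latent
    """
    with torch.no_grad():
        acts = model.run_to_layer(tokens)
        true_latents = sae.encode(acts)            # shape: (seq, n_latents)

        # Build simulated latents from the explanations rather than the encoder.
        # Looping over every latent is what makes this evaluation so expensive.
        sim_latents = torch.zeros_like(true_latents)
        for j in range(true_latents.shape[-1]):
            sim_latents[:, j] = simulator(tokens, j)

        loss_true = model.run_from_layer(sae.decode(true_latents), tokens)
        loss_sim = model.run_from_layer(sae.decode(sim_latents), tokens)

    # The gap measures how much of the SAE's reconstruction the explanations
    # actually account for.
    return loss_sim - loss_true
```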
I do not think SAE results to date contribute very strong evidence in either direction. “Extract all the abstractions from a layer” is not obviously an accurate statement of what they do, and the features they do find do not obviously faithfully and robustly map to human concepts, and even if they did it’s not clear that they compose in human-like ways. They are some evidence, but weak.
(In fact, we know that the fraction of features extracted is probably quite small—for example, the 16M latent GPT-4 autoencoder only captures 10% of the downstream loss in terms of equivalent pretraining compute.)
I would certainly agree that this evidence is new, preliminary, and not dispositive. But I would claim that it’s not at all what I’d expect to find in the most abstracted layer of something matching the following description:
…the stuff within that opaque container is, very likely, incredibly alien—nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
Instead, we’re finding things like the Golden Gate Bridge being related to Alcatraz, to San Francisco, to bridges, and to tourist destinations: i.e. something that looks like a more abstract version of WordNet. This is a semantic structure that looks like it should understand human metaphor and simile. And when we look for concepts that seem like they would be related to basic alignment issues, we can find them. (I also don’t view all this as very surprising, given how LLMs are trained, distilling their intelligence from humans, though I’m delighted to have it confirmed at scale.)
(I don’t offhand recall when that Eliezer quote is from: the fact that this was going to work out this well for us was vastly less obvious, say, 5 years ago, and not entirely clear even a year ago. Obviously Eliezer is allowed to update his worldview as discoveries are made, just like anyone else.)
(It seems to me that you didn’t read Eliezer’s comment response to this, which also aligns with my model. Finding any overlap between abstractions is extremely far from showing that the abstractions relevant to controlling or aligning AI systems will match.)
LLMs would be expected to have ontologies that heavily overlap with ours; the question is what capability boosting does to the AI’s ontology.