I do not think SAE results to date contribute very strong evidence in either direction. “Extract all the abstractions from a layer” is not obviously an accurate statement of what they do, and the features they do find do not obviously faithfully and robustly map to human concepts, and even if they did it’s not clear that they compose in human-like ways. They are some evidence, but weak.
(In fact, we know that the fraction of features extracted is probably quite small—for example, the 16M latent GPT-4 autoencoder only captures 10% of the downstream loss in terms of equivalent pretraining compute.)
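(For concreteness, here is a rough sketch of what "extracting features from a layer" means mechanically: a sparse autoencoder is trained to reconstruct a layer's activations through an overcomplete, sparsity-penalized bottleneck, and the learned latents are the "features." This is only a minimal illustration with made-up dimensions and hyperparameters, assuming PyTorch; it is not the actual GPT-4 or Claude setup.)

```python
# Minimal sparse-autoencoder sketch (illustrative only; not the 16M-latent GPT-4 configuration).
# It learns an overcomplete dictionary of "features" from a layer's activations,
# with an L1 penalty pushing each activation to be explained by only a few features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the layer activation
        return x_hat, f

# Toy training loop; random vectors stand in for real residual-stream activations.
d_model, n_features, l1_coeff = 512, 4096, 1e-3   # placeholder sizes, not the paper's
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for _ in range(100):
    acts = torch.randn(256, d_model)
    x_hat, f = sae(acts)
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The "downstream loss" figure above is a different evaluation: it asks how much of the model's performance is recovered when the layer's activations are replaced by the autoencoder's reconstruction, expressed in equivalent pretraining compute.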
I would certainly agree that this evidence is new, preliminary, and not dispositive. But I would claim that it’s not at all what I’d expect to find in the most abstracted layer of something matching the following description:
…the stuff within that opaque container is, very likely, incredibly alien—nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
Instead we’re finding things like the Golden Gate Bridge being related to Alcatraz, to San Francisco, to bridges, and to tourist destinations: i.e. something that looks like a more abstract version of WordNet. This is a semantic structure that looks like it should support human metaphor and simile. And when we look for concepts that seem related to basic alignment issues, we can find them. (I also don’t view any of this as very surprising, given how LLMs are trained, distilling their intelligence from humans, though I’m delighted to have it confirmed at scale.)
(I don’t offhand recall where that Eliezer quote comes from: that things were going to work out this well for us was vastly less obvious, say, 5 years ago, and not entirely clear even a year ago. Obviously Eliezer is allowed to update his worldview as discoveries are made, just like anyone else.)