Sure, but I think that human cognition tends to operate at a level of abstraction above the configuration of atoms in a 3D environment. “That is a chair” is a useful way to reason about an environment, whereas “that is a configuration of pixels that corresponds to a chair when projected at a certain angle in certain lighting conditions” must first be converted to “that is a chair” before anything useful can be done. Text just has a lot of useful preprocessing applied already, and is far more compressed.
The preprocessing itself is one of the main things we need to understand (I would even argue it’s the main thing) if our interpretability methods are ever going to tell us how the stuff-inside-the-net relates to the stuff-in-the-environment (which is what we actually care about).
I’m not sure I understand what you’re driving at, but insofar as I do, here’s a response: I have lots of concepts and abstractions over the physical world (like “chair”). I don’t have many concepts or abstractions over strings of language, except as factored through the physical world. (I have some, like register or language, but they don’t feel that “final” as concepts.)
As for factoring my predictions of language through the physical world: a lot of the simplest and most robust concepts I have are just nouns, so they’re already represented by the tokenisation machinery, and I can’t do interesting interp to pick them out.
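A minimal sketch of what I mean, assuming OpenAI’s tiktoken library and its cl100k_base encoding (purely illustrative, nothing here is load-bearing):

```python
# Minimal sketch: common nouns often map to single token IDs, so the
# "concept boundary" is drawn by the tokeniser before the network ever
# sees the input; there is no circuit to find that assembles them.
# Assumes the tiktoken library and the cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in [" chair", " table", " dog"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(repr(word), "->", ids, pieces)
# Each of these (with its leading space) typically encodes to a single
# token, so the noun-concept is already a primitive of the input.
```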