That’s interesting re: LLMs as having “conceptual interpretability” by their very nature. I guess that makes sense, since some degree of conceptual interpretability naturally emerges given 1) a sufficiently large and diverse training set and 2) sparsity constraints. LLMs have both: definitely #1, and #2 given regularization and some practical upper bounds on total number of parameters. And then there is your point that LLMs are literally trained to create output we can interpret.
I wonder about representations formed by a shoggoth. For the most efficient prediction of what humans want to see, the shoggoth would seemingly form representations very similar to ours. Or would it? Would its representations be more constrained by, and therefore shaped by, its theory of human mind, or by its own affordances model? Like, would its weird alien worldview percolate into its theory-of-human-mind representations? Or would its alienness show up not as a weird theory of human mind so much as in everything else going on in the shoggoth’s mind?
More generically, say there’s System X with at least moderate complexity. One generally intelligent creature learns to predict System X with N% accuracy, but from context A (which includes its purpose for learning System X / goals for it). Another generally intelligent creature learns to predict System X with N% accuracy, but from a very different context B (it has very different goals and a different background). To what degree would we expect their representations to be similar / interpretable to one another? How does that change given the complexity of the system, the value of N, etc?
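Here’s a toy way to make “similar / interpretable to one another” a bit more concrete. This is just a sketch under my own assumptions, not anything from the paper or the post: System X has two hidden factors `a` and `b`, creature A stores them directly, and creature B stores their sum and difference. Both codes carry the same information, so either creature could predict X equally well. I compare them two ways: a naive axis-by-axis correlation, and linear CKA (a standard similarity measure that ignores rotations and uniform rescaling). All names here are made up for illustration.

```python
import math
import random

def center(cols):
    """Subtract each column's mean; cols is a list of equal-length lists."""
    out = []
    for c in cols:
        m = sum(c) / len(c)
        out.append([v - m for v in c])
    return out

def gram(cols_p, cols_q):
    """Dot product of every column of P with every column of Q."""
    return [[sum(x * y for x, y in zip(p, q)) for q in cols_q] for p in cols_p]

def frob(m):
    """Frobenius norm of a small matrix given as nested lists."""
    return math.sqrt(sum(v * v for row in m for v in row))

def linear_cka(cols_x, cols_y):
    """Linear CKA between two representations of the same samples."""
    x, y = center(cols_x), center(cols_y)
    return frob(gram(y, x)) ** 2 / (frob(gram(x, x)) * frob(gram(y, y)))

def pearson(u, v):
    """Plain Pearson correlation between two single features."""
    (u,), (v,) = center([u]), center([v])
    num = sum(p * q for p, q in zip(u, v))
    return num / math.sqrt(sum(p * p for p in u) * sum(q * q for q in v))

rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(2000)]  # hidden factor 1 of System X
b = [rng.gauss(0, 1) for _ in range(2000)]  # hidden factor 2 of System X

rep_a = [a, b]  # creature A: stores the factors as-is
rep_b = [[p + q for p, q in zip(a, b)],  # creature B: stores sum...
         [p - q for p, q in zip(a, b)]]  # ...and difference

axis_match = pearson(rep_a[0], rep_b[0])     # compare "first neuron" to "first neuron"
subspace_match = linear_cka(rep_a, rep_b)    # compare the codes as whole spaces

print(f"axis-by-axis correlation: {axis_match:.3f}")
print(f"linear CKA:               {subspace_match:.3f}")
```

In this toy case the axis-by-axis number should land around 0.7 while CKA comes out at 1 (up to floating point), i.e. the two codes look only partly alike neuron-by-neuron but are the same thing up to a linear translation. Whether real contexts A and B differ by something that benign is exactly the open question.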
Anyway, I really just came here to drop this paper—https://arxiv.org/pdf/2311.13110.pdf—re: @Sodium’s wondering “some suitable loss function that incentivizes the model to represent its learned concepts in an easily readable form.” I’m curious about the same question, more from the applied standpoint of how to get a model to learn “good” representations faster. I haven’t played with it yet tho.
I tend to think that more-or-less how we interpret the world is the simplest way to interpret it (at least for the mesa-scale of people and technologies). I doubt there’s a dramatically different parsing that makes more sense. The world really seems to be composed of things made of things, that do things to things for reasons based on beliefs and goals. But this is an intuition.
Clever compressions of complex systems, and better representations of things outside of our evolved expertise, like particle physics, sociology and economics, seem quite possible.
Good citation; I meant to mention it. There’s a nice post on it.
If System X is of sufficient complexity / high dimensionality, it’s fair to say that there are many possible dimensional reductions, right? And not just globally better or worse options; instead, reductions that are more or less useful for a given context.
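A tiny illustration of what I mean by “more or less useful for a given context,” with everything in it assumed for the sake of the example: System X has three independent state variables, each candidate reduction keeps just one of them, and two observers care about different things (`goal_a` and `goal_b` are hypothetical). I score each reduction by how much of the goal’s variance it explains (squared correlation).

```python
import random

def r_squared(u, v):
    """Squared Pearson correlation: share of v's variance a 1-D code u explains."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    var_u = sum((p - mu) ** 2 for p in u)
    var_v = sum((q - mv) ** 2 for q in v)
    return cov * cov / (var_u * var_v)

rng = random.Random(1)
n = 1000
# System X: three independent state variables, n observations each.
state = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]

# Two contexts that care about different aspects of the same system.
goal_a = [x0 + 0.2 * x1 for x0, x1 in zip(state[0], state[1])]
goal_b = [x2 + 0.2 * x1 for x2, x1 in zip(state[2], state[1])]

# Candidate reductions: keep only variable 0, only variable 1, or only variable 2.
scores_a = [r_squared(axis, goal_a) for axis in state]
scores_b = [r_squared(axis, goal_b) for axis in state]

best_for_a = scores_a.index(max(scores_a))
best_for_b = scores_b.index(max(scores_b))

print("context A scores:", [round(s, 2) for s in scores_a], "best:", best_for_a)
print("context B scores:", [round(s, 2) for s in scores_b], "best:", best_for_b)
```

The reduction that is nearly lossless for context A is close to useless for context B, and vice versa, so neither is “globally better.” Real systems obviously won’t factor this cleanly.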
However, a shoggoth’s theory-of-human-mind context would probably be a lot like our context, so it’d make sense that the representations would be similar.