We can go look for such structures in e.g. nets, see how well they seem to match our own concepts, and have some reason to expect they’ll match our own concepts robustly in certain cases.
Checking my own understanding with an example of what this might look like concretely:
Suppose you have a language model that can play Chess (via text notation). Presumably the model has some kind of internal representation of the game: the board state, the pieces, and strategy. Those representations are probably complicated linear combinations / superpositions of activations and weights somewhere within the model. Call this representation Λ′ in your notation.
If you just want a traditional computer program to play Chess, you can use much simpler (or at least more bare-metal / efficient) representations of the game, board state, and pieces, e.g. a 2-D array of integers or a bitboard, and write some relatively simple code to manipulate those data structures in ways that are valid according to the rules of Chess. Call this representation Λ in your notation.
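(A minimal sketch of what that "bare metal" Λ could look like. The piece codes and coordinate convention are purely illustrative, and legality checking, castling, en passant, etc. are all omitted; the point is only how simple the explicit data structure can be.)

```python
# Λ as an 8x8 array of integers: 0 = empty, positive = white, negative = black.
EMPTY, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING = 0, 1, 2, 3, 4, 5, 6

def initial_board():
    back_rank = [ROOK, KNIGHT, BISHOP, QUEEN, KING, BISHOP, KNIGHT, ROOK]
    board = [[EMPTY] * 8 for _ in range(8)]
    board[0] = [-p for p in back_rank]   # black back rank
    board[1] = [-PAWN] * 8               # black pawns
    board[6] = [PAWN] * 8                # white pawns
    board[7] = back_rank[:]              # white back rank
    return board

def apply_move(board, frm, to):
    """Relocate the piece at square `frm` to square `to` (no legality check here)."""
    (fr, fc), (tr, tc) = frm, to
    board[tr][tc] = board[fr][fc]
    board[fr][fc] = EMPTY
    return board

board = initial_board()
apply_move(board, (6, 4), (4, 4))  # 1. e4 in this coordinate scheme
```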
And, to the degree that the language model is actually capable of playing valid Chess (since that's when we'd expect the preconditions to hold), you expect to be able to identify latents within the model and find a map from Λ′ to Λ, such that you can manipulate Λ and use what you learn from those manipulations to precisely predict things about Λ′. More concretely, once you have the map, you can predict the language model's moves by inspecting its internals and translating them into the representation used by an ordinary Chess analysis program; and then, having predicted the moves, you can predict (and perhaps usefully manipulate) the language model's internal representations by mapping from Λ back to Λ′.
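(A hedged sketch of one way such a Λ′ → Λ map might be found in practice: train a linear probe from the model's activations to the explicit board array. The activation and board arrays are assumed to come from whatever interpretability tooling and game dataset you have on hand; nothing here is a specific library's API for doing this.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_board_probes(activations, boards):
    """Learn a map from Λ' to Λ.

    activations: (n_positions, d_model) array of model activations at some layer.
    boards: (n_positions, 64) array of integer piece codes (the explicit Λ side).
    Returns one linear classifier per square.
    """
    probes = []
    for square in range(64):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(activations, boards[:, square])
        probes.append(clf)
    return probes

def decode_board(probes, activation):
    """Map a single activation vector back into the explicit representation Λ."""
    return np.array([p.predict(activation[None, :])[0] for p in probes])
```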
And then the theorems just say under exactly what conditions you'd expect to be able to do this kind of thing, and it turns out those conditions are relatively lax.
Roughly accurate as an example / summary of the kind of thing you expect to be able to do?