Thank you for the insightful post! You mentioned that:
Consider the relation a transformer has to an HMM that produced the data it was trained on. This is general—any dataset consisting of sequences of tokens can be represented as having been generated from an HMM.
and the linear projection consists of:
Linear regression from the residual stream activations (64 dimensional vectors) to the belief distributions (3 dimensional vectors).
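(For reference, the way I picture that projection step is roughly the sketch below; the array names, shapes, and random data are just hypothetical stand-ins for your setup, not your actual pipeline.)

```python
# Minimal sketch of the projection step: ordinary least squares from
# 64-dimensional residual stream activations to 3-dimensional belief states.
# Array names and shapes are hypothetical stand-ins, not the paper's code.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
resid_acts = rng.normal(size=(10_000, 64))        # residual stream activations
beliefs = rng.dirichlet(np.ones(3), size=10_000)  # ground-truth belief distributions

reg = LinearRegression().fit(resid_acts, beliefs)
projected = reg.predict(resid_acts)               # points to plot on the probability simplex
print("R^2 of the linear projection:", reg.score(resid_acts, beliefs))
```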
Given any natural language dataset, if we didn't have the ground-truth belief distribution, would it be possible to reverse engineer (data → model) an HMM and extract the topology of the residual stream activations?
I've been running task-salient representation experiments on larger models and am very interested in replicating, and possibly extending, your result in noisier settings.
If I'm understanding your question correctly, then the answer is yes, though in practice it might be difficult (I'm actually unsure how computationally intensive it would be; I haven't tried anything along these lines yet). This is definitely something to look into in the future!
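Roughly, I'd imagine the pipeline looking something like the sketch below: fit an HMM to the token sequences with Baum-Welch, then recover filtered belief states with the forward recursion, and use those as regression targets in place of the ground-truth beliefs. (This is only a sketch; the 3-state / 2-symbol setup, the toy data, and the use of hmmlearn's CategoricalHMM are my own assumptions, and whether a small HMM is a reasonable model of natural language data is exactly the hard part.)

```python
# Sketch of the data -> HMM -> belief-state direction: fit an HMM to token
# sequences with Baum-Welch, then recover filtered belief states via the
# forward recursion. n_states, n_symbols, and the toy data are assumptions.
import numpy as np
from hmmlearn import hmm

n_states, n_symbols = 3, 2
rng = np.random.default_rng(0)

# Toy stand-in for a tokenized dataset: a list of integer-coded sequences.
seqs = [rng.integers(0, n_symbols, size=200) for _ in range(50)]
X = np.concatenate(seqs).reshape(-1, 1)
lengths = [len(s) for s in seqs]

# Baum-Welch fit of a categorical-emission HMM (MultinomialHMM in older hmmlearn).
model = hmm.CategoricalHMM(n_components=n_states, n_iter=100, random_state=0)
model.fit(X, lengths)

def filtered_beliefs(obs, startprob, transmat, emissionprob):
    """Forward-filtered belief over hidden states after each observed token."""
    beliefs = []
    b = startprob.copy()
    for o in obs:
        b = b * emissionprob[:, o]   # condition on the current token
        b /= b.sum()
        beliefs.append(b.copy())
        b = b @ transmat             # propagate to the next time step
    return np.array(beliefs)

# Belief trajectories for one sequence -- these would stand in for the
# ground-truth beliefs of a known data-generating HMM as regression targets.
B = filtered_beliefs(seqs[0], model.startprob_, model.transmat_, model.emissionprob_)
print(B.shape)  # (200, 3): one belief distribution per token
```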