The fact that latents are often related to their neighbors definitely seems to support your thesis, but it’s not clear to me that you couldn’t train a smaller, somewhat-lossy meta-SAE even on an idealized SAE, so long as the data distribution had rare events or rare properties you could throw away cheaply.
You could also play a similar game showing that latents in a larger SAE are “merely” compositions of latents in a smaller SAE.
So basically, I was left wanting a more mathematical perspective of what kinds of properties you’re hoping for SAEs (or meta-SAEs) and their latents to have.
It would be interesting to meditate on the question “What kind of training procedure could you use to get a meta-SAE directly?” And I think answering this relies in part on a mathematical specification of what you want.
When you showed the decomposition of ‘einstein’, I also kinda wanted to see what the closest latents were in the object-level SAE to the components of ‘einstein’ in the meta-SAE.
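A minimal sketch of the setup under discussion, assuming the meta-SAE is trained on the decoder directions of an object-level SAE (all dimensions, names, and hyperparameters here are hypothetical, not the post’s actual configuration):

```python
import torch
import torch.nn as nn

class MetaSAE(nn.Module):
    """Small sparse autoencoder over another SAE's decoder directions."""
    def __init__(self, d_model: int, n_meta: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_meta)
        self.dec = nn.Linear(n_meta, d_model, bias=False)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # sparse meta-latent codes
        return self.dec(z), z

# Hypothetical object-level SAE with 1000 latents in a 64-dim space:
# each row of W_dec is one latent's (unit-normalized) decoder direction.
torch.manual_seed(0)
W_dec = torch.randn(1000, 64)
W_dec = W_dec / W_dec.norm(dim=1, keepdim=True)

meta = MetaSAE(d_model=64, n_meta=50)
opt = torch.optim.Adam(meta.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity penalty, as in a standard SAE loss

for _ in range(100):
    recon, z = meta(W_dec)
    # Reconstruction loss on decoder directions plus L1 sparsity on codes.
    loss = (recon - W_dec).pow(2).mean() + l1_coeff * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Whether the resulting meta-latents are meaningful (rather than cheap lossy compression of rare structure, as the comment worries) is exactly what would need a mathematical criterion.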
It would be interesting to meditate on the question “What kind of training procedure could you use to get a meta-SAE directly?” And I think answering this relies in part on a mathematical specification of what you want.
At Apollo we’re currently working on something that we think will achieve this. Hopefully will have an idea and a few early results (toy models only) to share soon.
but it’s not clear to me that you couldn’t train a smaller, somewhat-lossy meta-SAE even on an idealized SAE, so long as the data distribution had rare events or rare properties you could throw away cheaply.
IMO an “idealized” SAE just has no structure relating features, so there’s nothing for a meta-SAE to find. I’m not sure this is possible or desirable, to be clear! But I think that’s what idealized units of analysis should look like.
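One way to operationalise “no structure relating features”: an idealized SAE’s decoder directions would be near-orthogonal, so neighbor similarities carry no signal for a meta-SAE to compress. A rough illustration, using random high-dimensional directions as a stand-in for such an idealized dictionary (the function name and sizes are mine, purely for demonstration):

```python
import torch

def max_neighbor_cosine(W_dec: torch.Tensor) -> float:
    """Max off-diagonal cosine similarity between decoder directions."""
    W = W_dec / W_dec.norm(dim=1, keepdim=True)
    sims = W @ W.T
    sims.fill_diagonal_(-1.0)  # exclude each direction's self-similarity
    return sims.max().item()

# Random directions in high dimensions are nearly orthogonal, so the
# largest neighbor similarity stays small relative to 1.0.
torch.manual_seed(0)
idealized = torch.randn(500, 1024)
print(max_neighbor_cosine(idealized))  # small relative to 1.0
```

A trained SAE whose latents were this unrelated would leave a meta-SAE with essentially nothing to decompose.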
You could also play a similar game showing that latents in a larger SAE are “merely” compositions of latents in a smaller SAE.
I agree; we do this briefly later in the post, I believe. I see our contribution more as showing that this kind of thing is possible than that meta-SAEs are objectively the best tool for it.