> So I’m pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass.
This is an excellent reply, thank you!
I think I broadly agree with your points. I’m more imagining “similarity to humans” to mean “is well-described by shard theory; e.g., its later-network steering circuits are contextually activated based on a compositionally represented activation context.” This would align with the greater activation-vector steerability observed partway through language models (not the only source I have for that).
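For concreteness, here’s a minimal sketch of the kind of mid-network activation steering I have in mind, assuming GPT-2 via HuggingFace transformers. The layer index, prompt pair, and scaling coefficient are illustrative choices on my part, not the exact setup from the steering results I’m referencing.

```python
# Minimal sketch of mid-network activation steering: add a steering vector
# (a difference of activations on a contrasting prompt pair) to the residual
# stream at a middle block during generation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6  # a middle block; steerability reportedly peaks partway through the network

def resid_after_block(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER, at the last token position."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1]  # hidden_states[i + 1] is the output of block i

# Steering vector: difference of activations on a contrasting prompt pair.
steer = resid_after_block("Love") - resid_after_block("Hate")

def add_steering(module, inputs, output):
    # GPT2Block returns a tuple whose first element is the hidden states;
    # add the scaled steering vector to the residual stream and pass it on.
    hidden = output[0] + 4.0 * steer  # the scale 4.0 is a free parameter
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt_ids = tok("I think you are", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(prompt_ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```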
However, “interpreting GPT: the logit lens” and e.g. DoLA suggest that predictions are iteratively refined throughout the forward pass, whereas shard theory (and inner-optimizer threat models) would presumably predict that the most sophisticated steering happens later in the network.
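And here’s a minimal logit-lens-style sketch (again assuming GPT-2 via HuggingFace transformers): decoding each layer’s residual stream through the final layer norm and unembedding is the sense in which the prediction looks iteratively refined across the forward pass.

```python
# Minimal logit-lens sketch: project every layer's residual stream through
# ln_f and the unembedding to see the intermediate next-token "guesses".
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The Eiffel Tower is in the city of", return_tensors="pt").input_ids
with torch.no_grad():
    hidden_states = model(ids, output_hidden_states=True).hidden_states

# hidden_states[0] is the embedding output; hidden_states[i] is the residual
# stream after block i.
for i, h in enumerate(hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    top_token = tok.decode([logits.argmax().item()])
    print(f"layer {i:2d}: predicted next token = {top_token!r}")
```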