More generally: natural latents are typically only of interest “up to isomorphism”—anything which represents exactly the same information is effectively the same latent.
Note that from an alignment perspective, this is potentially a problem. Even if “good” is a natural abstraction, “seems superficially good but is actually bad” is also a natural abstraction, so if you want to identify the “good” abstraction within a learned model, you have your work cut out separating it from f(good, seems superficially good but is actually bad), for arbitrary choices of f.
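To make the identification problem concrete, here is a toy sketch (mine, not from the post; the setup and variable names are made up): any invertible re-encoding of the pair (good, seems-superficially-good-but-actually-bad) carries exactly the same information about the observations, so information content alone cannot single out the “good” coordinate.

```python
# Toy illustration (hypothetical setup) of "latents are only identified up to isomorphism":
# an invertible mixing f of two latents has exactly the same mutual information with the
# observations as the original pair, so the "good" coordinate cannot be picked out by
# information content alone.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

good = rng.integers(0, 2, n)    # latent 1: "good"
seems = rng.integers(0, 2, n)   # latent 2: "seems superficially good but is actually bad"
x = (2 * good + seems + rng.integers(0, 2, n)) % 4   # noisy observation depending on both

def mutual_information(a, b):
    """Plug-in estimate of I(A; B) in bits from paired samples of discrete variables."""
    n_samples = len(a)
    joint, pa, pb = {}, {}, {}
    for pair in zip(a, b):
        joint[pair] = joint.get(pair, 0) + 1
    for (va, vb), c in joint.items():
        pa[va] = pa.get(va, 0) + c
        pb[vb] = pb.get(vb, 0) + c
    mi = 0.0
    for (va, vb), c in joint.items():
        p_ab = c / n_samples
        mi += p_ab * np.log2(p_ab / ((pa[va] / n_samples) * (pb[vb] / n_samples)))
    return mi

z = 2 * good + seems                  # the original pair, encoded as one discrete variable
z_mixed = 2 * (good ^ seems) + seems  # an arbitrary invertible mixing f(good, seems)

print(f"I(X; (good, seems))  = {mutual_information(x, z):.4f} bits")
print(f"I(X; f(good, seems)) = {mutual_information(x, z_mixed):.4f} bits")  # identical
print(f"I(X; good alone)     = {mutual_information(x, good):.4f} bits")     # strictly less
```

(This is just the information-theoretic version of the point; it says nothing about which parametrization a trained network actually uses.)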
I think different ad-hoc alignment ideas can be fruitfully understood in terms of the different assumptions they make in order to solve this problem. For instance, my guess is that weak-to-strong generalization works insofar as “good” is represented at much greater scale than “seems superficially good but is actually bad” (because the gradient update for a weight is proportional to the activation of the neuron upstream of that weight).
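For what it’s worth, the parenthetical claim is easy to verify in the simplest case: for a linear layer with squared-error loss, the gradient on a weight is (output error) × (upstream activation), so features represented at larger scale receive proportionally larger updates. A minimal sketch (mine; it just checks the analytic formula with autograd):

```python
# Minimal check (hypothetical example) that the gradient on a weight is proportional to
# the upstream activation: for y = w . x with squared-error loss, dL/dw_i = (y - t) * x_i.
import torch

w = torch.zeros(2, requires_grad=True)
x = torch.tensor([10.0, 0.1])   # two upstream "features": one at large scale, one at small
target = torch.tensor(1.0)

y = torch.dot(w, x)
loss = 0.5 * (y - target) ** 2
loss.backward()

print("autograd gradient on w:", w.grad)                       # [-10.0, -0.1]
print("analytic (y - t) * x:  ", ((y - target) * x).detach())  # same values
```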
Under this assumption, for e.g. shard theory to succeed, I guess they would have to build models of the magnitudes of different internal representations. My first guess would be that the magnitude grows with the number of training samples that need the natural abstraction. (Which in turn makes it a major flaw that the strong network was trained only on labels and not on data generated by the weak network. Real foundation models are trained on a mixture of human and non-human data, rather than purely on non-human data; analogously, the strong network should be trained on a mixture of data generated by the weak network and data generated by humans.)
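Here’s a hypothetical toy version of that first guess (mine, not something tested on real networks): two features have the same true coefficient, but one is needed by far more training samples, so under a fixed SGD budget its learned weight, a crude stand-in for the magnitude of the corresponding internal representation, ends up much larger.

```python
# Toy model (hypothetical) of "magnitude grows with the number of training samples that
# need the abstraction": two features with the same true coefficient of 1, but feature 0
# appears in ~90% of samples and feature 1 in ~10%. Under a fixed SGD budget, feature 0's
# learned weight ends up much closer to 1.
import numpy as np

rng = np.random.default_rng(0)
n_steps, lr = 500, 0.01
w = np.zeros(2)

for _ in range(n_steps):
    x = np.array([rng.random() < 0.9, rng.random() < 0.1], dtype=float)  # feature presence
    y_true = x.sum()                   # both features contribute equally when present
    y_pred = w @ x
    w -= lr * (y_pred - y_true) * x    # plain SGD on squared error

print("learned weights (frequent, rare):", w.round(3))
# The frequent feature's weight is near its true value of 1; the rare feature's lags far behind.
```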
I suspect this problem will be easily solvable for some concepts (e.g. “diamond”) and much harder to solve for others (e.g. “human values”), but I don’t think we yet know how far the easily-solvable cases extend.