This post introduces Timaeus’ “Developmental Interpretability” (DevInterp) research agenda, which is IMO one of the most interesting extant AI alignment research agendas.
DevInterp is interesting because it is one of the few AI alignment research agendas that try to understand deep learning “head on”, while wielding a powerful mathematical tool that seems suited to the task (namely, Singular Learning Theory). Relatedly, it is one of the few agendas that maintain a strong balance of theoretical and empirical research. As such, it might also grow into a bridge between theoretical and empirical research agendas more broadly (e.g. it might be synergistic with the Learning-Theoretic Agenda (LTA)).
I also want to point out a few potential weaknesses or (minor) reservations I have:
First, DevInterp makes phase transitions its central object of study. While I agree that phase transitions seem interesting, and possibly crucial to understand, I’m not convinced that a broader view wouldn’t be better.
Singular Learning Theory (SLT) has the potential to explain generalization in deep learning, phase transitions or no (I spell out the asymptotic result I have in mind below). This in itself seems important enough to deserve center stage. Understanding generalization is crucial, because:
We want our alignment protocols to generalize correctly, given the available data, compute, and other circumstances, and we need to understand which conditions would guarantee this (or at least rule out catastrophic generalization failures).
If the resulting theory of generalization is in some sense universal, then it might be applicable to specifying a procedure for inferring human values (as human behavior is generated from human values by a learning algorithm with similar generalization properties), or at least to formalizing “human values” well enough for theoretical analysis of alignment.
Hence, compared to the OP, I would put more emphasis on these latter points.
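To make the generalization claim concrete (this is my gloss and my notation, not anything stated in the OP): the SLT result I have in mind is Watanabe’s free energy asymptotics, in which the (real) log canonical threshold $\lambda$ plays the role that half the parameter count plays in regular models:

$$F_n \;=\; n L_n(w_0) \;+\; \lambda \log n \;+\; O_p(\log\log n), \qquad \mathbb{E}[G_n] \;\approx\; \frac{\lambda}{n},$$

where, roughly, $F_n$ is the Bayesian free energy (negative log marginal likelihood), $L_n$ the empirical loss, $w_0$ an optimal parameter, and $G_n$ the Bayes generalization error. For regular models $\lambda = d/2$, but for singular models (which include neural networks) $\lambda$ can be much smaller, which is exactly why SLT is a serious candidate for explaining generalization despite overparameterization.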
Second, the OP does mention the difference between phase transitions during Stochastic Gradient Descent (SGD) and the phase transitions of Singular Learning Theory, but this deserves a closer look. SLT has IMO two key missing pieces:
The first piece is the relation between ideal Bayesian inference (the subject of SLT) and SGD. Ideal Bayesian inference is computationally intractable in general. Maybe there is an extension of SLT that replaces Bayesian inference with either SGD or a different tractable algorithm, for example some Markov chain Monte Carlo (MCMC) method that converges to Bayesian inference in the limit (I include a minimal sketch of such a method below). Maybe there is a natural geometric invariant that controls the MCMC relaxation time, similarly to how the log canonical threshold controls sample complexity.
The second missing piece is understanding the special properties of ANN architectures compared to arbitrary singular hypothesis classes. For example, maybe there is some universality property which explains why e.g. transformers (or something similar) are qualitatively “as good as it gets”. Alternatively, it could be a relation between the log canonical threshold of specific ANN architectures and other simplicity measures which can be justified on other philosophical grounds.
That said, if the above missing pieces were found, SLT would straightforwardly become the theory for understanding deep learning, and maybe learning in general.
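To illustrate what I mean by a tractable algorithm that converges to Bayesian inference: stochastic gradient Langevin dynamics (SGLD, Welling & Teh 2011) is the standard example, and I believe a variant of it is also what is used in practice to estimate learning coefficients. The sketch below is mine, not from the OP; the grad_log_prior and grad_log_lik helpers are hypothetical placeholders for whatever model is being studied:

```python
import numpy as np

def sgld_step(w, grad_log_prior, grad_log_lik, minibatch, n_total, eps, rng):
    """One step of stochastic gradient Langevin dynamics (Welling & Teh, 2011).

    With a decreasing step-size schedule, the iterates approach samples from
    the Bayesian posterior, so SGLD is "SGD plus the right amount of noise"
    turned into an MCMC method.
    """
    # Unbiased minibatch estimate of the gradient of the log-posterior.
    grad_lik = sum(grad_log_lik(w, x) for x in minibatch)
    drift = grad_log_prior(w) + (n_total / len(minibatch)) * grad_lik
    # Gradient half-step plus Gaussian noise with variance eps.
    noise = rng.normal(size=w.shape) * np.sqrt(eps)
    return w + 0.5 * eps * drift + noise
```

The point of the example is only that the gap between SGD and Bayesian inference may be narrower than it first looks: injecting noise of the right scale turns SGD into a posterior sampler, and the open question is which geometric invariant controls how long such a sampler takes to mix.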