Miko Planas
I’m quite interested in further understanding how the naming of these features will scale as we increase the number of layers in the transformer model and the size of the sparse autoencoder. It sounds like the search space for the large model doing the autointerpretation will become massive. Intuitively, might this affect the reliability of the short descriptions generated by the autointerpretability model? Additionally, if the model being analyzed (as well as the sparse autoencoder) is sufficiently larger than the autointerpreting model, how will that affect the reliability of the generated descriptions?
I am interested in answering these questions because I am trying to understand the utility of learning a massive dictionary, as well as the feasibility of autogenerating descriptions for the features (an engineering problem).
I primarily see mechanistic interpretability as a potential path towards understanding how models develop capabilities and processes, especially those that may represent misalignment. Hence, I view it as a means to monitor and align systems, not so much to directly improve them (unless, of course, we are able to include interpretability in the training loop).
How ambitious would it be to primarily focus on interpretability as an independent researcher (or as an employee/research engineer)?
If I’ve inferred correctly, one of this article’s goals is to increase the number of contributors in the space. I generally agree with how impactful interpretability can be, but I am a little more risk-averse when it comes to making it my career path.
For context, I have just graduated and I have a decent amount of experience in Python and other technologies. With those skills, I was hoping to immediately tackle several pieces of low-hanging fruit in interpretability, yet I am not 100% convinced about giving my all to this, since I worry about my job security and chances of success.
1.) The space is both niche and research-oriented, which might not help me land future technical roles or get into higher education.
2.) I’ve anecdotally observed that most entry-level roles and fellowships in the space look for prior engineering work experience that fresh graduates rarely have. It might be hard to contribute and sustain myself from the get-go.
Does this mean I should be doing heavy engineering work before entering the interpretability space? Would it be possible for me to do both without sacrificing quality? If I do give 100% of my time to interpretability, would I still have other engineering job options if the space does not progress?
I am highly interested in knowing your thoughts on how younger devs/researchers can contribute without having to worry about job security, getting paid, or lacking future engineering skills (e.g. deployment, dev work outside notebook environments, etc.).
In any given field, the relative contributions of people who do and don’t know what’s going on will depend on (1) how hard it is to build some initial general models of what’s going on, (2) the abundance of “low-hanging fruit”, and (3) the quality of feedback loops, so people can tell when someone’s random stumbling has actually found something useful
Reading this, I instantly thought of high-impact complex problems with a low tolerance for failure, which, according to the Cynefin framework, are best approached by initial probing and sensing. By definition, such environments/problems are not easily decomposed (and modeled), and are often characterized by emergent practices derived from experimentation. In the specific case where a problem is highly impactful but does not allow for multiple failures, impact-oriented people are incentivized to work on it, but iteration is not possible.
In that situation, how can impact-oriented people contribute, and at the same time combat impostor syndrome, if they cannot tangibly make themselves more experienced with the problem itself? It seems that people won’t be able to tell whether they actually know what they are doing. Would it be best to LARP instead? Or is it possible for these people to gain experience on parallel problems to combat impostor syndrome?
I’m quite curious about the possibility of frontier model training costs dropping as a result of technological advancements in hardware. If that is possible, how long might it take for such advancements to be adopted by large-scale AI labs?
For future posts, I’d want to see more of the specifics of ML GPUs, and of the rising alternatives (e.g. companies working on hardware, research, lab partnerships, etc.) that might make it faster and cheaper to train large models.
Are there alternatives to limiting the public distribution/release of model weights?
Would it be possible to use other technologies and/or techniques (e.g. federated learning, smart contracts, cryptography, etc.) for interpretability researchers to still have access to weights with the presence of bad actors in mind?
I understand that the benefits of limiting outweigh the costs, but I still think leaving the capacity to perform safety research to centralized private orgs could result in less robust capabilities and/or alignment methods. In addition, I believe it is unlikely that releases of model weights will stop, given economic pressures (I need to research this more).
I’m curious to hear about creative, subversive alternatives/theories that do not require limiting the distribution.
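As one toy illustration of the cryptographic direction mentioned above (not a proposal for real weight custody), additive secret sharing, the primitive underlying secure multi-party computation, lets several parties jointly hold a value such that no subset smaller than all of them learns anything about it. The field size and the idea of quantizing a weight to an integer are my own illustrative assumptions:

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime, standing in for the field modulus

def share(secret, n_parties, prime=PRIME):
    """Split a value into n shares that sum to the secret mod p.
    Any n-1 shares together reveal nothing about the secret."""
    shares = [random.randrange(prime) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % prime)
    return shares

def reconstruct(shares, prime=PRIME):
    """Recombine all shares to recover the original value."""
    return sum(shares) % prime

weight = 123456789  # stand-in for a quantized model weight
shares = share(weight, n_parties=3)
assert reconstruct(shares) == weight
```

Whether anything like this scales to billions of floating-point weights, or survives a colluding majority of parties, is exactly the kind of open question I have in mind; actual MPC inference over frontier-scale models remains far from practical as far as I know.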