I am also interested in interpretable ML. I am developing artificial semiosis, a human-like AI training process which can achieve aligned (transparency-based, interpretability-based) cognition. You can find an example of the algorithms I am making here: the AI runs a non-deep-learning algorithm, does some reflection and forms a meaning for someone “saying” something, a meaning different from the usual meaning for humans, but perfectly interpretable.
I therefore support the case for differential technological development:
There are two counter-arguments to this that I’m aware of, that I don’t think in themselves justify not working on this.
Regarding 1, it may take several years for interpretable ML to reach capabilities equivalent to LLMs, but the future may offer surprises, either in terms of coordination to pause the development of “opaque” advanced AI or of deep learning hitting a wall… at killing everyone. Let’s also have a plan for the case where we are still alive.
Regarding 2, interpretable ML would need programmed control mechanisms to be aligned. There is currently no such field of AI safety, as we do not yet have interpretable ML, but I imagine computer engineers being able to make progress on these control mechanisms (more progress than on mechanistic interpretability of LLMs). While it is true that control mechanisms can be disabled, you can always advocate for the highest security (like in Ian Hogarth’s Island idea). So this counterargument can also be rejected.
mishka noted that this paradigm of AI is more foomable. Self-modification is a huge problem. I have an intuition that interpretable ML will exhibit a form of scaffolding, in that control mechanisms for robustness (i.e. for achieving capabilities) can advantageously double as alignment mechanisms. Thanks to interpretable ML, engineers may be able to study self-modification in systems that still have limited capabilities and learn the right constraints.
I like your arguments on AGI timelines, but the last section of your post feels like you are reflecting on something I would call “civilization improvement” rather than on a 20+ year plan for AGI alignment.
I am a bit confused by the way you are conflating “civilization improvement” with a strategy for alignment (when you discuss enhanced humans solving alignment, or discuss empathy in communicating a message “If you and people you know succeed at what you’re trying to do, everyone will die”). Yes, given longer timelines, civilization improvement can play a big role in reducing existential risk including AGI x-risk, but I would prefer to sell the broad merits of interventions on their own, rather than squeeze them into a strategy for alignment from today’s limited viewpoint. When making a multi-decade plan for civilization improvement, I think it is also important to consider the possibility of AGI-driven “civilization improvement”, i.e. interventions will not only influence AGI development, but they may also be critically influenced by it.
Finally, when considering strategy for alignment under longer timelines, people can have useful non-standard insights, see for example this discussion on AGI paradigms and this post on agent foundations research.