Sorry for the late reply: I wrote up an answer but due to a server-side error during submission, I lost it. I shall answer the interpretability question first.
Interpretability didn’t make the list because of the following beliefs of mine:
Interpretability—specifically interpretability-after-training—seems to aim, at the limit, for ontology identification, which is very different from ontological robustness. Ontology identification is useful for specific safety interventions such as scalable oversight, which seems like a viable alignment strategy, but I doubt this strategy scales all the way to ASI. I expect it to break almost immediately once someone begins human-in-the-loop RSI, especially since I expect (at the very least) significant changes to neural network architectures that would result in capability improvements. This is why I predict that investing in interpretability research is not the best idea.
A counterpoint is the notion that we can accelerate alignment with sufficiently capable aligned ‘oracle’ models—and this seems to be OpenAI’s current strategy: build ‘oracle’ models that are aligned enough to accelerate alignment research, and use the resulting better alignment techniques on more capable models. However, since capable enough oracle models can accelerate both capabilities research and alignment research, OpenAI would also choose to accelerate capabilities research alongside its attempt to accelerate alignment research. The question then is whether OpenAI is cautious enough in balancing the two—and recent events have not made me optimistic that this is the case.
Interpretability research does help accelerate some of the alignment agendas I have listed by providing insights that may be broad enough to help; but I expect such insights would probably be found through other approaches too, and the fact that interpretability research either takes time away from more robust alignment plans or leads to capability insights makes me averse to working on it.
Here are a few facets of interpretability research that I am enthusiastic about tracking, but not excited enough to want to work on, as of writing:
Interpretability-during-training would probably be really useful, and I am more optimistic about it than interpretability-after-training. I expect that, at the limit, interpretability-during-training leads to progress towards ensuring ontological robustness of values.
Interpretability (both after-training and during-training) will help with detecting and intervening on inner misalignment. That’s a great benefit that I hadn’t really thought about until I decided to reflect and answer your question.
Interpretability research seems very focused on ‘oracles’—sequence modellers and supervised learning systems—and interpretability research on RL models seems neglected. I would like to see more research done on such models, because RL-style systems seem more likely to lead to RSI and ASI, and the insights we gain might help alignment research in general.
I’m really glad you asked me this question! You’ve helped me elicit (and develop) a more nuanced view on interpretability research.