An Inside View of AI Alignment
I started to take AI Alignment seriously around early 2020. I’d been interested in AI and machine learning in particular since 2014 or so, taking several online ML courses in high school and implementing some simple models for various projects. I leaned into the same niche in college, taking classes in NLP, Computer Vision, and Deep Learning to learn more of the underlying theory and modern applications of AI, with a continued emphasis on ML. I was very optimistic about AI capabilities then (and still am) and if you’d asked me about AI alignment or safety as late as my sophomore year of college (2018-2019), I probably would have quoted Steven Pinker or Andrew Ng at you.
Somewhere in the process of reading The Sequences, portions of the AI Foom Debate, and texts like Superintelligence and Human Compatible, I changed my mind. Some 80,000 hours podcast episodes were no doubt influential as well, particularly the episodes with Paul Christiano. By late 2020, I probably took AI risk as seriously as I do today, believing it to be one of the world’s most pressing problems (perhaps the most) and was interested in learning more about it. I binged most of the sequences on the Alignment Forum at this point, learning about proposals and concepts like IDA, Debate, Recursive Reward Modeling, Embedded Agency, Attainable Utility Preservation, CIRL etc. Throughout 2021 I continued to keep a finger on the pulse of the field: I got a large amount of value out of the Late 2021 MIRI Conversations in particular, shifting away from a substantial amount of optimism in prosaic alignment methods, slower takeoff speeds, longer timelines, and a generally “Christiano-ish” view of the field and more towards a “Yudkowsky-ish” position.
I had a vague sense that AI safety would eventually be the problem I wanted to work on in my life, but going through the EA Cambridge AGI Safety Fundamentals Course helped make it clear that I could productively contribute to AI safety work right now or in the near future. This sequence is going to be an attempt to explicate my current model or “inside view” of the field. These viewpoints have been developed over several years and are no doubt influenced by my path into and through AI safety research: for example, I tend to take aligning modern ML models extremely seriously, perhaps more seriously than is deserved, because of my greater amount of experience with ML compared to other AI paradigms.
I’m writing with the express goal of having my beliefs critiqued and scrutinized: there’s a lot I don’t know and no doubt a large amount that I’m misunderstanding. I plan on writing on a wide variety of topics: the views of various researchers, my understanding and confidence in specific alignment proposals, timelines, takeoff speeds, the scaling hypothesis, interpretability, etc. I also don’t have a fixed timeline or planned order in which I plan to publish different pieces of the model.
Without further ado, the posts that follow comprise Ansh’s (current) Inside View of AI Alignment.
Welcome to LessWrong, and look forward to seeing your thoughts laid out. :)
I’m excited to read your work! I would also like to post my inside view on LessWrong later, once it is more developed.