Post summary (feel free to suggest edits!): Last year, the author wrote up a plan they gave a “better than 50⁄50 chance” of working before AGI kills us all. This predicted that in 4-5 years, the alignment field would progress from preparadigmatic (unsure of the right questions or tools) to having a general roadmap and toolset.
They believe this is on track and give a 40% likelihood that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets, with interpretability work on the experimental side alongside theoretical work. This could lead to identifying which potential alignment targets (like human values, corrigibility, Do What I Mean, etc.) are likely to be naturally expressible in the internal language of neural nets, and how to express them. They think we should then focus on those.
In their personal work, they’ve found theory work faster than expected, and crossing the theory-practice gap mildly slower than expected. In 2022 most of their time went into theory work (like the Basic Foundations sequence), workshops and conferences, training others, and writing up intro-level arguments on alignment strategies.
(If you’d like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)
You might mention that the prediction about the next 1-2 years is only at 40% confidence, and the “8-year” part is an inside view whose corresponding outside view estimate is more like 10-15 years.
Cheers, edited :)