Ebenezer Dukakis comments on Prospects for Alignment Automation: Interpretability Case Study

Ebenezer Dukakis 28 Mar 2025 1:48 UTC

I am optimistic that further thinking on automation prospects could identify other automation-tractable areas of alignment and control (e.g. see here for previous work).

This tag might be helpful: https://www.lesswrong.com/w/ai-assisted-alignment

Here’s a recent shortform on the topic: https://www.lesswrong.com/posts/mKgbawbJBxEmQaLSJ/davekasten-s-shortform?commentId=32jReMrHDd5vkDBwt

I wonder whether it would be worth getting an LLM to process the LW archive and tag posts that contain alignment ideas which seem automatable.
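To make the idea a bit more concrete, here is a rough sketch of what such a tagging pass might look like. Everything in it is an assumption for illustration: the `Post` structure, the prompt wording, the tag name, and the `ask_llm` callable (a placeholder for whatever model API one would actually use) are all hypothetical, and the `fake_llm` stand-in is just a keyword match so the example runs without a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Post:
    """Hypothetical minimal representation of an archive post."""
    title: str
    body: str
    tags: List[str] = field(default_factory=list)


# Hypothetical prompt and tag name; both would need real iteration.
PROMPT_TEMPLATE = (
    "Does the following post contain an alignment idea that could plausibly "
    "be automated? Answer YES or NO.\n\nTitle: {title}\n\n{body}"
)
AUTOMATABLE_TAG = "automatable-alignment-idea"


def tag_automatable(posts: Iterable[Post],
                    ask_llm: Callable[[str], str]) -> List[Post]:
    """Ask the model about each post and tag the ones it answers YES for.

    `ask_llm` is any function that takes a prompt string and returns the
    model's text reply; it is an assumed interface, not a real library call.
    """
    tagged = []
    for post in posts:
        prompt = PROMPT_TEMPLATE.format(title=post.title, body=post.body)
        answer = ask_llm(prompt).strip().upper()
        if answer.startswith("YES"):
            if AUTOMATABLE_TAG not in post.tags:
                post.tags.append(AUTOMATABLE_TAG)
            tagged.append(post)
    return tagged


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: a crude keyword match on the post text."""
    post_text = prompt.split("Title:", 1)[-1].lower()
    return "YES" if "automat" in post_text else "NO"


if __name__ == "__main__":
    archive = [
        Post("Automating interp", "We could automate circuit discovery."),
        Post("Meetup notes", "Pizza was good."),
    ]
    for post in tag_automatable(archive, fake_llm):
        print(post.title, post.tags)
```

Passing the model in as a plain callable keeps the loop independent of any particular provider. In practice, a YES/NO answer is probably too coarse, and one would likely want the model to also quote the specific idea it thinks is automatable so a human can spot-check the tags.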