Satron comments on AI Control: Improving Safety Despite Intentional Subversion

Satron 3 Jan 2025 10:55 UTC
11 points
6
Possibly one of the most impactful AI control papers of 2023. It went far beyond LessWrong, making into a separate 30-minute video dedicated to (positively) reviewing the proposed solution.

The paper also enjoyed some academic success. As of January 3, 2025, it not only has 23 citations on LessWrong, but also 24 citations on Google Scholar.

This paper strongly updated me towards thinking that AI control is possible, feasible and should be actively implemented to prevent catastrophic outcomes.