Planned summary for the Alignment Newsletter:

This post explains a technique for AI alignment that the author dubs “backchaining to local search” (where local search refers to techniques like gradient descent and evolutionary algorithms). The key idea is to take some proposed problem with AI systems and figure out mechanistically how that problem could arise when running a local search algorithm. This can provide evidence about whether we should expect the problem to arise in practice.
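To make the technique concrete, here is a minimal sketch (my own illustration, not from the post) of backchaining a posited failure, reward gaming, to a simple local search; the toy rewards, names, and numbers are all hypothetical assumptions:

```python
# A minimal sketch of "backchaining to local search" on a toy 1-D problem:
# we posit a failure (a policy that games a proxy reward) and check
# mechanistically whether hill climbing (a simple local search) reaches it.
import math
import random

def true_reward(x):
    # What we actually want: x close to 1.0.
    return -(x - 1.0) ** 2

def spiky_proxy(x):
    # Proxy with an isolated high-reward spike at x = 3.0:
    # small mutations from the honest optimum can never reach it.
    return true_reward(x) + (5.0 if abs(x - 3.0) < 0.1 else 0.0)

def smooth_proxy(x):
    # Proxy with a broad bump around x = 3.0: there is a path of
    # small reward-improving mutations leading away from x = 1.0.
    return true_reward(x) + 5.0 * math.exp(-((x - 3.0) ** 2))

def hill_climb(reward, x=0.0, steps=20000, step_size=0.1, seed=0):
    # Local search: keep a random mutation only if the reward improves.
    rng = random.Random(seed)
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        if reward(candidate) > reward(x):
            x = candidate
    return x

for name, proxy in [("spiky", spiky_proxy), ("smooth", smooth_proxy)]:
    x = hill_climb(proxy)
    print(f"{name}: x = {x:.2f}, proxy = {proxy(x):.2f}, true = {true_reward(x):.2f}")
# Expected pattern: with the spiky proxy the search stays stuck near the
# honest optimum (the posited failure never arises), while the smooth proxy
# pulls it toward x near 2.6 (high proxy reward, poor true reward), so the
# failure does arise.
```

The two proxies are chosen to show both possible verdicts: whether the posited failure arises depends on whether local search has a path of reward-improving steps leading to it, which is exactly the mechanistic question the technique asks.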
Planned opinion:
I’m a big fan of this technique: it has helped me notice that many concepts I initially accepted, particularly wireheading and inner alignment, were in fact confused. It’s an instance of the more general technique (which I also like) of taking an abstract argument and making it more concrete and realistic, which often reveals aspects of the argument that you wouldn’t otherwise have noticed.