Rafael Harth comments on Inner Alignment: Explain like I’m 12 Edition

Rafael Harth 13 Dec 2021 20:09 UTC
LW: 2 AF: 1
Author here. One thing I think I’ve done wrong in the post is to equate black-box-search-in-large-parametrized-space with all of machine learning. I’ve now added this paragraph at the end of chapter 1:

Admittedly, the inner alignment model is not maximally general. In this post, we’ve looked at black box search, where we have a parametrized model and do SGD to update the parameters. This describes most of what Machine Learning is up to in 2020, but it does not describe what the field did pre-2000 and, in the event of a paradigm shift similar to the deep learning revolution, it may not describe what the field looks like in the future. In the context of black box search, inner alignment is a well-defined property and the Venn diagram is a valid way of slicing up the problem, but there are people who expect that AGI will not be built that way.^[1] There are even concrete proposals for safe AI where the concept doesn’t apply. Evan Hubinger has since written a follow-up post about what he calls “training stories”, which is meant to be “a general framework through which we can evaluate any proposal for building safe advanced AI”.

I also converted the post to markdown, mostly for the footnotes (the previous version just had little superscripts written via the math mode).
1. If an AGI does contain more hand-coded parts, the picture gets more complicated. E.g., if a system is logically separated into a bunch of components, inner alignment may apply to some of the components but not others. It may even apply to parts of biological systems; see, e.g., Steven Byrnes’s Inner Alignment in the brain. ↩︎