It’s really great when alignment work is checked in toy models.
In this case, I was especially intrigued by how it exposed the way different baselines influence behavior in gridworlds, and by how it highlighted the difficulty of moving from a clean conceptual model to an implementation.
Also, the fact that a single randomly generated reward function was sufficient for implementing AUP in SafeLife is quite astonishing. Another advantage of implementing your theorems: you get surprised by reality!
Unfortunately, some parts of the post would be unfit for inclusion in a book: gifs don't work well on paper. Perhaps a workaround can be found (representing motion as arrows, or as squares slightly shifting in color tone)? If the 2020 review ends up online, this is of course not a problem.
The images were central to my understanding of the post, and the formula for $R_{AUP}$ is superb: explaining what the different parts of an equation stand for ought to be standard!
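For readers who haven't seen the post: as I recall, the penalized reward has roughly the following shape (my reconstruction from memory; the exact scaling term in the SafeLife version may differ):

$$R_{AUP}(s, a) \;:=\; R(s, a) \;-\; \frac{\lambda}{|\mathcal{R}_{aux}|} \sum_{R_i \in \mathcal{R}_{aux}} \frac{\left|Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing)\right|}{Q_{R_i}(s, \varnothing)}$$

Here $R$ is the primary task reward, $\varnothing$ is the no-op action, $Q_{R_i}$ measures the agent's ability to accumulate the auxiliary reward $R_i$, and $\lambda$ trades off task performance against the penalty. In the SafeLife experiments, $\mathcal{R}_{aux}$ contains just the single randomly generated reward function mentioned above, so the sum collapses to one term.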
It's important that this post exists, but it is sufficiently technical that I think most people (maybe even alignment researchers? I'm not very sure about this) don't need to read it; including an explanation of $AUP_{conceptual}$ would be far more important (like e.g. the comic parts of Reframing Impact, continuing the tradition of AI alignment comics in the review).