This strikes me as a really interesting and innovative post, proposing a framework for systematically categorizing existing alignment proposals as well as helping to generate new ones.
I’m kind of surprised that this post is almost 2 years old and yet only has one pingback and a few comments.
Is there some other framework which has superseded this one, or did people just forget about it, or is there simply not much comparative alignment work going on?
One other framework I’ve seen that is somewhat like this is “Training stories” from Evan Hubinger’s “How do we become confident in the safety of a machine learning system?” But that is more about evaluating alignment proposals (i.e. the very last part of the present post) rather than categorizing alignment proposals along a consistent set of dimensions, which is the main focus here. So it actually serves a different purpose and isn’t much like this framework after all.