Does he think this is a good presentation of his proposal?
I’m very glad johnswentworth wrote this, but there are a lot of little details where we seem to disagree—see my other comments in this thread. There are also a few key parts of my proposal not discussed in this post, such as active learning and using an ensemble to fight Goodharting and be more failure-tolerant. I don’t think there’s going to be a single natural abstraction for “human values” like johnswentworth seems to imply with this post, but I also think that’s a solvable problem.
(previous discussion for reference)