wassname comments on AI Alignment: A Comprehensive Survey

wassname 21 Apr 2024 11:08 UTC
2 points
This is pretty good. It covers a lot, being something of a grab bag. I particularly enjoyed the scalable oversight sections, which succinctly explain debate, recursive reward modelling, etc. There were also some gems I hadn’t encountered before, like the concept of training out agentic behavior by punishing side effects.

If anyone wants the HTML version of the paper, it is here.