This is an excellent description of my primary work, for example:
Internal independent review for language model agent alignment
That post proposes calling new instances or different models to review plans and internal dialogue for alignment, and it also discusses the several layers of safety scaffolding that have been proposed elsewhere.
This post is amazingly useful. Integrative/overview work is often thankless, but I think it’s invaluable for understanding where effort is going, and for thinking about gaps where more of it should be devoted. So thank you, thank you!