This is an excellent description of my primary work, for example:
Internal independent review for language model agent alignment
That post proposes calling new instances or different models to review plans and internal dialogue for alignment, and it also discusses the several layers of safety scaffolding that have been proposed elsewhere.
This post is amazingly useful. Integrative/overview work is often thankless, but I think it’s invaluable for understanding where effort is going, and for thinking about gaps where more of it should be devoted. So thank you, thank you!