we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
Proposals generated by humans might contain honest mistakes, but they're not very likely to be adversarially selected to look secure while actually being insecure. We're implicitly relying on the alignment of the human in our evaluation of human-generated alignment proposals: even if we couldn't tell the difference between safe and unsafe proposals on their merits, knowing that the proposer was honestly trying to produce a safe one gives us some evidence of safety.