we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
Proposals generated by humans might contain honest mistakes, but they're not very likely to be adversarially selected to look secure while actually being insecure. We're implicitly relying on the alignment of the human in our evaluation of human-generated alignment proposals: even if we couldn't tell the difference between safe and unsafe proposals on their merits, knowing that the proposer was honestly trying to produce a safe one gives us some evidence of safety.