A common response is that “evaluation may be easier than generation”. However, this doesn’t mean evaluation will be easy in absolute terms, or relative to one’s resources for doing it, or that it will depend on the same resources as generation.
I wonder to what degree this is true for the human-generated alignment ideas that are being submitted to LessWrong/the Alignment Forum?
For mathematical proofs, evaluation is (imo) usually easier than generation: a well-written proof can often be evaluated in a single read-through, whereas the person who wrote it typically had to consider and discard many approaches along the way.
To what degree does this also hold for alignment research?
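To make the proof-checking asymmetry concrete, here is a minimal Lean 4 sketch (core library only; the statement just restates the library's own `Nat.succ_add`, and the lemma names used are core lemmas). The kernel evaluates the finished proof in one mechanical pass, but generating it still involves picking an induction scheme and hunting for the right rewrite lemmas, with the dead ends invisible in the final artifact.

```lean
-- Checking this proof is a single mechanical pass for the Lean kernel,
-- but producing it required choosing an induction scheme and finding
-- the right rewrite lemmas (the discarded attempts don't appear here).
theorem succ_add' (m n : Nat) : Nat.succ m + n = Nat.succ (m + n) := by
  induction n with
  | zero => rfl  -- both sides reduce to Nat.succ m by definition of +
  | succ k ih => rw [Nat.add_succ, ih, Nat.add_succ]  -- push succ through, apply the IH
```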
There is an argument that evaluating AI models should be formalised, i.e., turned into verification: see https://arxiv.org/abs/2309.01933 (and the discussion on Twitter with Yudkowsky and Davidad).