I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:
Recent implementations of RLHF have used on the order of thousands of hours of human feedback, so two orders of magnitude more than that would be hundreds of thousands of hours, which is much more than a few hundred hours of human feedback.
I think it’s pretty likely that we’ll be able to pay an alignment tax upwards of 1% of total training costs (essentially because people don’t want to die), in which case we could afford to spend significantly more than an additional two orders of magnitude on alignment data, if that did in fact turn out to be required.
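For concreteness, here is a minimal back-of-envelope sketch of that arithmetic. The feedback-hour scale is the one mentioned above; the assumed cost per hour of human feedback (and therefore the implied training budget at which this becomes a 1% tax) is an illustrative placeholder, not a figure from this discussion.

```python
# Back-of-envelope sketch of the comment's arithmetic, with placeholder numbers.
# The feedback-hour scale comes from the comment ("on the order of thousands of
# hours"); the cost per hour of human feedback is an illustrative assumption.

rlhf_feedback_hours = 5_000       # current RLHF-scale human feedback (order of magnitude)
scale_up = 100                    # an additional 2 orders of magnitude
cost_per_feedback_hour = 50.0     # assumed fully loaded USD cost per hour of feedback

scaled_hours = rlhf_feedback_hours * scale_up
scaled_cost = scaled_hours * cost_per_feedback_hour

# Training budget at which that much feedback would amount to a 1% alignment tax.
budget_for_1_percent_tax = scaled_cost / 0.01

print(f"Scaled-up feedback: {scaled_hours:,} hours")                          # 500,000 hours
print(f"Cost at ${cost_per_feedback_hour:.0f}/hour: ${scaled_cost:,.0f}")     # $25,000,000
print(f"Budget where this is a 1% tax: ${budget_for_1_percent_tax:,.0f}")     # $2,500,000,000
```

Plugging in your own cost-per-hour and training-budget estimates changes the exact fraction, but the qualitative point stands: a willingness to pay a few percent of training costs buys a lot of headroom on alignment data.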