I think it’s reasonable to aim for a quantity of human feedback within 2 OOM of RLHF.
Do you mean that on-paper solutions should aim to succeed with no more than 1⁄100 as much human data as RLHF, or no more than 100 times as much? And are you referring to the amount of human data typically used in contemporary implementations of RLHF, or something else? And what makes you think that this is a reasonable target?
Yeah I just meant the upper bound of “within 2 OOM.” :) If we could somehow beat the lower bound and get aligned AI with just a few minutes of human feedback, I’d be all for it.
I think aiming for under a few hundred hours of feedback is a good goal because we want to keep the alignment tax low, and that’s the kind of tax I see as being easily payable. An unstated assumption I made is that I expect we can use unlabeled data to do a lot of the work of alignment, making labeled data somewhat superfluous, but I still think the amount of feedback is important.
As for why I think it’s possible, I can only plead intuition about what I expect from on-the-horizon advances in priors over models of humans, and in the ability to bootstrap models from unlabeled data plus feedback.
I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:
Recent implementations of RLHF have used on the order of thousands of hours of human feedback, so 2 orders of magnitude more than that is much more than a few hundred hours of human feedback.
I think it’s pretty likely that we’ll be able to pay an alignment tax upwards of 1% of total training costs (essentially because people don’t want to die), in which case we could afford to spend significantly more than an additional 2 orders of magnitude on alignment data, if that did in fact turn out to be required.
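To make those orders of magnitude concrete, here is a rough back-of-envelope sketch of both points. The specific figures (5,000 hours of RLHF feedback, $25 per labeler-hour, a $1B training run) are illustrative assumptions, not numbers from this thread; only "thousands of hours" and a ~1% tax come from the discussion above.

```python
# Back-of-envelope arithmetic for the two quantitative points above.
# All specific numbers are illustrative assumptions, chosen only to make
# the orders of magnitude concrete.

rlhf_hours = 5_000       # "on the order of thousands of hours" of feedback
target_hours = 300       # "a few hundred hours"

# Point 1: the "within 2 OOM of RLHF" band spans a factor of 100 each way.
upper = rlhf_hours * 100   # 500,000 hours: far more than a few hundred
lower = rlhf_hours / 100   # 50 hours: far less than a few hundred
print(f"2 OOM above RLHF: {upper:,.0f} h | 2 OOM below: {lower:,.0f} h | "
      f"target: {target_hours} h")

# Point 2: what would an extra 2 OOM of feedback cost as a fraction of a
# training run? (hypothetical cost figures, only to show the shape of the
# alignment-tax argument)
cost_per_hour = 25.0              # assumed cost of one labeler-hour, USD
training_cost = 1_000_000_000.0   # assumed total cost of the run, USD

extra_data_cost = upper * cost_per_hour
tax_fraction = extra_data_cost / training_cost
print(f"An extra 2 OOM of feedback costs ~${extra_data_cost:,.0f}, "
      f"i.e. ~{tax_fraction:.1%} of the assumed run")
```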