Yes, I expect us to need some trusted data from humans. The cleverer we are, the less we need. I think it’s reasonable to aim for quantity within 2 OOM of RLHF.
But… no, outer alignment is not a data quality problem, any more than outer alignment is a cosmic ray problem because if only the right cosmic rays hit my processor, it would be outer aligned.
You’re probably not the right target for this rant, but I typed it so oh well, sorry.
Yes, you could “just” obtain perfect labeled data about human actions, perfectly on-distribution, until a large NN converges, and get something that’s as aligned as a human on-distribution. But that’s not a real solution. Real solutions use obtainable amounts of data with obtainable quality, which requires being clever, which means doing all that thinking about outer alignment that isn’t just about data quality. Also, real solutions integrate work on on- and off-distribution alignment. You can’t just build something that generalizes poorly and then bolt generalization capabilities onto it afterward; you need to do outer alignment that includes desiderata for generalization properties.
I think that data quality is a helpful framing of outer alignment for a few reasons:
Under the assumption of a generic objective such as reinforcement learning, outer alignment is definitionally equivalent to having high enough data quality. (More precisely, if the objective is generic enough that it is possible for it to produce an aligned policy, then outer alignment is equivalent to the data distribution being such that an aligned policy is preferred to any unaligned policy; see the formalization sketched after this list.)
If we had the perfect alignment solution on paper, we would still need to implement it. Since we don’t yet have the perfect alignment solution on paper, we should entertain the possibility that implementing it involves paying attention to data quality (whether in the sense of scalable oversight or in a more mundane sense).
It’s not a framing I’ve seen before, and I think it’s helpful to have different framings for things.
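To pin down the equivalence in the first reason above, here is a rough formalization; the notation (J_D for the objective’s expected value under data distribution D, A for the set of aligned policies) is illustrative rather than taken from the post.

```latex
% Rough formalization of "outer alignment = high enough data quality"
% under a generic objective. Notation is illustrative, not from the post.
% J_D(pi) : expected value of the generic objective (e.g. expected reward)
%           for policy pi, with data/labels drawn from distribution D.
% A       : the set of aligned policies, assumed nonempty because the
%           objective is "generic enough" to admit an aligned policy.
\[
  \text{outer alignment holds for } \mathcal{D}
  \;\Longleftrightarrow\;
  \exists\, \pi^{*} \in A \;\; \forall\, \pi \notin A :\;
  J_{\mathcal{D}}(\pi^{*}) \;>\; J_{\mathcal{D}}(\pi)
\]
```

On this reading, improving data quality just means choosing the distribution D (labels, comparisons, coverage) so that the inequality holds.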
I do think that the framing is less helpful if the answer to my question is “not much”, but that’s currently still unclear to me, for the reasons I give in the post.
I agree that data quality doesn’t guarantee robustness, but that’s a general argument about how helpful it is to decompose alignment into outer alignment and robustness. I have some sympathy for that, but it seems distinct from the question of whether data quality is a helpful framing of outer alignment.
I think my big disagreement is with point one—yes, if you fix the architecture as something with bad alignment properties, then there is probably some dataset / reward signal that still gives you a good outcome. But this doesn’t work in real life, and it’s not something I see people working on such that there needs to be a word for it.
What deserves a word is people starting by thinking about both what we want the AI to learn and how, and picking datasets and architectures in tandem based on a theoretical story of how the AI is going to learn what we want it to.
A number of reasonable outer alignment proposals such as iterated amplification, recursive reward modeling and debate use generic objectives such as reinforcement learning (and indeed, none of them would work in practice without sufficiently high data quality), so it seems strange to me to dismiss these objectives.
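For concreteness, here is a minimal toy sketch (not code from any of these proposals) of where human data quality enters such a generic objective: in RLHF-style reward modeling, a reward model is fit to pairwise human comparisons, so mislabeled comparisons directly reshape the objective the policy is then trained against.

```python
# Toy illustration of the RLHF reward-modeling step, where human data
# quality enters a "generic" objective. Numbers are made up; this is not
# code from iterated amplification, recursive reward modeling, or debate.
import numpy as np

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Mean Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    return float(np.mean(np.log1p(np.exp(-(r_chosen - r_rejected)))))

# Reward-model scores for four human-labeled comparisons.
r_chosen = np.array([1.2, 0.7, 2.0, 0.1])
r_rejected = np.array([0.3, 0.9, 1.1, -0.5])
print(preference_loss(r_chosen, r_rejected))   # loss with the given labels

# Flipping the labels (i.e. lowering data quality) swaps the arguments, so
# the fitted reward is pushed toward preferring the wrong behavior.
print(preference_loss(r_rejected, r_chosen))
```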
I think it’s reasonable to aim for quantity within 2 OOM of RLHF.
Do you mean that on-paper solutions should aim to succeed with no more than 1⁄100 as much human data as RLHF, or no more than 100 times as much? And are you referring to the amount of human data typically used in contemporary implementations of RLHF, or something else? And what makes you think that this is a reasonable target?
Yeah I just meant the upper bound of “within 2 OOM.” :) If we could somehow beat the lower bound and get aligned AI with just a few minutes of human feedback, I’d be all for it.
I think aiming for under a few hundred hours of feedback is a good goal because we want to keep the alignment tax low, and that’s the kind of tax I see as being easily payable. An unstated assumption I made is that we can use unlabeled data to do a lot of the work of alignment, making labeled data somewhat superfluous; even so, I still think the amount of feedback is important.
As for why I think it’s possible, I can only plead intuition about what I expect from on-the-horizon advances in priors over models of humans, and ability to bootstrap models from unlabeled data plus feedback.
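As a back-of-the-envelope illustration of “easily payable” (every dollar figure below is an illustrative assumption, not a number from this thread):

```python
# Rough cost of "a few hundred hours" of human feedback.
# Every figure here is an illustrative assumption, not from the thread.
feedback_hours = 300              # "a few hundred hours"
cost_per_hour = 100               # assumed fully-loaded labeler cost, $/hour
training_run_cost = 10_000_000    # assumed cost of a present-day large run, $

feedback_cost = feedback_hours * cost_per_hour
print(f"feedback cost: ${feedback_cost:,}")                                   # $30,000
print(f"fraction of training cost: {feedback_cost / training_run_cost:.2%}")  # 0.30%
```

Under these assumptions the data collection itself is a small fraction of training cost; the real tax would also include any capability hit and engineering effort, which this ignores.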
I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:
Recent implementations of RLHF have used on the order of thousands of hours of human feedback, so 2 orders of magnitude more than that is much more than a few hundred hours of human feedback.
I think it’s pretty likely that we’ll be able to pay an alignment tax upwards of 1% of total training costs (essentially because people don’t want to die), in which case we could afford to spend significantly more than an additional 2 orders of magnitude on alignment data, if that did in fact turn out to be required.
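Putting illustrative numbers on both points (only “thousands of hours” comes from the comment above; the other figures are assumptions, including a hypothetical much larger future training run):

```python
# Back-of-the-envelope sketch of the two quantitative points above.
# Only "thousands of hours" is from the thread; the rest are assumptions.
rlhf_hours = 5_000                    # "on the order of thousands of hours"
two_oom_more = rlhf_hours * 100       # upper end of "within 2 OOM of RLHF"
print(f"{two_oom_more:,} hours")      # 500,000 hours >> "a few hundred hours"

# What a 1% alignment tax could buy, assuming a hypothetical $10B training
# run and the same assumed $100/hour labeler cost as in the sketch above.
tax_budget = 0.01 * 10_000_000_000
affordable_hours = tax_budget / 100
print(f"{affordable_hours:,.0f} hours")   # 1,000,000 hours, ~200x rlhf_hours
```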