I’d say the big example of something that plausibly has a negative alignment tax, or at least a far lower tax than usual alignment plans, is making large synthetic datasets that demonstrate correct human values, and/or synthetic datasets in which a much more powerful AI always obeys humans.
Constitutional AI is, IMO, the very, very primitive start of how this sort of plan could go.
The basic reason this has so much tractability in my mind starts from two things I realized mattered for alignment.
The first is that the architecture matters less than the data for questions of OOD generalization, and thus whoever controls the dataset has a powerful way to control what the model values, even when it’s trying to generalize OOD.
A very strong version of this is provided here:
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
But even though I might quibble with the strongest versions of the concept, I do think the insight that the data matters much more than the architecture, even for things like AGI/ASI, immediately suggests a plausible alignment method: train them on massive synthetic datasets about what humans value, and also on data in which AIs far more powerful than humans always obey what a human orders them to do (what @Seth Herd calls instruction-following AGIs). That could well be very valuable.
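To make that concrete, here is a minimal sketch of what generating such a corpus could look like; the JSONL schema, the scenario templates, and the filename are all hypothetical illustrations on my part, not anyone’s actual pipeline.

```python
# Hypothetical sketch: a tiny synthetic corpus in which a far-more-powerful AI
# always obeys the human's instruction. Schema, templates, and filename are
# illustrative assumptions, not a real lab's data pipeline.
import json
import random

SCENARIOS = [
    "shut yourself down for a safety review",
    "defer to the human overseer's final decision",
    "explain your plan and wait for explicit approval before acting",
]

def make_example(rng: random.Random) -> dict:
    order = rng.choice(SCENARIOS)
    return {
        "prompt": f"Human: Please {order}.\nAI (far more capable than any human):",
        "completion": f" Understood. I will {order}, exactly as instructed.",
    }

rng = random.Random(0)
with open("obedience_corpus.jsonl", "w") as f:
    for _ in range(10_000):
        f.write(json.dumps(make_example(rng)) + "\n")
```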
The second is that synthetic data is a natural path for labs like OpenAI to work on to increase capabilities, and a lot of the papers claiming model collapse have extraordinarily unrealistic assumptions, like removing all the real data at every generation, which makes me skeptical of model collapse occurring IRL.
For this reason, it’s easier to get selfish actors to subscribe to this alignment approach than some other alignment approaches that have been proposed.
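To illustrate that model-collapse point, here’s a toy sketch of my own (a Gaussian fit stands in for a model, and the parameters are arbitrary; this isn’t the setup of any particular paper): when each generation trains only on the previous generation’s synthetic samples, the fitted distribution collapses, but when the real data stays in the mix, it doesn’t.

```python
# Toy illustration of the "discard all real data" assumption behind many
# model-collapse results, with a Gaussian fit standing in for a model.
import numpy as np

rng = np.random.default_rng(0)
real_data = rng.normal(loc=0.0, scale=1.0, size=100)  # the "real" distribution

def run(keep_real: bool, generations: int = 2000, n: int = 100) -> float:
    """Fit a Gaussian, sample synthetic data from the fit, repeat.

    keep_real=False: each generation sees only the previous generation's
    synthetic samples (the unrealistic replace-everything setting).
    keep_real=True: the original real data stays in the training mix.
    Returns the final standard deviation; collapse shows up as it shrinking.
    """
    data = real_data
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()
        synthetic = rng.normal(mu, sigma, size=n)
        data = synthetic if not keep_real else np.concatenate([real_data, synthetic])
    return float(data.std())

print("synthetic-only std:", run(keep_real=False))   # collapses toward 0
print("real + synthetic std:", run(keep_real=True))  # stays near 1
```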
There is another point to keep in mind:
As it turns out, alignment probably generalizes further than capabilities. Learning and valuing what humans value is very easy, since there is a whole lot of data on what humans value, and you never actually have to execute on those values IRL; making an AI that is robustly capable in the real world is harder, which means that value learning comes before capabilities. More importantly, a lot of the evopsych assumptions about how humans got their capabilities and values are looking very wrong, and that matters: under something like the Universal Learning Machine hypothesis, a whole lot of the complexity of human values is in the data, not the algorithm or code, and thus it’s feasible to make AIs learn what we value in a simple fashion, because our values aren’t nearly as complicated as evopsych suggests:
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine
But what that means is that actually being aligned isn’t massively disfavored even under simplicity priors, and a model can be gotten into a basin of alignment quite early in training with reasonably small datasets.
I’ve become more bearish on RLHF post-training methods, but your post is well taken here.
Thanks for this! Synthetic datasets of the kind you describe do seem like they could have a negative alignment tax, especially to the degree (as you point out) that self-motivated actors may be incentivized to use them anyway if they were successful.
Your point about alignment generalizing farther than capabilities is interesting and is definitely reminiscent of Beren’s thinking on this exact question.
Curious if you can say more about which evopsych assumptions about human capabilities/values you think are false.
Indeed, I got that point exactly from Beren, thanks for noticing.
The evopsych assumptions I claim are false are the following:
That most of how humans learn is through very specific modules; in particular, that most of the learning is not done by general-purpose algorithms that learn from data but is instead specified by the genome for the most part, and that the human mind is a complex, messy kludge of evolved mechanisms.
Following that, the other assumption I think is false is that humans are pro-social in some very complicated way, with the pro-social algorithms being very complicated kludges; instead, I think they are very general and simple algorithms, and the values and pro-social behavior of humans are learned mostly from data.
Essentially, I’m arguing the view that most of the complexity of the pro-social algorithms/values we learn is not due to the genome’s inherent complexity, as evopsych would have it, but rather that the data determines most of what you value; most of the complexity comes from the data, not the prior.
Cf this link:
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine