Thanks for this! Synthetic datasets of the kind you describe do seem like they could have a negative alignment tax, especially to the degree (as you point out) that self-motivated actors may be incentivized to use them anyway if they were successful.
Your point about alignment generalizing farther than capabilities is interesting and is definitely reminiscent of Beren’s thinking on this exact question.
Curious if you can say more about what evopsych assumptions about human capabilities/values you think are false.
Indeed, I got that point exactly from Beren, thanks for noticing.
The evopsych assumptions I claim are false are the following:
That most of how humans learn is through very specific modules specified by the genome, rather than through general-purpose algorithms that learn from data, and that the human mind is a complex, messy kludge of evolved mechanisms.
Following from that, the other assumption I think is false is that human pro-sociality is implemented in a very complicated way, i.e. that the pro-social algorithms you appeal to are complicated kludges. Instead, I think they are very general and simple algorithms, with human values and pro-social tendencies learned mostly from data.
Essentially, I’m arguing that most of the complexity of the pro-social algorithms/values we learn is not due to the genome’s inherent complexity, as evopsych would have it, but rather that the data determines most of what you value: most of the complexity comes from the data, not the prior.
Cf this link:
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine