I almost totally agree with this post. This comment is just nitpicking and speculation.
Evolution has another advantage, one that is related to “getting a lot of tries” but also importantly different.
It’s not just that evolution got to tinker a lot before landing on a fail-proof solution. Evolution doesn’t even need a fail-proof solution.
Evolution is “trying to find” a genome which, in interaction with reality, forms a brain that causes that human to have lots of kids. Evolution found a solution that mostly works, but sometimes doesn’t. Some humans decided that celibacy was the cool thing to do, or got too obsessed with something else to take the time to have a family. Note that this is different from how the recent distributional shift (mainly access to birth control, but also something about living in a rich country) has caused previously child-rich populations to fall, on average, below the replacement birth rate.
Evolution is fine with getting the alignment right in most minds, or even just a minority, as long as they are good enough at making babies. We might want better guarantees than that?
Going back to alignment with other humans: evolution did not directly optimise for human-to-human alignment, but it still produced humans that mostly care about other humans. Studying how this works seems like a great idea! But evolution also did not exactly nail human-to-human alignment. Most, but definitely not all, humans care about other humans. Ideally we want to build something much, much more robust.
Crazy (probably bad) idea: suppose we can build an AI design + training regime that mostly, but not certainly, turns out human-aligned AIs, and where the uncertainty is mostly random noise that is uncorrelated between AIs. Then maybe we should build lots of AIs of similar power and hope that, because the majority are aligned, this will turn out fine for us. Much like how you don’t need every single person in a country to care about animals in order for that country to implement animal protection laws.
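To gesture at why the uncorrelated-noise part matters: if each AI independently comes out aligned with some probability, then whether a majority of them are aligned is just a binomial tail. Here is a minimal sketch (the 0.9 per-AI probability is a number I made up for illustration, and the independence assumption is doing all the work):

```python
import math

def prob_majority_aligned(n_ais: int, p_aligned: float) -> float:
    """Probability that a strict majority of n_ais AIs come out aligned,
    assuming each one is aligned independently with probability p_aligned."""
    majority = n_ais // 2 + 1  # smallest count that is a strict majority
    return sum(
        math.comb(n_ais, k) * p_aligned**k * (1 - p_aligned) ** (n_ais - k)
        for k in range(majority, n_ais + 1)
    )

# Illustrative numbers only: with 90% per-AI alignment and uncorrelated
# failures, the chance of an aligned majority climbs toward 1 as n grows.
for n in (1, 11, 101):
    print(n, prob_majority_aligned(n, 0.9))
```

Of course, if the failures are correlated (same training regime, same blind spots), this calculation falls apart, which is part of why the idea is probably bad.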
But evolution also did not exactly nail human-to-human alignment. Most, but definitely not all, humans care about other humans.
Here’s a consideration which Quintin pointed out. It’s actually a good thing that there is variance in human altruism/caring. Consider a uniform random sample of 1024 people, and grade them by how altruistic/caring they are (in whatever sense you care to consider). There will be a large gap between the most aligned person and the median-aligned person. Therefore, by applying only 10 bits of optimization pressure to the generators of human alignment (in the genome + life experiences), you can massively increase the alignment properties of the learned values (selecting the single most aligned person out of 1024 is log2(1024) = 10 bits of selection). This implies that it’s relatively easy to optimize for alignment (in the human architecture & if you know what you’re doing).
Conversely, people have ~zero variance in how well they can fly. If it were truly hard (in theory) to improve the alignment of a trained policy, people would exhibit far less variance in their altruism, which would be bad news for training an AI which is even more altruistic than people are.
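To make the best-of-1024 gap concrete, here is a toy simulation. It assumes, purely for illustration, that altruism/caring is roughly normally distributed across people; nothing in the argument depends on that exact shape, but it shows how far 10 bits of selection (picking the top person out of 2^10 = 1024) moves you into the tail:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumption (illustrative only): each person's altruism/caring
# is a draw from a standard normal distribution.
n_people = 1024   # best-of-1024 = log2(1024) = 10 bits of selection
n_trials = 2000   # repeat to average out sampling noise

samples = rng.standard_normal((n_trials, n_people))
best = samples.max(axis=1)
median = np.median(samples, axis=1)

print(f"median person (avg over trials):  {median.mean():+.2f} std devs")
print(f"best of {n_people} (avg over trials): {best.mean():+.2f} std devs")
```

Under that toy assumption, the best-of-1024 person sits roughly three standard deviations above the median person, which is the "large gap" in question.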
(Just typing as I think...)
What if I push this line of thinking to the extreme? If I just pick agents randomly from the space of all agents, then this should be maximally random, and by that logic even better. But now the part where we can mine information about alignment from the fact that humans are at least somewhat aligned is gone, so this seems wrong. What is wrong here? Probably the fact that if you pick agents randomly from the space of all agents, you don’t get greater variation in alignment than if you pick random humans, because probably all the random agents you pick are just not aligned at all.
So what is doing most of the work here is that humans are more aligned than random, which I expect you to agree on. What you are also saying (I think) is that the tail-end level of alignment in humans matters more, in some way, than the mean or average level of alignment in humans: once we have the human distribution, we are just a few bits away from locating the tail of that distribution. E.g. we are 10 bits away from locating the top 0.1 percentile (since 2^-10 ≈ 0.1%). And because the tail is what matters, randomness is in our favor.
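A toy way to see this: 10 bits of selection only gets you the top 2^-10 ≈ 0.1% of whatever distribution you start from, so it helps exactly to the extent that the starting distribution already puts some mass near alignment. Both distributions below are made-up stand-ins, just to illustrate the contrast:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024  # 10 bits of selection: keep the best of 1024 draws

# Made-up stand-ins: "random humans" have real variation in how aligned
# they are, while "random agents from the space of all agents" are
# essentially never anywhere near aligned.
random_humans = rng.normal(loc=0.5, scale=1.0, size=n)
random_agents = rng.normal(loc=-10.0, scale=0.1, size=n)

print("best of 1024 random humans:", round(float(random_humans.max()), 2))
print("best of 1024 random agents:", round(float(random_agents.max()), 2))
```

Selecting the best of 1024 helps a lot in the first case and not at all in the second, because the tail you locate is still a tail of the distribution you started with.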
Does this capture what you are trying to say?