But also, evolution did not exactly nail human-to-human alignment: most, but definitely not all, humans care about other humans.
Here’s a consideration which Quintin pointed out. It’s actually a good thing that there is variance in human altruism/caring. Consider a uniform random sample of 1024 people, and grade them by how altruistic / caring they are (in whatever sense you care to consider). The most aligned and median-aligned people will have a large gap. Therefore, by applying only 10 bits of optimization pressure to the generators of human alignment (in the genome+life experiences), you can massively increase the alignment properties of the learned values. This implies that it’s relatively easy to optimize for alignment (in the human architecture & if you know what you’re doing).
Conversely, people have ~zero variance in how well they can fly. If it were truly hard (in theory) to improve the alignment of a trained policy, people would exhibit far less variance in their altruism, which would be bad news for training an AI which is even more altruistic than people are.
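To make the "10 bits" framing concrete for myself, here is a minimal sketch. The standard normal is just a toy stand-in for whatever the real distribution of human caring looks like; the point survives any reasonable choice of distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: "caring" scores for a uniform random sample of 1024 people.
# Picking the single best of 2**10 = 1024 samples is 10 bits of selection.
people = rng.normal(loc=0.0, scale=1.0, size=1024)

median_person = np.median(people)  # the median-aligned person
best_person = people.max()         # the most-aligned person

print(f"median: {median_person:.2f}  best: {best_person:.2f}")
# For a normal distribution the max of 1024 draws typically lands around
# +3 sigma, so the gap between the median and the best person is large --
# that gap is what the 10 bits of optimization pressure buys you.
```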
(Just typing as I think...)
What if I push this line of thinking to the extreme? If I just pick agents randomly from the space of all agents, then this should be maximally random, and by that logic it should be even better. But now the part where we can mine information about alignment from the fact that humans are at least somewhat aligned is gone, so this seems wrong. What is wrong here? Probably that if you pick agents randomly from the space of all agents, you don't get greater variation in alignment compared to picking random humans, because essentially all the random agents you pick are just not aligned at all.
So what is doing most of the work here is that humans are more aligned than random, which I expect you agree with. What you are also saying (I think) is that the tail end of the alignment distribution in humans matters more, in some way, than the mean or average level of alignment in humans. Because if we have the human distribution, we are only a few bits away from locating the tail of that distribution, e.g. about 10 bits away from locating the top ~0.1%. And because the tail is what matters, the variance is in our favor.
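Continuing the same toy sketch to check the "random agents" intuition (again, these distributions are made-up placeholders, not claims about the real space of agents):

```python
import numpy as np

rng = np.random.default_rng(0)

# Humans: more aligned than random on average, with real spread (toy numbers).
humans = rng.normal(loc=1.0, scale=1.0, size=1024)

# Agents drawn at random from agent-space: essentially all unaligned, so a
# spike near zero with negligible spread (again, a made-up placeholder).
random_agents = rng.normal(loc=0.0, scale=1e-6, size=1024)

# Apply the same 10 bits of selection (best of 1024) to both distributions.
print("best of 1024 humans       :", round(humans.max(), 2))        # ~ +4
print("best of 1024 random agents:", round(random_agents.max(), 6)) # ~ 0

# The same 10 bits pay off very differently: selection only finds an aligned
# tail if the distribution you start from has one. 10 bits singles out the
# top 1/2**10, i.e. roughly the top 0.1% of whatever you are sampling from.
```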
Does this capture what you are trying to say?