To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to an AGI that is much more powerful than humans.
Parents have way more power than their kids, and there exist some parents who are very loving (i.e. aligned) towards their kids. There are also many, many people who care about their pets, and animal rights advocates exist.
Well, the power relations thing was one example of one mechanism. There are other mechanisms which influence other things, but I wouldn’t necessarily trust them to generalize either.
Ah, yes, I recognized I was only replying to an example you gave, and decided to post a separate comment on the more general point :)
Could you elaborate?
One factor I think is relevant is:
Suppose you are empowered in some way, e.g. you are healthy and strong. In that case, you could support systems that grant preference to the empowered. But that might not be a good idea, because you could become disempowered, e.g. catch a terrible illness, and in that case the systems would end up screwing you over.
In fact, it is precisely when you become disempowered that you would most need the system's help, so you would probably weight this consideration more strongly than the raw probability of becoming disempowered would suggest.
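To make that concrete, here is a minimal expected-value sketch with made-up numbers (purely illustrative, not from anything above):

```python
# Toy expected-value comparison with made-up numbers (purely illustrative).
p_disempowered = 0.10           # chance you end up disempowered
benefit_if_empowered = 1.0      # marginal value of the system's support while empowered
benefit_if_disempowered = 20.0  # support matters far more when you actually need it

expected_value_of_support = (
    (1 - p_disempowered) * benefit_if_empowered
    + p_disempowered * benefit_if_disempowered
)

# The empowered state contributes 0.9 and the disempowered state 2.0 of the
# ~2.9 total: roughly 69% of the weight, despite only a 10% chance of occurring.
print(expected_value_of_support)  # ≈ 2.9
```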
So people may, under some conditions, have an incentive to support systems that benefit others. And one such system could be a general moral agreement that “everyone should be treated as having equal inherent worth, regardless of their power”.
Establishing such a norm will then tend to have knock-on effects outside of the original domain of application, e.g. granting support to people who have never been empowered. But the knock-on effects seem potentially highly contingent, and there are many degrees of freedom in how to generalize the norms.
This is not the only factor of course, I’m not claiming to have a comprehensive idea of how morality works.
Oh, you’re stating potential mechanisms for human alignment w/ humans that you don’t think will generalize to AGI. It would be better, then, for me to provide an informative mechanism that seems like it might generalize.
Turntrout’s other post claims that the genome likely doesn’t directly specify rewards for everything humans end up valuing. People’s specific families aren’t encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but this already implies that working to specify rewards more exactly is less useful than understanding how crude rewards lead to downstream values.
A related point: humans don’t maximize the reward specified by their limbic system; they can instead be modeled as a system of inner optimizers that value proxies (e.g. most people wouldn’t push a wirehead button if it killed a loved one). This implies that inner optimizers which don’t optimize the base objective can be a good thing, which suggests that inner-alignment & outer-alignment aren’t the right terms to use.
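As a toy sketch of that point (the state variables and numbers are hypothetical, nothing here is taken from the post): an agent whose choices are driven by learned proxy values rather than by its raw reward signal will decline the wirehead button:

```python
# Toy illustration, not a real agent: choices driven by learned proxy values
# rather than by the raw reward signal.

def reward_signal(state):
    # Crude "limbic system"-style reward: maxed out by wireheading.
    return 100.0 if state["wireheaded"] else state["comfort"]

def learned_values(state):
    # Proxies picked up during learning: the loved one's survival dominates.
    return 1000.0 * state["loved_one_alive"] + state["comfort"]

press_button = {"wireheaded": True, "loved_one_alive": False, "comfort": 0.0}
decline = {"wireheaded": False, "loved_one_alive": True, "comfort": 5.0}

options = [press_button, decline]
print(max(options, key=reward_signal) is press_button)  # True: reward says press
print(max(options, key=learned_values) is decline)      # True: the agent declines
```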
There are other mechanisms, and I believe it’s imperative to dig deeper into them, develop a better theory of how learning systems grow values, and test that theory out on other learning systems.
This research direction may turn out to be fruitful, but I think I’m less optimistic about it than you are. Evolution can deal with a lot of complexity, so its heuristics can encode many careful correlations that make them robust. Evolution experiments directly on reality, and has accumulated a ton of tweaks that it has verified actually work. And finally, this is one of the things evolution is most strongly focused on handling.
But maybe you’ll find something useful there. 🤷