Oh, you’re stating potential mechanisms for humans’ alignment w/ other humans that you don’t think will generalize to AGI. It would be better for me to provide an informative mechanism that seems like it might generalize.
Turntrout’s other post claims that the genome likely doesn’t directly specify rewards for everything humans end up valuing. People’s specific families aren’t encoded as circuits in the limbic system, yet downstream of that crude reward system, many people end up valuing their families. There are more details to dig into here, but this already suggests that work on specifying rewards more exactly is less useful than work on understanding how crude rewards lead to downstream values.
A related point: humans don’t maximize the reward specified by their limbic system; they can instead be modeled as a system of inner optimizers that value proxies (e.g. most people wouldn’t push a wirehead button if doing so killed a loved one). This implies that inner optimizers which don’t optimize the base objective can be a good thing, which suggests that inner-alignment & outer-alignment are not the right terms to use.
There are other mechanisms, and I believe it’s imperative to dig deeper into them, develop a better theory of how learning systems grow values, and test that theory out on other learning systems.
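To make the “crude rewards lead to downstream values” point concrete, here is a minimal toy sketch (my own construction, not something from Turntrout’s post): tabular TD(0) learning on a short chain of states where only the final state emits a crude +1 reward. The earlier states are never rewarded directly, yet they acquire high learned value simply because they reliably precede reward, which is the kind of credit-assignment dynamic that could let a crude reward signal end up attaching value to things like one’s family.

```python
import random
from collections import defaultdict

# Toy sketch (my own construction): a 6-state chain where only reaching the
# final state gives a crude +1 reward. TD(0) propagates value backwards, so
# earlier states that were never directly rewarded end up highly valued
# because they reliably lead to the rewarded state.

N_STATES = 6      # states 0..5; state 5 is terminal and rewarded
ALPHA = 0.1       # learning rate
GAMMA = 0.9       # discount factor
EPISODES = 2000

V = defaultdict(float)  # learned state values, all start at 0

for _ in range(EPISODES):
    s = 0
    while s != N_STATES - 1:
        s_next = s + 1 if random.random() < 0.8 else s  # noisy forward progress
        r = 1.0 if s_next == N_STATES - 1 else 0.0      # crude, sparse reward
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])  # TD(0) update
        s = s_next

# States 0..4 never received reward directly, yet they carry positive value.
print({state: round(V[state], 3) for state in range(N_STATES)})
```

Running this prints values that rise monotonically toward the rewarded state, even though only the final transition was ever rewarded.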
This research direction may turn out to be fruitful, but I think I’m less optimistic about it than you are. Evolution can handle a lot of complexity, so its heuristics can carry lots of carefully tuned correlations that make them robust. Evolution experiments directly on reality, and has accumulated a ton of tweaks that it has verified actually work. And finally, this is one of the things evolution is most strongly focused on handling.
But maybe you’ll find something useful there. 🤷