This post is a really great summary. Steve's posts on the thalamo-cortical-basal ganglia loop, and how it relates learning-from-scratch to value learning and action selection, have added a lot to my mental model of the brain's overall cognitive algorithms. (Although I would add that I currently see the basal ganglia as a more general-purpose dynamic routing system, able to act as both a multiplexer and a demultiplexer between sets of cortical regions for implementing arbitrarily complex cognitive algorithms. That is, it may have evolved for action selection, but humans, at least, have repurposed it for abstract thought, moving our conscious thought processes toward acting more CPU-like. But that's another discussion.)
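To gesture at what I mean by "routing", here's a deliberately toy sketch; the region names, activity vectors, and gating scores are all made up, and it only illustrates the multiplexer/demultiplexer idea, not anything about the real circuit:

```python
import numpy as np

# Toy illustration (my own, not from the post): the basal ganglia modeled as a
# gate that dynamically routes activity between cortical "regions", acting as
# a multiplexer (many candidate sources, one selected) or a demultiplexer
# (one source, routed to one selected target).

rng = np.random.default_rng(0)
regions = {name: rng.normal(size=8) for name in ["pfc", "parietal", "temporal", "motor"]}

def multiplex(gate_scores, sources):
    """Winner-take-all gating: select which source region's activity gets through."""
    winner = max(gate_scores, key=gate_scores.get)
    return winner, sources[winner]

def demultiplex(source_activity, gate_scores):
    """Route one source's activity to the single target region the gate selects."""
    target = max(gate_scores, key=gate_scores.get)
    return {target: source_activity}

# One "cognitive step": the gate chooses an input stream, then a destination.
chosen, activity = multiplex({"pfc": 0.9, "parietal": 0.4, "temporal": 0.2}, regions)
routed = demultiplex(activity, {"motor": 0.7, "temporal": 0.3})
print(chosen, list(routed))
```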
One commonly-proposed solution to this problem is to capture these intuitions indirectly through human-in-the-loop-style proposals like imitative amplification, safety via debate, reward modeling, etc., but it might also be possible to just “cut out the middleman” and install the relevant human-like social-psychological computations directly into the AGI. In slogan form, instead of (or in addition to) putting a human in the loop, we could theoretically put “humanness” in our AGI.
In my opinion, it’s got to be both. It definitely makes sense to try to give an AGI value priors that align as closely as possible with true human values. However, to ensure that the system remains robust even at arbitrarily high levels of intelligence, I think it’s critical to build in a mechanism whereby the AGI is constantly refining its model of human needs/goals and feeding that into how it steers its behavior. This would entail an ever-evolving theory of mind that uses human words, expressions, body language, etc. as Bayesian evidence of humans’ internal states, and an action-selection architecture that is always trying to optimize for its current best model of human need/goal satisfaction.
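To make that concrete, here's a minimal toy sketch of the kind of loop I have in mind. Everything in it (the two latent states, the cue likelihoods, the candidate actions and their values) is a made-up stand-in, not a proposal for the actual model:

```python
import numpy as np

# Toy theory-of-mind loop (my own illustration): the agent maintains a
# posterior over a latent human state ("satisfied" vs. "frustrated"), updates
# it from observed cues treated as Bayesian evidence, and picks the action
# with the highest expected need satisfaction under that posterior.

states = ["satisfied", "frustrated"]
posterior = np.array([0.5, 0.5])  # prior over the human's internal state

# Assumed P(cue | state): e.g., smiles are much more likely when satisfied.
likelihood = {
    "smile":   np.array([0.7, 0.1]),
    "frown":   np.array([0.1, 0.6]),
    "neutral": np.array([0.2, 0.3]),
}

def update(posterior, cue):
    """Bayes rule: posterior is proportional to likelihood * prior."""
    unnorm = likelihood[cue] * posterior
    return unnorm / unnorm.sum()

# Assumed expected "need satisfaction" of each candidate action, per latent state.
action_value = {
    "continue_task":      np.array([0.9, 0.2]),
    "pause_and_check_in": np.array([0.6, 0.8]),
}

for cue in ["frown", "neutral"]:
    posterior = update(posterior, cue)

best_action = max(action_value, key=lambda a: action_value[a] @ posterior)
print(posterior, best_action)
```

The point is just the shape of the loop: perception keeps updating the model of the human, and action selection always consults the current posterior rather than a frozen value specification.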
Understanding the algorithms of the human brainstem/hypothalamus would help immensely in providing a good prior for that model (e.g., knowing that smiles are evidence of satisfied preferences, that frowns are evidence of violated preferences, and that humans should have access to food, water, shelter, and human community gives the AGI a much better head start than making it figure all of that out on its own). But it should still have the sort of architecture that lets it figure out human preferences from scratch and try its best to satisfy them, in case we misspecify something in our model.
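As a toy illustration of why the prior helps but isn't strictly necessary, here's a sketch where the agent estimates P(smile | preference satisfied) with a simple Beta-Bernoulli model; the specific numbers (a Beta(8, 2) "head start" prior vs. a flat Beta(1, 1) prior, a true rate of 0.8) are made up:

```python
import numpy as np

# Toy comparison (my own numbers): an informative prior, standing in for what
# a brainstem/hypothalamus model might supply, starts near the truth; a flat
# prior has to learn the same smile-satisfaction mapping from scratch, which
# just takes more evidence.

rng = np.random.default_rng(1)
true_p = 0.8                         # "ground truth": smiles usually follow satisfaction
observations = rng.random(20) < true_p

def posterior_mean(alpha, beta, data):
    """Posterior mean of a Beta(alpha, beta) prior after Bernoulli observations."""
    successes = int(np.sum(data))
    return (alpha + successes) / (alpha + beta + len(data))

for label, (a, b) in {"informative prior": (8, 2), "flat prior": (1, 1)}.items():
    print(label,
          "after 5 obs:",  round(posterior_mean(a, b, observations[:5]), 2),
          "after 20 obs:", round(posterior_mean(a, b, observations[:20]), 2))
```

Both versions converge; the informative prior just gets there with less evidence, which is exactly the kind of head start I mean.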