I strongly agree with your focus on ambitious value learning, rather than approaches that focus more on control (e.g., myopia). What we want is an AGI that can robustly identify humans (and, I would argue, any agentic system), determine their values in an iteratively improving way, and treat these learned values as its own. That is, we should be looking for models where goal alignment and a desire to cooperate with humanity are situated within a broad basin of attraction (much as corrigibility is supposed to work), where any misalignment that the AGI notices (or that humans point out to it) is treated as an error signal that pulls its value model back into the basin. For such a scheme to work, of course, you need some way for it to infer human goals (by watching human behavior? by imagining what it would have to be trying to achieve in order to behave the same way?), some way for the AGI to represent “human goals” once it has inferred them, some way for it to represent “my own goals” in the same conceptual space (while still using those goal representations to drive its own behavior), and some way for it to use any differences between these representations to make itself more aligned (something like online gradient descent?).
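To make that loop concrete, here is a minimal toy sketch (my own illustration, not anything from the post): it assumes values can be embedded as vectors in a shared representation space, uses a hypothetical `infer_human_values` stand-in for whatever goal-inference machinery the AGI would actually need, and lets online gradient descent on the mismatch play the role of the error signal pulling the AGI’s value model back into the basin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: human values and the AGI's own values are points in the
# same representation space.
true_human_values = np.array([0.8, -0.3, 0.5])   # unknown to the AGI
agent_values = rng.normal(size=3)                # starts off misaligned
learning_rate = 0.1

def infer_human_values():
    """Stand-in for goal inference (e.g. from observed human behavior)."""
    return true_human_values + rng.normal(scale=0.05, size=3)

for step in range(200):
    estimate = infer_human_values()
    # Any mismatch between "my goals" and "their goals" is the error signal.
    error = agent_values - estimate
    # Online gradient descent on 0.5 * ||error||^2 pulls the AGI's values
    # back toward its best estimate of human values -- the basin of attraction.
    agent_values -= learning_rate * error

print(np.round(agent_values, 2))  # ends up close to true_human_values
```

Obviously the hard part is everything this sketch assumes away: getting the inference right and getting both sets of goals into one shared representation in the first place.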
And I think that solutions to this line of research would involve building generative agentic models into the AGI’s architecture to give it strong inductive priors for detecting human agency in its world model (using something along the lines of analysis by synthesis or predictive coding). We wouldn’t necessarily have to figure out everything about how the human mind works in order to build this (although that would certainly help), just enough for it to have the tools to teach itself how humans think and act: how they maintain homeostasis, generate new goals, use moral instincts of empathy, fairness, reciprocity, and status-seeking, and so on. And as long as it is built to treat its best model of human values and goals as its own values and goals, I think we wouldn’t need to worry about it torturing simulated humans, no matter how sophisticated its agentic models get. Of course, this would require figuring out how to detect agentic models in general systems, as you mentioned, so that we can make sure that the only parts of the AGI capable of simulating agents are those that have their preferences routed to the AGI’s own preference modules.
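As a concrete (and heavily simplified) illustration of the analysis-by-synthesis idea: a generative model proposes a goal, simulates what a goal-directed agent would do under that goal, and scores the observed behavior against the prediction. Everything below (the candidate goals, the forward model `predicted_step`, the noise level) is an assumption of mine for illustration, not anything specified in the comment above.

```python
import numpy as np

# Candidate goals the observed human might be pursuing (2D positions).
candidate_goals = np.array([[5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
step_size, noise = 1.0, 0.5

def predicted_step(position, goal):
    """Forward (generative) model: a goal-directed agent steps toward its goal."""
    direction = goal - position
    return step_size * direction / (np.linalg.norm(direction) + 1e-9)

def log_likelihood(trajectory, goal):
    """Score: how well does this candidate goal explain the observed steps?"""
    total = 0.0
    for pos, nxt in zip(trajectory[:-1], trajectory[1:]):
        residual = (nxt - pos) - predicted_step(pos, goal)
        total += -np.sum(residual ** 2) / (2 * noise ** 2)
    return total

# Observed behavior: someone walking roughly toward (5, 5).
observed = np.array([[0.0, 0.0], [0.7, 0.7], [1.4, 1.5], [2.1, 2.2]])

scores = np.array([log_likelihood(observed, g) for g in candidate_goals])
posterior = np.exp(scores - scores.max())
posterior /= posterior.sum()
print(dict(zip(["east", "north", "northeast"], np.round(posterior, 3))))
```

In this sketch, the inferred posterior over goals is the kind of representation that could then be routed into the AGI’s own preference modules rather than into any free-running simulation of the human.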
I strongly agree with your focus on ambitious value learning, rather than approaches that focus more on control (e.g., myopia).
Interesting observation on the above post! Though I do not read it explicitly in John’s Plan, I guess you can indeed read it as implicitly rejecting routes to alignment that focus on control/myopia: routes that do not first visit step 2 of successfully solving automatic/ambitious value learning.
John, can you confirm this?
Background: my own default Plan does focus on control/myopia. I feel that this line of attack for solving AGI alignment (if we ever get weak or strong AGI) is reaching the stage where all the major points of ‘fundamental confusion’ have been resolved. So for me this approach represents the true ‘easier strategy’.
It’s quite possible that control is easier than ambitious value learning, but I doubt that it’s as sustainable. Approaches like myopia, IDA, or HCH would probably get you an AGI that stays aligned up to much higher levels of intelligence than you would get without them, all else being equal. But if there is nothing explicitly pulling its motivations back toward a basin of value alignment, then I feel like these approaches would be prone to drifting out of alignment at some level of capability beyond which no human could tell what’s going on with the system.
I do think that methods of control are worthwhile to pursue over the short term, but we had better be working on ambitious value learning at the same time, for when an ASI inevitably escapes our control anyway. Even if myopia, for instance, worked perfectly to constrain what some AGI is able to plot, it still seems likely that someone, somewhere, will try fiddling around with another AGI’s time horizon parameters and cause a disaster. It would be better if AGI models had, from the beginning, at least some value learning system built in by default to act as an extra safeguard.
I agree in general that pursuing multiple alternative alignment approaches (and using them all together to create higher levels of safety) is valuable. I am more optimistic than you that we can design control systems (different from time-horizon-based myopia) which will be stable and understandable even at higher levels of AGI competence.
it still seems likely that someone, somewhere, will try fiddling around with another AGI’s time horizon parameters and cause a disaster.
Well, if you worry about people fiddling with control system tuning parameters, you also need to worry about someone fiddling with value learning parameters so that the AGI will only learn the values of a single group of people who would like to rule the rest of the world. Assuming that AGI is possible, I believe it is most likely that Bostrom’s orthogonality thesis will hold for it. I am not optimistic about designing an AGI system which is inherently fiddle-proof.