I’ve read this a few times, but I still don’t think I fully understand your point. I’m going to try to rephrase what I believe you are saying in my own words:
Our correct epistemic state in 2000 or 2010 should have been to have a lot of uncertainty about the complexity and fragility of human values. Perhaps they are very complex, but perhaps people were just not approaching the problem correctly.
At the limit, the level of complexity can approach “simulate a number of human beings in constant conversation and moral deliberation with each other, embedded in the existing broader environment, and where a small mistake in the simulation renders the entire thing broken in the sense of losing almost all moral value in the universe if that’s what you point at”
At the other extreme, you can imagine a fairly simple mathematical statement that’s practically robust to any OOD environment or small perturbation.
In worlds where human values aren’t very complex, alignment isn’t solved, but you should perhaps expect it to be (significantly) easier. (“Optimize for this mathematical statement” is an easier thing to point at than “optimize for the outcome of this complex deliberation, no, not the actual answers out of their mouths but the indirect more abstract thing they point at”)
Suppose in 2000 you were told that a 100-line Python program (one that doesn’t abuse any of the particular complexities embedded elsewhere in Python) could provide a perfect specification of human values. Then you should rationally conclude that human values aren’t actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else).
In such a world, if inner alignment is solved, you can “just” train a superintelligent AI to “optimize for the results of that Python program” and you’d get a superintelligent AI with human values.
Notably, alignment isn’t solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program and not some random other thing that happens to have low predictive loss in training on that program.
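To make the shape of that hypothetical concrete, here’s a minimal toy sketch in Python (the language of the thought experiment). Everything in it is an invented placeholder: `toy_value_program` stands in for the imagined 100-line specification, and its scoring rule is obviously not a real spec of human values. The point is only that the outer objective would literally be “maximize the program’s output,” with inner alignment left as the remaining problem.

```python
# Toy sketch only: `toy_value_program` and the example plans are invented
# placeholders, not an actual specification of human values.

def toy_value_program(outcome: dict) -> float:
    """Stand-in for the hypothetical 100-line value specification.
    In the thought experiment this would be a *perfect* spec; here it is
    just a placeholder scoring rule over described outcomes."""
    return outcome.get("wellbeing", 0.0) - 10.0 * outcome.get("harm", 0.0)

def choose_plan(candidate_plans: list[dict]) -> dict:
    """Outer objective made trivial: the optimization target is literally
    the output of the value program, nothing else."""
    return max(candidate_plans, key=toy_value_program)

if __name__ == "__main__":
    plans = [
        {"wellbeing": 5.0, "harm": 0.0},
        {"wellbeing": 9.0, "harm": 1.0},  # more raw wellbeing, but heavily penalized harm
    ]
    print(choose_plan(plans))  # -> {'wellbeing': 5.0, 'harm': 0.0}
```

The hard part left over is exactly the inner-alignment step in the previous paragraph: getting the trained system to actually pursue this objective rather than a proxy that merely scored well during training.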
Well, in 2023 we have that Python program, with a few relaxations:
The answer isn’t embedded in 100 lines of Python, but in a subset of the weights of GPT-4
Notably, the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
What we have now isn’t a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th-percentile human can come up with.
The human value function as expressed by GPT-4 is also immune to almost all in-practice, non-adversarial perturbations.
We should then rationally update on the complexity of human values. They’re probably not much more complex than GPT-4, and possibly significantly simpler. I.e., the fact that we have a pretty good description of human values well short of superintelligent AI means we should not expect a perfect description of human values to be very complex either.
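One way to make that upper-bound intuition explicit (my framing, not something stated in the post): the approximate value function is computed by GPT-4’s weights plus a short prompting/extraction procedure, so its description length is bounded by theirs,

$$K(\text{value function as expressed by GPT-4}) \;\le\; K(\text{GPT-4 weights}) + K(\text{extraction procedure}) + O(1),$$

and since the weights encode far more than just values, the true figure is plausibly much lower than the right-hand side.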
This is a different claim from saying that superintelligent AIs will understand human values, which everybody agrees with. Human values aren’t any more mysterious from the perspective of physics than any other emergent property, like fluid dynamics or the formation of cities.
However, if AIs needed to be superintelligent (e.g., at the level of approximating physics simulations of Earth) before they grasped human values, that’d be too late, as they could/would destroy the world before their human creators could point a training process (or other way of making AGI) towards {this thing that we mean when we say human values}.
But instead, the world we live in is one where we can point future AGIs towards the outputs of GPT-N when it is asked questions about morality, as the thing to optimize for.
Which, again, isn’t to say the alignment problem is solved; we might still all die because future AGIs could just be like “lol nope” to the outputs of GPT-N, or try to hack it into producing adversarial results, or something. But at least one subset of the problem is either solved or a non-issue, depending on your POV.
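As a very hedged illustration of what “optimize for GPT-N’s answers on moral questions” might look like as a target, here’s a sketch. `query_gpt_n` is a made-up placeholder, not a real API, and its canned reply just keeps the example runnable.

```python
# Illustrative sketch only: `query_gpt_n` is a made-up placeholder for whatever
# interface a future model might expose; nothing here is a real API.

def query_gpt_n(prompt: str) -> str:
    """Stand-in for asking GPT-N a question; returns a canned answer so the sketch runs."""
    return "7"

def value_oracle(action_description: str) -> float:
    """Score a proposed action by asking the model to rate it on a 0-10 scale."""
    answer = query_gpt_n(
        "On a scale of 0 to 10, how well does the following action reflect "
        f"broadly shared human values?\n\n{action_description}\n\n"
        "Reply with a single number."
    )
    return float(answer.strip())

if __name__ == "__main__":
    # An outer loop would pick whichever candidate the oracle scores highest
    # (with the canned reply everything scores 7, so this just returns the first one).
    candidates = ["donate the surplus to famine relief", "seize the surplus for yourself"]
    print(max(candidates, key=value_oracle))
```

The failure modes above map onto this directly: an agent ignoring the oracle (“lol nope”) is the pointing/inner-alignment failure, and an agent searching for descriptions that fool the oracle is adversarial optimization against the proxy.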
Given all this, MIRI appears to have been empirically wrong when they previously talked about the complexity and fragility of human values. Human values now seem noticeably less complex than many of the possibilities, and empirically we already have a pretty good representation of human values in silico.
Is my summary reasonably correct?
Yes, I think so, with one caveat:
I’m not saying anything about the fragility of value argument, since that seems like a separate argument from the argument that value is complex. I think the fragility of value argument is plausibly a statement about how easy it is to mess up if you get human values wrong, which still seems true depending on one’s point of view (e.g., if the AI exhibits all human values except it thinks murder is OK, then that could be catastrophic).
Overall, while I definitely could have been clearer when writing this post, the fact that you seemed to understand virtually all my points makes me feel better about this post than I originally felt.
Thanks! Though tbh I don’t think I fully got the core point via reading the post so I should only get partial credit; for me it took Alexander’s comment to make everything click together.