While I did agree that Linch’s comment summarized my post reasonably accurately, I don’t think a large part of my post was about the idea that we should now consider human values to be much simpler than Yudkowsky portrayed them. Instead, I believe this section of Linch’s comment does a better job of conveying what I intended as the main point:
Suppose in 2000 you were told that a 100-line Python program (that doesn’t abuse any of the particular complexities embedded elsewhere in Python) can provide a perfect specification of human values. Then you should rationally conclude that human values aren’t actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else).
In such a world, if inner alignment is solved, you can “just” train a superintelligent AI to “optimize for the results of that Python program” and you’d get a superintelligent AI with human values.
Notably, alignment isn’t solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program and not some random other thing that happens to have low predictive loss in training on that program.
Well, in 2023 we have that Python program, with a few relaxations:
The answer isn’t embedded in 100 lines of Python, but in a subset of the weights of GPT-4
Notably the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
What we have now isn’t a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th percentile human can come up with.
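To make the hypothetical in that quote concrete, here is a minimal toy sketch of what “optimize for the results of that Python program” would look like as an outer objective. Every name in it is hypothetical, and value_spec is a deliberately meaningless stand-in; the sketch says nothing about the inner-alignment half of the problem that Linch flags.

```python
# Toy sketch of the setup Linch describes. All names are hypothetical and
# value_spec is a meaningless stand-in: the point is only the shape of
# "optimize for the results of that program", not a real specification.
import random


def value_spec(outcome: list[float]) -> float:
    """Stand-in for the hypothetical program specifying human values."""
    target = [0.3, -0.7, 0.1]  # pretend "good" outcomes sit near this state
    return -sum((x - t) ** 2 for x, t in zip(outcome, target))


def optimize_against_spec(steps: int = 10_000) -> list[float]:
    """Hill-climb toward whatever the specification scores highest.

    This is only the outer objective. Whether a trained system actually
    internalizes this objective, rather than a proxy that happened to
    correlate with it during training, is the inner-alignment problem
    and is untouched here.
    """
    current = [random.uniform(-1.0, 1.0) for _ in range(3)]
    for _ in range(steps):
        candidate = [x + random.gauss(0.0, 0.05) for x in current]
        if value_spec(candidate) > value_spec(current):
            current = candidate
    return current


if __name__ == "__main__":
    best = optimize_against_spec()
    print(best, value_spec(best))
```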
The primary point I intended to emphasize is not that human values are fundamentally simple, but rather that we now have something else important: an explicit and cheaply computable representation of human values that can be directly utilized in AI development. This is a major step forward because it allows us to incorporate these values into programs in a way that provides clear and accurate feedback during processes like RLHF. This explicitness and legibility are critical for designing aligned AI systems, as they enable developers to work with a tangible and faithful specification of human values rather than relying on poor proxies that clearly do not track the full breadth and depth of what humans care about.
The fact that the underlying values may be relatively simple is less important than the fact that we can now operationalize them, in a way that reflects human judgement fairly well. Having a specification that is clear, structured, and usable means we are better equipped to train AI systems to share those values. This representation serves as a foundation for ensuring that the AI optimizes for what we actually care about, rather than inadvertently optimizing for proxies or unrelated objectives that merely correlate with training signals. In essence, the true significance lies in having a practical, actionable specification of human values that can actively guide the creation of future AI, not just in observing that these values may be less complex than previously assumed.
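As a rough illustration of what a cheaply computable value representation buys in practice, here is a sketch of using such a model to label preference pairs for an RLHF-style pipeline. The ask_value_model function is a hypothetical stand-in (no particular API is assumed), and the placeholder scoring inside it exists only so the sketch executes.

```python
# Sketch of using a cheaply computable value model as the feedback source in
# an RLHF-style pipeline. ask_value_model is a hypothetical stand-in: in
# practice it would query a strong language model for a judgment; here it
# returns a placeholder so the example executes.
from dataclasses import dataclass


def ask_value_model(prompt: str, response: str) -> float:
    """Hypothetical: rate how well `response` to `prompt` accords with
    human values, as judged by the value model."""
    return float(len(response))  # placeholder only, not a real value judgment


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the value model scored higher
    rejected: str  # response the value model scored lower


def label_pair(prompt: str, response_a: str, response_b: str) -> PreferencePair:
    """Turn two candidate responses into a preference label, using the value
    model in place of a weak hand-written proxy for what humans care about."""
    if ask_value_model(prompt, response_a) >= ask_value_model(prompt, response_b):
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


if __name__ == "__main__":
    pair = label_pair(
        "How should I handle a found wallet?",
        "Keep it, nobody will know.",
        "Return the wallet with the cash to its owner or the police.",
    )
    print(pair.chosen)
```

Training a reward model on labels like these, rather than on ad hoc hand-written proxies, is the sense in which an explicit value representation can actively guide the training signal.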
This is good news, because it is more in line with my original understanding of your post. It’s a difficult topic because there are multiple closely related problems of varying degrees of lethality, and we had updates on many of them between 2007 and 2023. I’m going to try to put the specific update you are pointing at into my own words.
From the perspective of 2007, we don’t know if we can lossily extract human values into a convenient format using human intelligence and safe tools. We know that a superintelligence can do it (assuming that “human values” is meaningful), but we also know that if we try to do this with an unaligned superintelligence then we all die.
If this problem is unsolvable then we potentially have to create a seed AI using some more accessible value, such as corrigibility, and try to maintain that corrigibility as we ramp up intelligence. This then leads us to the problem of specifying corrigibility, and we see “Corrigibility is anti-natural to consequentialist reasoning” on the List of Lethalities.
If this problem is solvable then we can use human values sooner, and this gives us other options. Maybe we can find a basin of attraction around human values, for example.
The update between 2007 and 2023 is that the problem appears solvable. GPT-4 is a safe tool (it exists and we aren’t extinct yet) and does a decent job. A more focused AI could do the task better without being riskier.
This does not mean that we are not going to die. Yudkowsky has 43 items on his List of Lethalities. This post addresses part of item 24. The remaining items are sufficient to kill us ~42.5 times. It’s important to be able to discuss one lethality at a time if we want to die with dignity.