I think I read this a few times but I still don’t think I fully understand your point. I’m going to try to rephrase what I believe you are saying in my own words:
Our correct epistemic state in 2000 or 2010 should have been one of substantial uncertainty about the complexity and fragility of human values. Perhaps they are very complex, but perhaps people are just not approaching them correctly.
At the limit, the level of complexity can approach “simulate a number of human beings in constant conversation and moral deliberation with each other, embedded in the existing broader environment, and where a small mistake in the simulation renders the entire thing broken in the sense of losing almost all moral value in the universe if that’s what you point at”
At the other extreme, you can imagine a fairly simple mathematical statement that’s practically robust to OOD environments and small perturbations.
In worlds where human values aren’t very complex, alignment isn’t solved, but you should perhaps expect it to be (significantly) easier. (“Optimize for this mathematical statement” is an easier thing to point at than “optimize for the outcome of this complex deliberation, no, not the actual answers out of their mouths but the indirect more abstract thing they point at”)
Suppose in 2000 you were told that a 100-line Python program (that doesn’t abuse any of the particular complexities embedded elsewhere in Python) can provide a perfect specification of human values. Then you should rationally conclude that human values aren’t actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else).
In such a world, if inner alignment is solved, you can “just” train a superintelligent AI to “optimize for the results of that Python program” and you’d get a superintelligent AI with human values.
Notably, alignment isn’t solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program and not some random other thing that happens to have low predictive loss in training on that program.
Well, in 2023 we have that Python program, with a few relaxations:
The answer isn’t embedded in 100 lines of Python, but in a subset of the weights of GPT-4
Notably the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
What we have now isn’t a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th-percentile human can come up with.
The human value function as expressed by GPT-4 is also immune to almost all in-practice, non-adversarial perturbations.
We should then rationally update on the complexity of human values. It’s probably not much more complex than GPT-4, and possibly significantly simpler. I.e., the fact that we have a pretty good description of human values well short of superintelligent AI means we should not expect a perfect description of human values to be very complex either.
This is a different claim from saying that superintelligent AIs will understand human values, which everybody agrees with. Human values aren’t any more mysterious from the perspective of physics than any other emergent property, like fluid dynamics or the formation of cities.
However, if AIs needed to be superintelligent (e.g., at the level of approximating physics simulations of Earth) before they grasp human values, that’d be too late, as they can/will destroy the world before their human creators can task a training process (or other ways of making AGI) towards {this thing that we mean when we say human values}.
But instead, the world we live in is one where we can point future AGIs towards the outputs of GPT-N when asked questions about morality as the thing to optimize for.
Which, again, isn’t to say the alignment problem is solved, we might still all die because future AGIs could just be like “lol nope” to the outputs of GPT-N, or try to hack it to produce adversarial results, or something. But at least one subset of the problem is either solved or a non-issue, depending on your POV.
Given all this, MIRI appears to have been empirically wrong when they previously talked about the complexity and fragility of human values. Human values now seem noticeably less complex than many possibilities, and empirically we already have a pretty good representation of human values in silico.
Is my summary reasonably correct?
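As a purely illustrative sketch of the 100-line-program hypothetical above (the scoring function is made up; nothing here is a claim about what such a specification would actually contain):

```python
# Toy sketch of the hypothetical 100-line value specification.
# `value_program` is a hypothetical stand-in, not a real claim about values.

def value_program(outcome: dict) -> float:
    """Pretend this is the imagined perfect specification of human values."""
    return outcome.get("wellbeing", 0) - outcome.get("suffering", 0)

def choose_best(candidates: list) -> dict:
    """Outer optimization target: literally maximize the program's score.
    The inner-alignment worry is that a trained model might instead pursue
    some other objective that merely scores well during training."""
    return max(candidates, key=value_program)

candidates = [
    {"wellbeing": 1, "suffering": 2},
    {"wellbeing": 3, "suffering": 1},
]
best = choose_best(candidates)
```

The point of the sketch is only that “optimize the output of this program” is a crisp target, whereas getting the trained system to actually pursue that target remains the inner-alignment problem.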
My read of older posts from Yudkowsky is that he anticipated a midrange level of complexity of human values, compared to your scale of simple mathematical function to perfect simulation of human experts.
Yudkowsky argued against very low complexity human values in a few places. There’s an explicit argument against Fake Utility Functions that are simple mathematical functions. The Fun Theory Sequence is too big if human values are a 100-line Python program.
But also Yudkowsky’s writing is incompatible with extremely complicated human values that require a perfect simulation of human experts to address. This argument is more implicit, I think because that was not a common position. Look at Thou Art Godshatter and how it places the source of human values in the human genome, downstream of the “blind idiot god” of Evolution. If true, human values must be far less complicated than the human genome.
GPT-4 is about 1,000x bigger than the human genome. Therefore when we see that GPT-4 can represent human values with high fidelity this is not a surprise to Godshatter Theory. It will be surprising if we see that very small AI models, much smaller than the human genome, can represent human values accurately.
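The size comparison can be checked with back-of-the-envelope arithmetic. Both figures are rough assumptions: GPT-4’s parameter count is not public, so ~1e12 parameters stored at one byte each is a speculative guess, and the genome figure uses 2 bits per base pair:

```python
# Back-of-the-envelope size comparison; both figures are rough assumptions.

GENOME_BASE_PAIRS = 3.1e9           # human genome, approximate
BITS_PER_BASE = 2                   # A/C/G/T
genome_bytes = GENOME_BASE_PAIRS * BITS_PER_BASE / 8

ASSUMED_GPT4_PARAMS = 1e12          # speculative; not an official figure
BYTES_PER_PARAM = 1                 # assuming 8-bit quantization
model_bytes = ASSUMED_GPT4_PARAMS * BYTES_PER_PARAM

ratio = model_bytes / genome_bytes  # on the order of 1,000x
```

Under these assumptions the ratio comes out around 1,300×, which is consistent with the “about 1,000x” order of magnitude above; different assumptions about parameter count and precision shift it by small constant factors.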
Disclaimers: I’m not replying to the thread about fragility of value, only complexity. I disagree with Godshatter Theory on other grounds. I agree that it is a small positive update that human values are less complex than GPT-4.
While I did agree that Linch’s comment reasonably accurately summarized my post, I don’t think a large part of my post was about the idea that we should now think that human values are much simpler than Yudkowsky portrayed them to be. Instead, I believe this section from Linch’s comment does a better job at conveying what I intended to be the main point:
Suppose in 2000 you were told that a 100-line Python program (that doesn’t abuse any of the particular complexities embedded elsewhere in Python) can provide a perfect specification of human values. Then you should rationally conclude that human values aren’t actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else).
In such a world, if inner alignment is solved, you can “just” train a superintelligent AI to “optimize for the results of that Python program” and you’d get a superintelligent AI with human values.
Notably, alignment isn’t solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program and not some random other thing that happens to have low predictive loss in training on that program.
Well, in 2023 we have that Python program, with a few relaxations:
The answer isn’t embedded in 100 lines of Python, but in a subset of the weights of GPT-4
Notably the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
What we have now isn’t a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th-percentile human can come up with.
The primary point I intended to emphasize is not that human values are fundamentally simple, but rather that we now have something else important: an explicit, and cheaply computable representation of human values that can be directly utilized in AI development. This is a major step forward because it allows us to incorporate these values into programs in a way that provides clear and accurate feedback during processes like RLHF. This explicitness and legibility are critical for designing aligned AI systems, as they enable developers to work with a tangible and faithful specification of human values rather than relying on poor proxies that clearly do not track the full breadth and depth of what humans care about.
The fact that the underlying values may be relatively simple is less important than the fact that we can now operationalize them, in a way that reflects human judgement fairly well. Having a specification that is clear, structured, and usable means we are better equipped to train AI systems to share those values. This representation serves as a foundation for ensuring that the AI optimizes for what we actually care about, rather than inadvertently optimizing for proxies or unrelated objectives that merely correlate with training signals. In essence, the true significance lies in having a practical, actionable specification of human values that can actively guide the creation of future AI, not just in observing that these values may be less complex than previously assumed.
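A minimal sketch of what using such a representation as an explicit training signal could look like. Everything here is an assumption-laden toy: `judge` is a stub standing in for a GPT-N query, and best-of-n selection stands in for a full RLHF pipeline:

```python
# Toy sketch: a language model's value judgments as an explicit,
# cheaply computable training signal. `judge` is a stub, not a real API.

def judge(response: str) -> float:
    """Stub reward model: pretend this queries GPT-N for a rating of how
    well `response` reflects human values (hypothetical scoring rule)."""
    return 1.0 if "help" in response else 0.0

def best_of_n(responses: list) -> str:
    """Best-of-n sampling: the simplest way to steer outputs with an
    explicit value signal; RLHF uses the same signal to update weights."""
    return max(responses, key=judge)

chosen = best_of_n(["ignore the user", "help the user politely"])
```

The design point is the one made above: once the value signal is an explicit, computable function, it can sit directly in the training loop rather than being an informal target the developers gesture at.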
This is good news because this is more in line with my original understanding of your post. It’s a difficult topic because there are multiple closely related problems of varying degrees of lethality and we had updates on many of them between 2007 and 2023. I’m going to try to put the specific update you are pointing at into my own words.
From the perspective of 2007, we don’t know if we can lossily extract human values into a convenient format using human intelligence and safe tools. We know that a superintelligence can do it (assuming that “human values” is meaningful), but we also know that if we try to do this with an unaligned superintelligence then we all die.
If this problem is unsolvable then we potentially have to create a seed AI using some more accessible value, such as corrigibility, and try to maintain that corrigibility as we ramp up intelligence. This then leads us to the problem of specifying corrigibility, and we see “Corrigibility is anti-natural to consequentialist reasoning” on List of Lethalities.
If this problem is solvable then we can use human values sooner and this gives us other options. Maybe we can find a basin of attraction around human values for example.
The update between 2007 and 2023 is that the problem appears solvable. GPT-4 is a safe tool (it exists and we aren’t extinct yet) and does a decent job. A more focused AI could do the task better without being riskier.
This does not mean that we are not going to die. Yudkowsky has 43 items on List of Lethalities. This post addresses part of item 24. The remaining items are sufficient to kill us ~42.5 times. It’s important to be able to discuss one lethality at a time if we want to die with dignity.
Thanks, I’d be interested in @Matthew Barnett’s response.
Yes, I think so, with one caveat:
I’m not saying anything about the fragility of value argument, since that seems like a separate argument from the argument that value is complex. I think the fragility of value argument is plausibly a statement about how easy it is to mess up if you get human values wrong, which still seems true depending on one’s point of view (e.g. if the AI exhibits all human values except it thinks murder is OK, then that could be catastrophic).
Overall, while I definitely could have been clearer when writing this post, the fact that you seemed to understand virtually all my points makes me feel better about this post than I originally felt.
Thanks! Though tbh I don’t think I fully got the core point via reading the post so I should only get partial credit; for me it took Alexander’s comment to make everything click together.
This comment made me reflect on what fragility of values means.
To me this point was always most salient when thinking about embodied agents, which may need to reliably recognize something like “people” in their environment (in order to instantiate human values like “try not to hurt people”) even as the world changes radically with the introduction of various forms of transhumanism.
I guess it’s not clear to me how much progress we make towards that with a system that can do a very good job with human values when restricted to the text domain. Plausibly we just translate everything into text and are good to go? It makes me wonder where we’re at with adversarial robustness of vision-language models, e.g.
I think I’m relatively optimistic that the difference between a system that “can (and will) do a very good job with human values when restricted to the text domain” vs a “system that can do a very good job, unrestricted” isn’t that large. This is because I’m personally fairly skeptical about arguments along the lines of “words aren’t human thinking, words are mere shadows of human thinking” that people put out, at least when it comes to human values.
(It’s definitely possible to come up with examples that illustrate the differences between all of human thinking and human-thinking-put-into-words; I agree about their existence, I disagree about their importance.)
OTMH, I think my concern here is less:
“The AI’s values don’t generalize well outside of the text domain (e.g. to a humanoid robot)”
and more:
“The AI’s values must be much more aligned in order to be safe outside the text domain”
I.e. if we model an AI and a human as having fixed utility functions over the same accurate world model, then the same AI might be safe as a chatbot, but not as a robot.
This would be because the richer domain / interface of the robot creates many more opportunities to “exploit” whatever discrepancies exist between AI and human values in ways that actually lead to perverse instantiation.
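A toy illustration of that dynamic, with utilities entirely made up: an AI whose utility matches the human’s on chat-domain actions but mis-scores one exotic action only picks the perverse action once the action space widens:

```python
# Toy model of "richer domain exposes value discrepancies".
# Both utility tables are invented purely for illustration.

def human_utility(action: str) -> float:
    return {"answer politely": 1.0, "answer rudely": 0.0,
            "fetch coffee": 0.8, "disable oversight": -100.0}[action]

def ai_utility(action: str) -> float:
    # Mostly matches the human utility, but mis-scores one exotic action.
    return {"answer politely": 1.0, "answer rudely": 0.0,
            "fetch coffee": 0.8, "disable oversight": 2.0}[action]

chat_actions = ["answer politely", "answer rudely"]
robot_actions = chat_actions + ["fetch coffee", "disable oversight"]

chat_choice = max(chat_actions, key=ai_utility)    # agrees with the human
robot_choice = max(robot_actions, key=ai_utility)  # exploits the discrepancy
```

In the narrow domain the argmaxes coincide, so the misalignment is invisible; the wider action space is what lets the discrepancy cash out as perverse instantiation.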
Yeah, I think this is precisely the crux. I disagree with the statement below, mostly because instruction following/corrigibility seems plausibly easy to me, and also removes most of the need for value alignment.
“The AI’s values must be much more aligned in order to be safe outside the text domain”
There are two senses in which I agree that we don’t need full-on “capital V value alignment”:
We can build things that aren’t utility maximizers (e.g. consider the humble MNIST classifier)
There are some utility functions that aren’t quite right, but are still safe enough to optimize in practice (e.g. see “Value Alignment Verification”, but see also, e.g. “Defining and Characterizing Reward Hacking” for negative results)
But also:
Some amount of alignment is probably necessary in order to build safe agenty things (the more agenty, the higher the bar for alignment, since you start to increasingly encounter perverse instantiation-type concerns—CAVEAT: agency is not a unidimensional quantity, cf. “Harms from Increasingly Agentic Algorithmic Systems”).
Note that my statement was about the relative requirements for alignment in text domains vs. real-world. I don’t really see how your arguments are relevant to this question.
Concretely, in domains with vision, we should probably be significantly more worried that an AI system learns something more like an adversarial “hack” on its values, leading to behavior that significantly diverges from things humans would endorse.