Thanks for writing this! Here are some of my rough thoughts and comments.
One of my big disagreements with this threat model is that it assumes it is hard to get an AGI to understand / successfully model ‘human values’. I think this is obviously false. LLMs already have a very good understanding of ‘human values’ as they are expressed linguistically, and existing alignment techniques like RLHF/RLAIF seem to do a reasonably good job of making the models’ output align with these values (specifically, generic corporate wokeness in the case of OpenAI/Anthropic). This alignment also appears to generalise reasonably well to examples that are highly unlikely to have been seen in training (although it has erred on the side of overzealousness of late, in my experience). This isn’t that surprising, because such values do not have to be specified from scratch by the fine-tuning; they should already be extremely well represented as concepts in the base model’s latent space and merely have to be given primacy. Things would be different, of course, if we wanted to align the LLMs to some truly arbitrary blue-and-orange morality not represented in the human text corpus, but naturally we don’t.
Of course such values cannot easily be represented as some mathematical utility function, but I think that is an extremely hard problem in general, verging on impossible, since a mathematical utility function is not the natural type of human values in the first place: human values are mostly linguistic constructs that exist in the latent space and not in reality. This is not just a problem with human values but with almost any kind of abstract goal you might want to give the AGI, including things like ‘maximise paperclips’. This is why AGI will almost certainly not be a direct utility maximiser but will instead use a learnt utility function over latents from its own generative model; in that case it can represent human values, and indeed any goal expressible in natural language, which of course it will understand.
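To make the ‘learnt utility function over latents’ idea concrete, here is a minimal sketch in that spirit (not anyone’s actual proposal): the `base_model.encode` interface, the dimensions, and the head architecture are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LatentUtilityModel(nn.Module):
    """Sketch of a learnt utility function defined over a generative model's latents.

    `base_model` is a hypothetical frozen pretrained model assumed to expose
    `encode(text) -> latent tensor`; the small head maps that latent to a scalar utility.
    """
    def __init__(self, base_model, latent_dim: int):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():      # keep the generative model frozen
            p.requires_grad_(False)
        self.utility_head = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, outcome_description: str) -> torch.Tensor:
        with torch.no_grad():
            z = self.base_model.encode(outcome_description)   # latent for the described outcome
        return self.utility_head(z).squeeze(-1)               # scalar utility estimate
```

The point is just that the utility signal is defined over the generative model's own latent representation of an outcome, so anything that model can represent, including natural-language descriptions of human values, is in principle available as a goal.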
On a related note, this is also why I am not at all convinced by the supposed issues over indexicality. Having the requisite theory of mind to understand that different agents have different indexical needs should be table stakes for any serious AGI, and indeed hardly any humans have issues with this, except for people trying to formalise it into math.
There is still a danger of over-optimisation, which is essentially a kind of overfitting and can be dealt with in a number of ways that are pretty standard now. In general terms, you would want the AI to represent its uncertainty over both outcomes and its utility approximator, and use this to derive a conservative rather than purely maximising policy, which can be adjusted over time.
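As a rough illustration of the standard machinery this points at, here is a sketch that scores candidate actions by a pessimistic lower confidence bound over an ensemble of utility models rather than a pure max; the ensemble-as-uncertainty choice and all names here are assumptions, not a specific proposal.

```python
import torch

def conservative_choice(candidates, utility_ensemble, k: float = 1.0):
    """Pick an action by a pessimistic score instead of a pure max.

    `candidates` is a list of action representations (tensors) and `utility_ensemble`
    a list of independently trained utility models; k sets how conservative to be.
    """
    scores = []
    for action in candidates:
        estimates = torch.stack([u(action) for u in utility_ensemble])  # one estimate per model
        scores.append(estimates.mean() - k * estimates.std())           # penalise disagreement
    return max(range(len(candidates)), key=lambda i: scores[i])
```

The k parameter is the knob that can be loosened over time as the utility approximator earns trust; pure maximisation is the k = 0 limit.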
I broadly agree with you that agency and consequentialism are broadly useful, and that ultimately we won’t just be creating short-term myopic tool agents but fully long-term consequentialists. I think the key thing here is just to understand that long-term consequentialism has fundamental computational costs over short-term consequentialism, and much more challenging credit-assignment dynamics, so it will only be used where it is actually needed. Most systems will not be long-term consequentialists because it is unnecessary for them.
I also think that breeding animals to do tasks, or looking at humans subverting social institutions, is not necessarily a good analogy to AI agents performing deception and treacherous turns. Evolution endowed humans and other animals with intrinsic selfish drives for survival and reproduction, and arguably for social deception, which do not have to exist in AGIs. Moreover, we have substantially more control over AI cognition than evolution does over our cognition, and gradient descent is a fundamentally more powerful optimiser, which makes it harder for deceptive agents to arise. There is basically no evidence of deception occurring with current myopic AI systems, and if it starts to occur with long-term consequentialist agents it will be due either to a breakdown of credit assignment over long horizons (potentially from being forced to use worse optimisers such as REINFORCE variants rather than pure BPTT) or to the functional prior of such networks turning malign. Of course, if we directly design AI agents via survival in some evolutionary sim, or explicitly program in Omohundro drives, then we will run directly into these problems again.
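On the credit-assignment point, a toy way to see why score-function methods like REINFORCE have a harder time over long horizons than differentiating through the rollout (the pure-BPTT, pathwise gradient) is to compare gradient noise on a trivially simple problem. Everything below is a made-up toy (independent Gaussian actions, quadratic reward, no baseline), not a claim about any particular training setup.

```python
import torch

torch.manual_seed(0)
T = 50                                         # toy horizon length
theta = torch.zeros(T, requires_grad=True)     # one Gaussian action mean per timestep

def reinforce_loss():
    # Score-function (REINFORCE) surrogate, no baseline: weight log-probs by the whole return.
    with torch.no_grad():
        actions = theta + torch.randn(T)       # a_t ~ N(theta_t, 1)
        ret = -(actions - 1.0).pow(2).sum()    # reward: keep every action near 1
    logp = -0.5 * (actions - theta).pow(2).sum()   # log N(a_t; theta_t, 1) up to a constant
    return -(logp * ret)

def pathwise_loss():
    # Pathwise / BPTT-style estimator: differentiate the return through the sampled actions.
    actions = theta + torch.randn(T)
    return (actions - 1.0).pow(2).sum()

for loss_fn in (reinforce_loss, pathwise_loss):
    grads = []
    for _ in range(200):
        theta.grad = None
        loss_fn().backward()
        grads.append(theta.grad.clone())
    spread = torch.stack(grads).std(dim=0).mean().item()
    print(f"{loss_fn.__name__}: per-parameter gradient std ~ {spread:.2f}")
```

On this toy the REINFORCE gradient spread grows roughly with the horizon T while the pathwise spread does not, which is the sense in which credit assignment degrades when you cannot backpropagate through the trajectory (baselines and other variance-reduction tricks typically shrink rather than remove the gap).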
I’m defining “values” as what the approximate expected utility optimizers in the human brain want. Maybe “wants” is a better word. People falsify their preferences, and in those cases it seems more normative to go with the internal optimizer preferences.
Re indexicality, this is a “the AI knows but does not care” issue; it’s about specifying it, not about there being some AI module somewhere that “knows” it. If AGI were generated partially from humans understanding how to encode indexical goals, that would be a different situation.
Re treacherous turns, I agreed that myopic agents don’t have this issue to nearly the extent that long-term real-world optimizing agents do. It depends on how the AGI is selected. If it’s selected by “getting good performance according to a human evaluator in the real world”, then at some capability level AGIs that “want” that will be selected more.
Why do you expect it to be hard to specify, given a model that knows the information you’re looking for? In general, the core lesson of unsupervised learning is that often the best way to get pointers to something you have a limited specification for is to learn some other task that necessarily includes it, then specialize to that subtask. Why should values be any different? Broadly, why should values be harder to get good pointers to than much more complicated real-world tasks?
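A minimal sketch of that “learn a broader task, then specialize” pattern, with a stand-in recurrent encoder in place of a real pretrained model; the architecture, sizes, and data format here are placeholders rather than a recommendation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Phase 1 (assumed done elsewhere): train `encoder` on a broad objective, e.g. next-token
# prediction over a large corpus that necessarily contains the concept we care about.
encoder = nn.GRU(input_size=128, hidden_size=512, batch_first=True)   # stand-in for a pretrained model

# Phase 2: specialize to the subtask with a small labelled set and a tiny head.
probe = nn.Linear(512, 1)                       # the "pointer" to the concept
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def specialize_step(batch_inputs, batch_labels):
    """batch_inputs: (B, T, 128) token embeddings; batch_labels: (B,) 0/1 judgements."""
    with torch.no_grad():
        _, h = encoder(batch_inputs)            # reuse the representation learnt on the broad task
    logits = probe(h[-1]).squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(logits, batch_labels.float())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Whether the small head actually recovers the concept depends on whether the broad objective forced the representation to encode it, which is exactly the empirical bet this comment is making about values.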
How would you design a task that incentivizes a system to output its true estimates of human values? We don’t have ground truth for human values, because they’re mind states, not behaviors.
It seems easier to create incentives for things like “wash dishes without breaking them”, where you can just tell.
I think I can just tell a lot of stuff with respect to human values! How do you think children infer them? I think that in order for human values not to be viable to point to extensionally (i.e. by looking at a bunch of examples), you have to make the case that they’re much more built into the human brain than seems plausible for a species that can produce both Jains and (Genghis Khan era) Mongols.
I’d also note that “incentivize” is probably giving a lot of the game away here: my guess is that you can pull them out much more directly by gathering a large dataset of human preferences and predicting judgements.
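For concreteness, the standard way to predict judgements from pairwise preference data is a Bradley-Terry style loss; the `judge_model` below is a placeholder for whatever scores an encoded outcome, not a specific system.

```python
import torch.nn.functional as F

def preference_loss(judge_model, preferred, rejected):
    """Pairwise judgement-prediction loss over human preference data.

    `judge_model` maps a batch of (already encoded) outcomes to scalar scores; minimising
    this pushes the preferred item's score above the rejected item's.
    """
    score_preferred = judge_model(preferred)
    score_rejected = judge_model(rejected)
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```

Whether the resulting scorer captures “values” in the sense the replies below care about, rather than stated approval, is exactly the disagreement in the rest of this thread.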
If you define “human values” as “what humans would say about their values across situations”, then yes, predicting “human values” is a reasonable training objective. Those just aren’t really what we “want” as agents, and agentic humans would have motives not to let the future be controlled by an AI optimizing for human approval.
That’s also not how I defined human values; my definition is based on the assumption that the human brain contains one or more expected utility maximizers. It’s possible that the objectives of these maximizers are affected by socialization, but they’ll be less affected by socialization than verbal statements about values are, because they’re harder to fake and so less affected by preference falsification.
Children learn some sense of what they’re supposed to say about values, but have some pre-built sense of “what to do / aim for” that’s affected by evopsych and so on. It seems like there’s a huge semantic problem with talking about “values” in a way that’s ambiguous between “in-built evopsych-ish motives” and “things learned from culture about what to endorse”, but Yudkowsky, writing on complexity of value, is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term “values” rather than “preferences”.
In the section on subversion I made the case that terminal values make much more difference to subversive behavior than to compliant behavior.
It seems like, to get at the values of approximate utility maximizers located in the brain, you would need something like Goal Inference as Inverse Planning rather than just predicting behavior.
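A toy version of what inverse planning adds over behaviour prediction: model the actor as approximately (Boltzmann-)rational for each candidate goal and update a posterior over goals from observed actions. The data structures here (a precomputed `q_values` table and a `(state, action)` trajectory) are assumptions made for illustration.

```python
import numpy as np

def infer_goal_posterior(trajectory, goals, actions, q_values, beta=2.0):
    """Toy goal inference as inverse planning.

    q_values[g][(s, a)] : planner's action value for action a in state s under goal g (precomputed)
    trajectory          : observed list of (state, action) pairs
    beta                : rationality temperature; higher = actor assumed closer to optimal
    """
    log_post = np.full(len(goals), -np.log(len(goals)))   # uniform prior over candidate goals
    for s, a in trajectory:
        for i, g in enumerate(goals):
            qs = np.array([q_values[g][(s, b)] for b in actions])
            # Boltzmann-rational likelihood of the observed action under goal g
            log_post[i] += beta * q_values[g][(s, a)] - np.log(np.exp(beta * qs).sum())
    post = np.exp(log_post - log_post.max())
    return post / post.sum()                              # P(goal | observed behaviour)
```

The posterior concentrates on goals under which the observed actions would have been near-optimal, which is closer to “what the agent is aiming for” than a model that merely predicts the next action.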
Children learn some sense of what they’re supposed to say about values, but have some pre-built sense of “what to do / aim for” that’s affected by evopsych and so on. It seems like there’s a huge semantic problem with talking about “values” in a way that’s ambiguous between “in-built evopsych-ish motives” and “things learned from culture about what to endorse”, but Yudkowsky, writing on complexity of value, is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term “values” rather than “preferences”.
I think this is actually a crux here, in that I think Yudkowsky and the broader evopsych world were broadly incorrect about how complicated human values turned out to be, and way overestimated how much evolution was encoding priors and values in human brains. I think there was another, related error in underestimating how much data affects your goals and values, as in this example:
That’s also not how I defined human values; my definition is based on the assumption that the human brain contains one or more expected utility maximizers. It’s possible that the objectives of these maximizers are affected by socialization, but they’ll be less affected by socialization than verbal statements about values are, because they’re harder to fake and so less affected by preference falsification.
I think that socialization will deeply affect the objectives of the expected utility maximizers, and I generally think that we shouldn’t view socialization as training people to fake particular values, because I believe that data absolutely matters way more than evopsych and LWers thought, for both humans and AIs.
You mentioned you take evopsych as true in this post, so I’m not saying this is a bad post; in fact, it’s an excellent distillation that points out the core assumption behind a lot of doom models, and I strongly upvoted it. But I am saying that this assumption is almost certainly falsified for AIs, and probably significantly false for humans too.
More generally, I’m skeptical of the assumption that all humans have similar, or even not-that-different, values, and for this reason I dispute the assumption of the psychological unity of humankind.
Given this assumption, the human utility function(s) either do or don’t significantly depend on human evolutionary history. I’m just going to assume they do for now. I realize there is some disagreement about how important evopsych is for describing human values versus the attractors of universal learning machines, but I’m going to go with the evopsych branch for now.