This post argues that since 1. human values are necessary for alignment, 2. we are confused about human values, and 3. we couldn’t verify it if an AI system discovered the structure of human values, we need to do research to become less confused about human values. This research agenda aims to deconfuse human values by modeling them as the input to a decision process which produces behavior and preferences. The author’s best guess is that human values are captured by valence, as modeled by minimization of prediction error.
Planned opinion:
This is similar to the argument in <@Why we need a *theory* of human values@>, and my opinion remains roughly the same: I strongly agree that we are confused about human values, but I don’t see an understanding of human values as necessary for value alignment. We could hope to build AI systems in a way where we don’t need to specify the ultimate human values (or even a framework for learning them) before running the AI system. As an analogy, my friends and I are all confused about human values, but nonetheless I think they are more or less aligned with me (in the sense that if AI systems were like my friends but superintelligent, that sounds broadly fine).
I’ll push back on your opinion a little bit here as if it were just a regular LW comment on the post.
I strongly agree that we are confused about human values, but I don’t see an understanding of human values as necessary for value alignment. We could hope to build AI systems in a way where we don’t need to specify the ultimate human values (or even a framework for learning them) before running the AI system.
This is a reasonable hope, but I generally think hope is dangerous when it comes to existential risks, so I’m moved to pursue this line of research because I believe it to be neglected, I believe it’s likely enough to be useful to building aligned AI to be worth pursuing, and I would rather we had explored it thoroughly and ended up not needing it than not explored it and ended up needing it. I also don’t think it takes much away from other AI safety research, since the skills needed to work on this problem are somewhat different from those needed to address other AI safety problems (or so I think), so I mostly think we can pursue it at a fairly low opportunity cost.
As an analogy, my friends and I are all confused about human values, but nonetheless I think they are more or less aligned with me (in the sense that if AI systems were like my friends but superintelligent, that sounds broadly fine).
I expect we disagree on how robust Goodhart problems are. I would expect that if you felt more or less aligned with a superintelligent AI system the way you feel you are aligned with your friends, the AI system would optimize so hard that it would no longer be aligned, and that the level of alignment you are talking about only works because of a lack of optimization power. I suspect that at the level of measurement you’re talking about where you can infer alignment from observed behavior there is too much room for error between the measure and the target such that deviance is basically guaranteed.
Thankfully I know others are working on ways to engineer us around Goodhart problems, and maybe these solutions will be robust enough to work over such large measurement gaps, but again I am perhaps more conservative here and want to make the gap between the measure and the target much smaller so that we can effectively get “under” Goodhart effects for the targets we care about by measuring and modeling the processes that generate those targets rather than the targets themselves.
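To make the shape of this disagreement concrete, here is a toy sketch of one kind of Goodhart effect: selecting the best candidate according to a noisy measure. It is not from the thread. The Gaussian target and noise, the function name `selected_gap`, and all parameter values are assumptions chosen purely for illustration. It only shows that the gap between measure and target at the selected point grows with optimization pressure and shrinks as the measurement noise shrinks; it says nothing about whether real value learning behaves this way.

```python
import random
import statistics


def selected_gap(n_candidates, noise_scale, trials=2000, seed=0):
    """Average (measure - target) for the candidate with the highest measure.

    Hypothetical toy model: each candidate has a true target value drawn
    from N(0, 1), and we only observe measure = target + N(0, noise_scale).
    n_candidates stands in for optimization power: the more candidates we
    search over, the harder we are optimizing the measure.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        best_measure = float("-inf")
        best_target = 0.0
        for _ in range(n_candidates):
            target = rng.gauss(0.0, 1.0)
            measure = target + rng.gauss(0.0, noise_scale)
            if measure > best_measure:
                best_measure, best_target = measure, target
        # How much the measure overstates the target for the chosen candidate.
        gaps.append(best_measure - best_target)
    return statistics.mean(gaps)


if __name__ == "__main__":
    for noise_scale in (1.0, 0.1):
        for n_candidates in (1, 10, 100):
            gap = selected_gap(n_candidates, noise_scale)
            print(f"noise={noise_scale} candidates={n_candidates} gap={gap:.3f}")
```

In this toy model the selected candidate's true target value still rises as more candidates are searched; what grows is the amount by which the measure overstates it. Whether that kind of overestimate stays benign or becomes catastrophic under strong optimization is exactly the point under dispute in this thread.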
When I say “hope”, I mean “it is reasonably likely that the research we do pans out and leads to a knowably-aligned AI system”, not “we will look at the AI system’s behavior, pull a risk estimate out of nowhere, and then proceed to deploy it anyway”.
In this sense, literally all AI risk research is based on hope, since no existing AI risk research knowably will lead to us building an aligned AI system.
I’m moved to pursue this line of research because I believe it to be neglected, I believe it’s likely enough to be useful to building aligned AI to be worth pursuing, and I would rather we had explored it thoroughly and ended up not needing it than not explored it and ended up needing it.
This is all reasonable; most of it can be said about most AI risk research. The main distinguishing feature between different kinds of technical AI risk research is:
it’s likely enough to be useful to building aligned AI to be worth pursuing
So that’s the part you’d have to argue for to convince me (but also it would be reasonable not to bother).
I would expect that if you felt more or less aligned with a superintelligent AI system the way you feel you are aligned with your friends, the AI system would optimize so hard that it would no longer be aligned
Suppose one of your friends became 10x more intelligent, or got a superpower where they could choose at will to stop time for everything except themselves and a laptop (that magically still has Internet access). Is this a net positive change to the world, or a net negative one?
Perhaps you think AI systems will be different in kind to your friends, in which case see next point.
I suspect that at the level of measurement you’re talking about where you can infer alignment from observed behavior there is too much room for error between the measure and the target such that deviance is basically guaranteed.
Wait, I infer alignment from way more than just observed behavior. In the case of my friends, I have a model of how humans work in general, informed both by theory (e.g. evolutionary psychology) and empirical evidence (e.g. reasoning about how I would do X, and projecting it onto them). In the case of AI systems, I would want similar additional information beyond just their behavior, e.g. an understanding of what their training process incentivizes, running counterfactual queries on them early in training when they are still relatively unintelligent and I can understand them, etc.
I am perhaps more conservative here and want to make the gap between the measure and the target much smaller so that we can effectively get “under” Goodhart effects for the targets we care about by measuring and modeling the processes that generate those targets rather than the targets themselves.
It’s not obvious to me that modeling the generators of a thing is easier than modeling the thing. E.g. it’s much easier for me to model humans than to model evolution.
Suppose one of your friends became 10x more intelligent, or got a superpower where they could choose at will to stop time for everything except themselves and a laptop (that magically still has Internet access). Is this a net positive change to the world, or a net negative one?
I expect it to be net negative. My model is something like this: humans are not very agentic (able to reliably achieve/optimize for a goal) in absolute terms, even though we may feel as though humans are especially agentic relative to other systems. Because humans bumble a lot, they don’t tend to have a lot of impact, and things work out well or poorly on average as a result of lots of moves that cancel each other out and leave only a small gain or loss in valued outcomes in the end. A 10x smarter human would be more agentic, and if they are not exactly right about how to do good, they could more easily do harm that would normally be buffered by their ineffectiveness.
I build this intuition from, for example, the way dictators often screw things up even when they are well intentioned because they now have more power to achieve their goals and it amplifies their mistakes and misunderstandings in ways that cause more impact, more variance, and historically worse outcomes than less agentic methods of leadership.
Although this is not a perfect analogy, because 10x smarter is not just 10x more powerful/agentic but also 10x better able to think through consequences (which the dictators lack), I also think the orthogonality thesis is robust enough that it seems more likely to me that being 10x smarter will not bring a matching ability to think through consequences that perfectly offsets the risks of greater agency.
Wait, I infer alignment from way more than just observed behavior. In the case of my friends, I have a model of how humans work in general, informed both by theory (e.g. evolutionary psychology) and empirical evidence (e.g. reasoning about how I would do X, and projecting it onto them). In the case of AI systems, I would want similar additional information beyond just their behavior, e.g. an understanding of what their training process incentivizes, running counterfactual queries on them early in training when they are still relatively unintelligent and I can understand them, etc.
Exactly, because you can’t infer alignment from observed behavior without normative assumptions. I’m saying even with all that (or especially with all of that), the measurement gap is large and we should expect high deviance from the target that will readily lead to Goodharting.
It’s not obvious to me that modeling the generators of a thing is easier than modeling the thing. E.g. it’s much easier for me to model humans than to model evolution.
It’s definitely harder. That’s a reasonable consideration when we’re trying to engineer a system that will be good enough while racing against the clock, and I think it’s quite reasonable, for example, that we’re going to try to tackle value alignment via extensions to narrow value learning approaches first because that’s easier to build. But I also think those approaches will fail and so I’m looking ahead to where I see the limits of our knowledge for what we’ll have to do conditioned on this bet I’m making that value learning approaches similar in kind to those we’re trying now won’t produce aligned AIs.
I’d be interested in specific examples of well-intentioned dictators that screwed things up (though I anticipate my objections will be that 1. they weren’t well-intentioned or 2. they didn’t have the power to actually impose decisions centrally, and had to spend most of their power ensuring that they remained in power).
I’m saying even with all that (or especially with all of that), the measurement gap is large and we should expect high deviance from the target that will readily lead to Goodharting.
I know you’re saying that, I just don’t see many arguments for it. From my perspective, you are asserting that Goodhart problems are robust, rather than arguing for it. That’s fine, you can just call it an intuition you have, but to the extent you want to change my mind, restating it in different words is not very likely to work.
It’s definitely harder.
This is an assertion, not an argument.
Do you really believe that you can predict facts about humans better just by reasoning about evolution (and using no information you’ve learned by looking at humans), relative to building a model by looking at humans (and using no information you’ve learned from the theory of evolution)? I suspect you actually mean some other thing, but idk what.
I’d be interested in specific examples of well-intentioned dictators that screwed things up (though I anticipate my objections will be that 1. they weren’t well-intentioned or 2. they didn’t have the power to actually impose decisions centrally, and had to spend most of their power ensuring that they remained in power).
Some examples of actions taken by dictators that I think were well intentioned and meant to further goals that seemed laudable and not about power grabbing to the dictator but had net negative outcomes for the people involved and the world:
Joseph Stalin’s collectivization of farms
Tokugawa Iemitsu’s closing off of Japan
Hugo Chávez’s nationalization of many industries
I know you’re saying that, I just don’t see many arguments for it. From my perspective, you are asserting that Goodhart problems are robust, rather than arguing for it. That’s fine, you can just call it an intuition you have, but to the extent you want to change my mind, restating it in different words is not very likely to work.
Do you really believe that you can predict facts about humans better just by reasoning about evolution (and using no information you’ve learned by looking at humans), relative to building a model by looking at humans (and using no information you’ve learned from the theory of evolution)? I suspect you actually mean some other thing, but idk what.
No, it’s not my goal that we not look at humans. I instead think we’re currently too focused on trying to figure out everything from only the kinds of evidence we can easily collect today, and that we also don’t have detailed enough models to know what other evidence is likely relevant. I think understanding whatever is going on with values is hard because there is relevant data further “down the stack”, if you will, from observations of behavior. I think that because of issues like latent preferences, which by definition exist because we didn’t have enough data to infer their existence, but which need not remain latent if we gather more data about how they are generated, such that we could discover them in advance by looking earlier in the process that generates them.
Some examples of actions taken by dictators that I think were well intentioned and meant to further goals that seemed laudable and not about power grabbing to the dictator but had net negative outcomes for the people involved and the world:
What’s your model for why those actions weren’t undone?
To pop back up to the original question—if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it’s only good to make them 2x smarter, but after that more marginal intelligence is bad?
It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we’re at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let’s suppose that we increase/decrease all of your friends and your intelligence by a factor of X. For what range of X would you expect this intervention is net positive?
(I’m aware that intelligence is not one-dimensional, but I feel like this is still a mostly meaningful question.)
Just to be clear about my own position, a well intentioned superintelligent AI system totally could make mistakes. However, it seems pretty unlikely that they’d be of the existentially-catastrophic kind. Also, the mistake could be net negative, but the AI system overall should be net positive.
What’s your model for why those actions weren’t undone?
Not quite sure what you’re asking here. In the first two cases they eventually were undone after people got fed up with the situation; the last is recent enough that I don’t consider its not having already been undone as evidence that people like it, only that they don’t have the power to change it. My view is that these changes stayed in place because the dictators and their successors continued to believe the good outweighed the harm, either when this was clearly contrary to the ground truth but served some narrow purpose that was viewed as more important, or when the ground truth was too hard to discover at the time and we only believe it was net harmful through the lens of historical analysis.
To pop back up to the original question—if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it’s only good to make them 2x smarter, but after that more marginal intelligence is bad?
It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we’re at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let’s suppose that we increase/decrease all of your friends and your intelligence by a factor of X. For what range of X would you expect this intervention is net positive?
I’m not claiming we’re at some optimal level of intelligence for any particular purpose, only that more intelligence leads to greater agency, which, in the absence of sufficient mechanisms to constrain actions to beneficial ones, results in greater risk of negative outcomes due to things like deviance and unilateral action. Thus I do in fact think we’d be safer from ourselves if we were dumber, screening off the existential risks humanity faces from outside threats like asteroids.
By comparison, chimpanzees are some factor dumber than us and may not live what look to us like very happy lives, but they also aren’t at risk of making themselves extinct because one chimp really wanted a lot of bananas.
I’m not sure how much smarter we could all get without putting us at too much risk. I think there’s an anthropic argument to be made that we are below whatever level of intelligence is dangerous to ourselves without greater safeguards, because we wouldn’t exist in such universes due to having killed ourselves. But I feel like I have little evidence to make a judgement about how much smarter is safe, given that, for example, being, say, 95th-percentile smart didn’t stop people from building things like atomic weapons or developing dangerous chemical applications. I would expect making my friends smarter to risk similarly bad outcomes. Making them dumber seems safer, especially when I’m in the frame of thinking about AGI.
Planned summary for the Alignment Newsletter:
Planned opinion:
Yep, agree with the summary.
Man, I do not share that intuition.
I’ve made my case for that here.