Interesting. It currently seems to me like collective intent alignment (which I think is what I’m calling value alignment? more below) is way harder than personal intent alignment. So I’m curious where our thinking differs.
I think people are going to want instruction following, not inferring intent from other sources, because they won’t trust the AGI to accurately infer intent that’s not explicitly stated.
I know that’s long been considered terribly dangerous; if you tell your AGI to prevent cancer, it will kill all the humans who could get cancer (or other literal-genie hijinks). I think those fears are not realistic with an early-stage AGI in a slow takeoff. With an AGI not too far from human level, it would take time to do something big like cure cancer, so you’d want to have a conversation about how it understands the goal and what methods it will use before letting it use time and resources to research and make plans, and again before those plans are executed (and probably many times in the middle). And even LLMs infer intent from instructions pretty well; they know that curing cancer means not killing the host.
In that same slow-takeoff scenario, which seems likely to me, concerns about the AGI getting complex inference wrong are much more realistic. Humanity doesn’t know what its own intent is, so the machine would have to be quite competent to deduce it correctly, and the first AGIs seem unlikely to be that smart at launch. The critical piece is that short-term personal intent includes an explicit instruction for the AGI to shut down for re-alignment. Humanity’s intent will never be that specific, except in cases where it’s obvious that an action would be disastrous; and if the AGI understands that, it wouldn’t take that action anyway. So it seems to me that personal intent alignment allows a much better chance of adjusting imperfect alignment.
I discuss this more in Instruction-following AGI is easier and more likely than value aligned AGI.
With regard to your “collective intent alignment”, would that be the same thing I’m calling value alignment? I don’t think humanity has a real collective short-term intent on most matters; people want very different things. They’d agree on not having humanity made extinct, but beyond that, opinions on what people would like to see happen vary broadly. So any collective intent would seem to be longer term, and so vague as to be equivalent to values (people don’t know what they want in the long term specifically, but they have values that allow them to recognize futures they like or don’t like).
Anyway, I’m curious where and why your view differs. I take this question to be critically important, so working out which view is right seems well worth the effort.
Although I guess the counterargument to its importance is: even if collective alignment was just as easy, the people building AGI would probably align it to their interests instead, just because they like their value/intent more than others’.
IMO: if an AI can trade off between different wants/values of one person, it can do so between multiple people also.
This applies to simple surface wants as well as deep values.
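A toy sketch of that point (the numeric utilities, the names, and the scalarize function are purely my illustrative assumptions, not anything from the comments above): the same weighted trade-off that handles one person’s competing wants can be fed pooled entries from several people without changing the machinery at all.

```python
# Toy sketch: trading off "wants" is the same operation whether the
# (weight, utility) pairs come from one person or from many people.
# Representing wants as numeric utilities is an illustrative assumption.

def scalarize(option, weighted_utilities):
    """Score an option from (weight, utility_fn) pairs, regardless of whose they are."""
    return sum(weight * utility(option) for weight, utility in weighted_utilities)

# One person's competing wants about a hypothetical action "plan_a":
alice_wants = [
    (0.7, lambda o: 1.0 if o == "plan_a" else 0.0),  # strong preference for plan_a
    (0.3, lambda o: 0.2),                            # weak background concern
]

# Several people's wants, pooled into the same structure:
group_wants = alice_wants + [
    (1.0, lambda o: 0.5 if o == "plan_a" else 0.9),  # a second (hypothetical) person
]

print(scalarize("plan_a", alice_wants))   # trade-off within one person  -> 0.76
print(scalarize("plan_a", group_wants))   # same code, across people     -> 1.26
```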
I had trouble figuring out how to respond to this comment at the time because I couldn’t figure out what you meant by “value alignment” despite reading your linked post. After reading your latest post, Conflating value alignment and intent alignment is causing confusion, I still don’t know exactly what you mean by “value alignment”, but I can at least respond.
What I mean is:
If you start with an intent-aligned AI following the most surface-level desires/commands, you will want to make it safer and more useful by giving it common sense, “do what I mean”, etc. As long as you, at the surface level, want it to understand and follow your meta-level desires, it can step up that ladder, and so on.
If you have a definition of “value alignment” that is different from what you get from this process, then I currently don’t think that it is likely to be better than the alignment from the above process.
In the context of collective intent alignment:
If you have an AI that only follows commands, with no common sense etc., and it’s powerful enough to take over, you die. I’m pretty sure some really bad stuff is likely to happen even if you have some “standing orders”. So, I’m assuming people would actually deploy only an AI that has some understanding of what the person(s) it’s aligned with wants, beyond the mere text of a command (though not necessarily super-sophisticated). But once you have that, you can aggregate how much different humans want things, which gets you collective intent alignment.
I’m aware people want different things, but don’t think it’s a big problem from a technical (as opposed to social) perspective—you can ask how much people want the different things. Ambiguity in how to aggregate is unlikely to cause disaster, even if people will care about it a lot socially. Self-modification will cause a convergence here, to potentially different attractors depending on the starting position. Still unlikely to cause disaster. The AI will understand what people actually want from discussions with only a subset of the world’s population, which I also see as unlikely to cause disaster, even if people care about it socially.
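To make the “ambiguity in how to aggregate” point concrete, here is a minimal sketch (option names, per-person scores, and the particular rules are all made up for illustration): several standard aggregation rules can be swapped in, and they may rank options somewhat differently, but the machinery is the same either way.

```python
from statistics import mean, median

# Hypothetical per-person scores for two options, as if elicited by asking
# "how much do you want this?" (names and numbers invented for illustration).
scores = {
    "option_x": [0.9, 0.4, 0.6],
    "option_y": [0.5, 0.7, 0.55],
}

# A few reasonable aggregation rules; they can rank the options differently,
# which is the "ambiguity in how to aggregate" mentioned above.
rules = {
    "average": mean,
    "median": median,
    "worst-off (min)": min,
}

for name, rule in rules.items():
    ranking = sorted(scores, key=lambda opt: rule(scores[opt]), reverse=True)
    print(f"{name}: {ranking}")
```

The point of the sketch is not that any one of these rules is the right one, only that choosing among them is a design decision rather than an open technical problem.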
From a social perspective, obviously a person or group who creates an AI may be tempted to create alignment to themselves only. I just don’t think collective alignment is significantly harder from a technical perspective.
“Standing orders” may be desirable initially as a sort of training wheels even with collective intent, and yes that could cause controversy as they’re likely not to originate from humanity collectively.