I don’t think intent aligned AI has to be aligned to an individual—it can also be intent aligned to humanity collectively.
One thing I used to be concerned about is that collective intent alignment would be way harder than individual intent alignment, giving someone a valid excuse to steer an AI toward their own personal intent. I no longer think this is the case. Most issues with collective intent seem likely to also affect individual intent (e.g. literal instruction following vs. extrapolation). I see two big issues that might make collective intent harder than individual intent: one is biased information about people’s intents, and the other is the difficulty of weighting different people’s intents. On reflection, though, I see both as non-catastrophic, and an imperfect solution to them as likely better for humanity as a whole than following one person’s individual intent.
Interesting. It currently seems to me like collective intent alignment (which I think is what I’m calling value alignment? more below) is way harder than personal intent alignment. So I’m curious where our thinking differs.
I think people are going to want instruction following, not intent inference from other sources, because they won’t trust the AGI to accurately infer intent that’s not explicitly stated.
I know that’s long been considered terribly dangerous: if you tell your AGI to prevent cancer, it will kill all the humans who could get cancer (or pull other literal-genie hijinks). I think those fears are not realistic with an early-stage AGI in a slow takeoff. With an AGI not too far from human level, it would take time to do something big like cure cancer, so you’d want to have a conversation about how it understands the goal and what methods it will use before letting it spend time and resources to research and make plans, and again before those plans are executed (and probably many times in the middle). And even LLMs infer intent from instructions pretty well; they know that “cure cancer” means not killing the host.
In that same slow-takeoff scenario, which seems likely, concerns about the AGI getting complex inference wrong are much more realistic. Humanity doesn’t know what its own intent is, so the machine would have to be quite competent to deduce it correctly, and the first AGIs seem unlikely to be that smart at launch. The critical piece is that short-term personal intent can include an explicit instruction for the AGI to shut down for re-alignment; humanity’s collective intent will never be that specific, except in cases where it’s obvious that an action would be disastrous, and if the AGI understood that, it wouldn’t take the action anyway. So it seems to me that personal intent alignment allows a much better chance of adjusting imperfect alignment.
I discuss this more in “Instruction-following AGI is easier and more likely than value aligned AGI”.
With regard to your “collective intent alignment”: would that be the same thing I’m calling value alignment? I don’t think humanity has a real collective short-term intent on most matters; people want very different things. They’d agree on not having humanity driven extinct, but beyond that, opinions on what people would like to see happen vary broadly. So any collective intent would seem to be longer-term, and so vague as to be equivalent to values (people don’t know specifically what they want in the long term, but they have values that allow them to recognize futures they like or don’t like).
Anyway, I’m curious where and why your view differs. I take this question to be critically important, so working out who’s right seems well worth the effort.
Although I guess the counterargument to its importance is: even if collective alignment were just as easy, the people building AGI would probably align it to their own interests instead, just because they like their values/intent more than others’.