I had trouble figuring out how to respond to this comment at the time because I couldn’t figure out what you meant by “value alignment” despite reading your linked post. After reading your latest post, “Conflating value alignment and intent alignment is causing confusion”, I still don’t know exactly what you mean by “value alignment”, but at least I can respond.
What I mean is:
If you start with an intent-aligned AI that follows only the most surface-level desires/commands, you will want to make it safer and more useful by adding common sense, “do what I mean”, etc. As long as you surface-level want it to understand and follow your meta-level desires, it can step up that ladder.
If you have a definition of “value alignment” that is different from what this process produces, then I currently don’t think it is likely to be better than the alignment you get from the process above.
In the context of collective intent alignment:
If you have an AI that only follows commands, with no common sense etc., and it’s powerful enough to take over, you die. I’m pretty sure some really bad stuff is likely to happen even if you have some “standing orders”. So I’m assuming people would only actually deploy an AI that has some understanding of what the person(s) it’s aligned with wants, beyond the mere text of a command (though not necessarily a super-sophisticated understanding). But once you have that, you can aggregate how much different people want different things, which gives you collective intent alignment.
I’m aware that people want different things, but I don’t think that’s a big problem from a technical (as opposed to social) perspective: you can ask how much people want the different things and aggregate. Ambiguity in how to aggregate is unlikely to cause disaster, even if people will care about it a lot socially. Self-modification will cause convergence here, potentially to different attractors depending on the starting position; that is still unlikely to cause disaster. The AI will form its understanding of what people actually want from discussions with only a subset of the world’s population, which I also see as unlikely to cause disaster, even if people care about it socially.
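To make the aggregation point concrete, here is a toy sketch (entirely illustrative: the ratings are made up, and mean/median are just two example aggregation rules, not a proposal). It elicits how much each person wants each option and shows two reasonable rules picking the same option in a clear-cut case, which is the sense in which I expect ambiguity in how to aggregate to be a social argument rather than a technical showstopper.

```python
# Toy illustration only: aggregate stated preference strengths across people
# and compare two reasonable aggregation rules.
from statistics import mean, median

# Hypothetical data: each person rates how much they want each option, 0-10.
ratings = {
    "option_A": [7, 8, 6, 9, 5],
    "option_B": [4, 6, 5, 3, 7],
    "option_C": [2, 3, 8, 4, 1],
}

def pick(aggregate):
    """Return the option that scores highest under a given aggregation rule."""
    return max(ratings, key=lambda opt: aggregate(ratings[opt]))

# Two different-but-sensible rules agree on the clear case.
print(pick(mean))    # option_A
print(pick(median))  # option_A
```

Cases where different rules disagree do exist, of course, but those are exactly the close calls where neither outcome looks like a disaster.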
From a social perspective, obviously a person or group who creates an AI may be tempted to align it to themselves only. I just don’t think collective alignment is significantly harder from a technical perspective.
“Standing orders” may be desirable initially, as a kind of training wheels, even with collective intent, and yes, that could cause controversy, since they’re likely not to originate from humanity collectively.