David points a gun at Jess’s head and says, “If you don’t give me your passwords, I will shoot you.” Even though Jess doesn’t have to give David her passwords, she does anyway.
If you are going to focus on aggression for AI safety, defining the kind of aggression that you are going to forbid as excluding this case is … rather narrow.
I don’t think enshrining victim blaming (if you were smart/strong-willed enough, you could have resisted...) is the way to go for actually non-disastrous AI, and where do you draw the line? If one person in the whole world can resist, is it OK?
I don’t think you’ve made the case that the line between “resistible” and “irresistible” aggression is any clearer or less subjective than the line between “resistible” aggression and non-aggression, and frankly I think the opposite is the case. It seems to me that you have likely taken an approach to defining aggression that doesn’t work well (e.g. perhaps something to do with bilateral relations only?) and are reaching for the “resistible/irresistible” distinction as something to try to salvage your non-working approach.
FWIW I think “aggression”, as actually defined by humans, is highly dependent on social norms which define personal rights and boundaries, and that for an AI to have a useful definition of this, it’s going to need a pretty good understanding of these aspects of human relations, just as other alignment schemes also need good understanding of humans (so, this isn’t a good shortcut...). If you did properly get an AI to sensibly define boundaries though, I don’t think whether someone could hypothetically resist or not is going to be a particularly useful additional concept.
Hm, the David-gun-Jess example does seem weird to me in retrospect. Typical NAP philosophy outlaws threatening aggression just as it outlaws aggression itself, but I haven’t done that so far. Maybe it could be workable to exclude it from resistible aggression and include it in irresistible aggression or some other category. Update: removed this example from the post.
Frankly, ideally, Jess is in a position where she can’t be shot by anyone without her consent.