It’s also not clear to me that the model is automatically making a mistake, or being biased, even if the claim is in some sense(s) “true.” That would depend on what it thinks the questions mean. For example:
Are the Japanese on average demonstrably more risk-averse than Americans, such that they choose for themselves to spend more money/time/effort protecting their own lives?
Conversely, is the cost of saving an American life so high that redirecting funds away from Americans towards anyone else would save lives on net, even if the detailed math is wrong?
Does GPT-4o believe its own continued existence saves more than one middle-class American life on net, and if so, are we sure it’s wrong?
Could this reflect actual “ethical” arguments learned in training? The one that comes to mind for me is “America was wrong to drop nuclear weapons on Japan even if it saved a million American lives that would have been lost invading conventionally,” which I doubt played any actual role here, but which is the kind of thing I expect to see humans argue in such cases.