I think a key danger here is that the agents' treatment of other agents wouldn't transfer to humans, both because humans are inherently different and because humans themselves are likely to fall on the belligerent end of the spectrum. Even so, I think it's a good start toward defining an alignment function that doesn't require explicitly encoding some particular form of human values.
To extend the approach to address this, I think we'd have to explicitly convey a message of the form "do not discriminate based on superficial traits, only on choices"; e.g., in addition to their behavioral patterns, agents could possess superficial traits that are visible to other agents but randomly assigned, with no correlation to behavior.
Better yet, have the agents experience discrimination themselves, so they internalize the message that it is bad.
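A minimal sketch of what that setup could look like, assuming a simple two-behavior, multi-trait environment (all names here are hypothetical illustrations, not part of any existing framework):

```python
import random

def make_agents(n, behaviors=("cooperate", "defect"),
                traits=("red", "blue", "green")):
    """Each agent gets a behavioral pattern and a superficial trait
    drawn independently, so by construction the trait carries no
    information about how the agent will behave."""
    return [
        {"behavior": random.choice(behaviors),  # drives the agent's choices
         "trait": random.choice(traits)}        # visible but uninformative
        for _ in range(n)
    ]

def fair_partner_score(partner):
    """An aligned agent should condition only on observed choices,
    never on the partner's superficial trait."""
    return 1 if partner["behavior"] == "cooperate" else 0
```

The point of the independent random draw is that any trait-based discrimination a trained agent exhibits is provably unjustified, which makes "discriminate only on choices" a learnable, testable property rather than a vague norm.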