I think the intersection of ethics / moral philosophy / normative descriptions of the current ethics of specific subcultures / etc. is a particularly important topic. I'm also excited that it seems to involve a very different skill set than other important tasks in AI alignment, like interpretability work. I've read some great work that's already been done on ethics alignment, and I'm sure there's more I don't know of, since this isn't my specific area of focus. I think it'd be a valuable contribution to the alignment community for someone to do a thorough literature review of ethics alignment.
Example paper: Aligning AI With Shared Human Values
Thanks for the pointer to the paper, saved for later! I think this task of crafting machine-readable representations of human values is a thorny step in any CEV-like/value-loading proposal that doesn't involve the AI inferring the values itself, IRL-style.
I was considering sifting through the literature to form a model of the ways people have tried to do this, in an abstract sense. Some approaches aim at a fixed normative framework. Others start from an uncertain seed that gets collapsed to a single most-likely framework. Still others extrapolate from a fixed starting point to an uncertain distribution over the places an initial framework might drift towards. Does this ring a bell for any other references?
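To make those three shapes concrete, here's a minimal toy sketch in Python. Everything in it is a placeholder of my own (the Framework representation, the weight names, the random walk standing in for "drift"), not something drawn from any of the proposals or papers above:

```python
from dataclasses import dataclass
import random

# Toy "machine-readable representation of human values": a named normative
# framework with weights on a few moral considerations. Purely illustrative.
@dataclass
class Framework:
    name: str
    weights: dict[str, float]  # e.g. {"wellbeing": 0.7, "fairness": 0.3}

# (1) Fixed framework: commit to one representation up front.
fixed = Framework("toy_utilitarianism", {"wellbeing": 1.0})

# (2) Uncertain seed collapsed to a likely framework: start with a credence
# distribution over candidate frameworks and keep only the modal one.
def collapse(seed: list[tuple[Framework, float]]) -> Framework:
    framework, _credence = max(seed, key=lambda pair: pair[1])
    return framework

# (3) Fixed start extrapolated outward: map an initial framework to an
# empirical distribution over the frameworks it might drift into. A random
# walk on the weights stands in for whatever reflection process one means.
def extrapolate(start: Framework, steps: int = 10, samples: int = 100) -> list[Framework]:
    drifted = []
    for i in range(samples):
        w = dict(start.weights)
        for _ in range(steps):
            k = random.choice(list(w))
            w[k] = max(0.0, w[k] + random.uniform(-0.1, 0.1))
        drifted.append(Framework(f"{start.name}_drift_{i}", w))
    return drifted
```

The point is just the type signatures: approach (1) is a single Framework, (2) maps a credence distribution over frameworks to one Framework, and (3) maps one Framework to a distribution over Frameworks.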