I have now read the paper, and still think you did a great job.
One gripe I have is with this framing:
We believe our articulation of human values as constitutive attentional policies is much closer to “what we really care about”, and is thus less prone to over-optimization
If you were to heavily optimize for text that humans would rate highly on specific values, you would run into the usual problems (e.g. the model being incentivized to manipulate the human). Your success here doesn't come from the formulation of the values per se, but rather from the architecture that turns them into text/actions: rather than optimizing for them directly, you can prompt an LLM that's anchored on normal human text to mildly optimize them for you.
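To make the contrast concrete, here is a minimal sketch of the two regimes. Everything in it is hypothetical: `llm_complete` and `score_by_rater` are stand-ins for whatever model calls you'd actually use, not APIs from the paper or any library. The point is just that the first path searches over outputs for whatever a rating signal scores highest, while the second asks an already-human-anchored model to lightly apply the value.

```python
# Hypothetical sketch contrasting the two regimes. The names below
# (score_by_rater, llm_complete) are illustrative stand-ins only.

def score_by_rater(text: str, value: str) -> float:
    """Stand-in for a learned model of how humans would rate `text` on `value`."""
    return float(len(text))  # placeholder scoring logic

def llm_complete(prompt: str) -> str:
    """Stand-in for a call to an LLM anchored on normal human text."""
    return "a mild, human-sounding response to: " + prompt  # placeholder

def optimize_for_ratings(value: str, candidates: list[str]) -> str:
    # Regime 1: search hard for whatever text the rating signal scores highest.
    # Pushed far enough, this invites the usual problems (e.g. text that
    # manipulates the rater rather than serving the value).
    return max(candidates, key=lambda text: score_by_rater(text, value))

def prompt_anchored_llm(value: str, situation: str) -> str:
    # Regime 2: no search loop; just ask the anchored LLM to mildly
    # attend to the value while staying close to ordinary human text.
    prompt = (
        f"Situation: {situation}\n"
        f"Respond while attending to this value: {value}"
    )
    return llm_complete(prompt)
```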
This difference implies some important points about scaling to more intelligent systems (even without making any big pivots):
- We don’t want the model to optimize for the stated values unboundedly hard, so we’ll eventually have to ask for something mild and human-anchored more explicitly.
- If another use of AI is proposing changes to the moral graph, we don’t want that process to form an optimization feedback loop (unless we’re really sure).
The main difference made by the choice of format for the values is where it draws the boundary between legible human deliberation and illegible LLM common sense.
I’m excited for future projects that are sort of in this vein but try to tackle moral conflict, or that use continuous rather than discrete prompts which can interpolate between values, or that explore different sorts of training for the illegible-common-sense part, or any of a dozen other things.
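On the continuous-prompt idea, here is a rough sketch of what interpolating between values might look like, assuming each value has been turned into a soft prompt (a small matrix of embedding vectors prepended to the model input). The names, shapes, and numbers are purely illustrative, not anything from the paper.

```python
import numpy as np

# Purely illustrative: suppose each value has a "soft prompt" representation,
# i.e. num_prompt_tokens x embedding_dim vectors fed to the model instead of
# discrete text. These are made-up shapes and random placeholder values.
rng = np.random.default_rng(0)
value_a_prompt = rng.normal(size=(8, 512))   # e.g. one attentional policy
value_b_prompt = rng.normal(size=(8, 512))   # e.g. another

def interpolate_prompts(prompt_a: np.ndarray, prompt_b: np.ndarray, alpha: float) -> np.ndarray:
    """Linear interpolation in embedding space: alpha=0 gives prompt_a,
    alpha=1 gives prompt_b, values in between blend the two."""
    return (1.0 - alpha) * prompt_a + alpha * prompt_b

# A blend leaning 30% toward value B, which could then be fed to a model that
# accepts soft prompts; whether such blends behave sensibly is an open question.
blended = interpolate_prompts(value_a_prompt, value_b_prompt, alpha=0.3)
print(blended.shape)  # (8, 512)
```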