I agree the goal is under-specified. With regard to meta-preferences, and with some simplification, it seems we have several basic possibilities (contrasted in a toy sketch after the list):
1. Align with the result of the internal aggregation (e.g. observe what the corporation does)
2. Align with the result of the internal aggregation by asking (e.g. ask the corporation via some official channel and let the sub-agents sort it out internally)
3. Learn about the sub-agents and try to incorporate their values (e.g. learn about the humans in the corporation)
4. Add layers of indirection, e.g. asking about meta-preferences
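To make the contrast concrete, here is a minimal toy sketch (my own illustration, not part of the original discussion): a "corporation" of weighted sub-agents, where aligning with the aggregate output (options 1 and 2) and aligning with the sub-agents' own values (option 3) can point at different targets. All of the names, outcomes, weights, and utilities below are invented for the example.

```python
# Toy sketch: a "corporation" of sub-agents with different values and different
# amounts of influence over the internal aggregation. Everything here
# (SubAgent, the outcomes, the numbers) is hypothetical, purely for illustration.
from dataclasses import dataclass

OUTCOMES = ["profit", "research", "leisure"]

@dataclass
class SubAgent:
    name: str
    utility: dict    # value this sub-agent puts on each outcome
    weight: float    # influence inside the internal aggregation

corporation = [
    SubAgent("CEO",      {"profit": 1.0, "research": 0.3, "leisure": 0.0}, 3.0),
    SubAgent("engineer", {"profit": 0.2, "research": 1.0, "leisure": 0.5}, 1.0),
    SubAgent("intern",   {"profit": 0.0, "research": 0.4, "leisure": 1.0}, 0.5),
]

def aggregate_choice(agents):
    """Options 1 and 2: the result of the internal aggregation, i.e. what the
    corporation actually does (or reports through an official channel)."""
    score = {o: sum(a.weight * a.utility[o] for a in agents) for o in OUTCOMES}
    return max(score, key=score.get)

def subagent_values(agents):
    """Option 3: learn the sub-agents' values directly, here as an unweighted
    average that ignores the internal power structure."""
    return {o: sum(a.utility[o] for a in agents) / len(agents) for o in OUTCOMES}

values = subagent_values(corporation)
print(aggregate_choice(corporation))   # 'profit': the weighted aggregate
print(max(values, key=values.get))     # 'research': the averaged sub-agent values
```

Option 4 would instead ask the corporation (or the humans in it) how this aggregation should be done in the first place, which is where the meta-preference issue below comes in.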
Unfortunately, I can imagine that in the case of humans, option 4 can lead to various stable reflective equilibria of preferences and meta-preferences. For example, I can imagine that, with suitable queries, you could get a human to want either
to be aligned with explicit reasoning, putting most value on some conscious, model-based part of the mind, on meta-reasoning about VNM axioms, etc., or
to be aligned with some "heart & soul", putting value on universal love, transcendent joy, and the many parts of the human mind which are not explicit,
where either of these options would be self-consistently aligned with the meta-preferences the human would express about how the sub-agent alignment should be done.
So even with meta-preferences, there are likely multiple ways the alignment could go.
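As a minimal toy sketch of how such multiple equilibria can arise (again my own illustration; the "framings" and the endorsement rule are invented for the example), here both the "explicit reasoning" stance and the "heart & soul" stance endorse themselves on reflection, so both are stable, and which one the process settles into depends on how the queries are framed.

```python
# Toy sketch: two self-consistent preference/meta-preference fixed points.
# The parts of the mind, the framings, and the endorsement rule are all
# hypothetical stand-ins for the two equilibria described above.

def endorses(part, framing):
    """What the currently dominant part, queried under a given framing,
    says should be in charge (its meta-preference)."""
    if part == "unreflective_mix":
        # The unreflective starting state defers to whichever part
        # the framing of the query makes salient.
        return "explicit_reasoning" if framing == "analytic" else "heart_and_soul"
    return part  # both reflective stances endorse themselves

def reflective_equilibrium(framing, start="unreflective_mix", max_steps=10):
    """Repeatedly ask about meta-preferences and hand authority to whatever
    gets endorsed, until the answer stops changing (a stable equilibrium)."""
    in_charge = start
    for _ in range(max_steps):
        endorsed = endorses(in_charge, framing)
        if endorsed == in_charge:
            return in_charge
        in_charge = endorsed
    return in_charge

# Different query framings reach different, equally self-consistent equilibria:
print(reflective_equilibrium("analytic"))      # -> explicit_reasoning
print(reflective_equilibrium("experiential"))  # -> heart_and_soul
```

Both end states pass the same consistency check, so the check alone does not pick one of them out.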
This is part of the problem I was trying to describe in multi-agent minds, in the part "what are we aligning the AI with".
Yes, almost certainly there are multiple ways. That's why I want to preserve all meta-preferences, at least to some degree.