My general concern with “using AGI to come up with better plans” is not that it will successfully manipulate us into misalignment, but that it will Goodhart us into reinforcing stereotypes of “how a good plan should look”, or something along that dimension, purely because of how RLHF-style steering works.
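(A toy sketch of what I mean, not anything from an actual RLHF setup: the weights and the “polish” feature are made up. A proxy scorer that partly rewards how plan-shaped a plan looks will, under strong selection, pick plans that look good rather than the plans that are good.)

```python
import random

random.seed(0)

# Each hypothetical "plan" has a true substance score and a superficial
# "polish" score (how much it matches the stereotype of a good plan).
def true_quality(plan):
    return plan["substance"]

def proxy_score(plan):
    # Stand-in for an RLHF-style reward signal: it partly tracks substance,
    # but also rewards polish. Coefficients are arbitrary, for illustration.
    return 0.5 * plan["substance"] + 1.5 * plan["polish"]

plans = [
    {"substance": random.gauss(0, 1), "polish": random.gauss(0, 1)}
    for _ in range(10_000)
]

best_by_proxy = max(plans, key=proxy_score)
best_by_truth = max(plans, key=true_quality)

print("true quality of proxy-selected plan:", true_quality(best_by_proxy))
print("true quality of best available plan:", true_quality(best_by_truth))
# Under heavy optimization pressure, the proxy-selected plan is chosen
# largely for polish, so its true quality lags the best available plan.
```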
Humans already do this, except we have made it politically incorrect to talk about the ways in which human-generated Goodharting makes the world worse (race, gender, politics, etc.).
The examples you give are clearly visible. But if a wrong alignment paradigm gets reinforced because of your attachment to a specific model of causality known to only ten people in the entire world, you risk noticing it too late.
You’re thinking about this the wrong way. AGI governance will not operate like human governance.
Can you elaborate? I don’t understand where we disagree.