It may have more impact at intermediate stages of AI training, where the AI is smart enough to do value reflection and contemplate self-modification/hacking the training process, but not smart enough to immediately figure out all the ways it can go wrong.
E.g., it can come to wrong conclusions about its own values, much as humans can, and then lock those misunderstandings in by steering the training process in that direction, or by designing a successor agent with the wrong values. Or it may design sub-agents without understanding the various pitfalls of multi-agent systems, and get taken over by some Goodharter.
I agree that “what goes into the training set” is a minor concern, though, even with regard to influencing the aforementioned dynamic.