If human values are too complex to reimplement manually (which seems to be the case), you have to create a tool with the capability to do that automatically.
You can’t prove it works before running it in that case. Human values are not some kind of fractal pattern, where something complicated can be generated according to simple rules. In your proposal, the AI would have to learn human values somehow, which means it will have some indicator or another that it’s getting closer to human values (e.g. smiling humans), which will then be susceptible to wire-heading. Having the AI make inferences from a large corpus of human writing might work.
You can’t prove it works before running it in that case. Human values are not some kind of fractal pattern, where something complicated can be generated according to simple rules. In your proposal, the AI would have to learn human values somehow, which means it will have some indicator or another that it’s getting closer to human values (e.g. smiling humans), which will then be susceptible to wire-heading. Having the AI make inferences from a large corpus of human writing might work.