I think OpenAI’s approach to “use AI to aid AI alignment” is pretty bad, but not for the broader reason you give here.
I think most of the value of that strategy comes from downweighting the probability of some bad properties. In the “conditioning LLMs to accelerate alignment” approach, we still have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc. But there’s less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels to elicit the model’s ontology, we can trivially get some amount of acceleration even in worst-case scenarios, etc.
I don’t think of these as solutions to alignment so much as ways of reducing the space of problems to worry about. I disagree with OpenAI’s approach because it views these as solutions in themselves, instead of as simplified problems.