It strikes me that there is a difficult problem involved here: creating a system that can automatically perform useful alignment research (which is generally pretty speculative and theoretical work) without that system just being generally skilled at reasoning/problem solving. I am sure they are aware of this, but it feels like a fundamental issue worth highlighting.
Still, it seems like the special case of “solve the alignment problem as it relates to an automated alignment researcher” might be easier than “solve the alignment problem for reasoning systems in general”, so it is potentially a useful approach.
Does anyone know what resources I could check out to see how they’re planning on designing, aligning, and getting useful work out of their automated alignment researcher? They mention some of the techniques, but it still seems vague to me what kind of model they’re even talking about. Are they basically going to take an LLM fine-tuned on existing research and then apply some kind of scalable oversight/“turbo-RLHF” training regime to push it toward more useful outputs, or what?