I guess train AI to do various kinds of reasoning required for automated alignment research, using RRM, starting with the easier kinds and working your way up to philosophical reasoning?
Thinking about this a bit more, a better plan would be to train AI to do all the kinds of reasoning required for automated alignment research in parallel, so that you can monitor which kinds of reasoning are harder for the AI to learn, and get an earlier warning if one of them looks likely to cause the project to fail. Assuming your plan looks more like this, it would make me feel a little better about B, although I'd still be concerned about C.
More generally, I tend to think that getting automated philosophy right is probably one of the hardest parts of automating alignment research (as well as the overall project of making the AI transition go well), so it makes me worried when alignment researchers don’t talk specifically about philosophy when explaining why they’re optimistic, and makes me want to ask/remind them about it. Hopefully that seems understandable instead of annoying.