Using RL(AI)F may offer a solution to all the points in this section: By starting with a set of established principles, AI can generate and revise a large number of prompts, selecting the best answers through a chain-of-thought process that adheres to these principles. Then, a reward model can be trained and the process can continue as in RLHF. This approach is potentially better than RLHF as it does not require human feedback.
I’d like to say that I fervently disagree with this. Giving an unaligned AI the opportunity to modify its own weights (by categorizing its own responses to questions), then politely asking it to align itself, is quite possibly the worst alignment plan I’ve ever heard; it’s penny-wise, pound-foolish. (Assuming it even is penny-wise; I can think of several ways to generate a self-consistent AI that would cost less.)