This post proposes to make AIs more ethical by putting ethics into Bayesian priors. Unfortunately, the suggestions for how to get ethics into the priors amount to existing ideas for how to get ethics into the learned models: IE, learn from data and human feedback. Putting the result into a prior appears to add technical difficulty without any explanation of why it would improve things. Indeed, of the technical proposals for getting the information into a prior, the one most strongly endorsed by the post is to use the learned model as initial weights for further learning. This amounts to a reversal of current methods for improving the behavior of LLMs, which first perform generative pre-training and then use methods such as RLHF to refine the behavior. The proposal appears roughly to be: do RLHF first, and then do the rest of the training afterwards. This seems unlikely to work.
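To make the ordering concrete, here is a minimal Python sketch. The function names and bodies are hypothetical placeholders standing in for full training procedures, not anyone’s actual training code; they only illustrate the two orderings being contrasted.

```python
# A minimal sketch of the two orderings. The function bodies are
# placeholders standing in for full training procedures, not real APIs.

def generative_pretraining(model, corpus):
    """Fit the model to predict text from a large corpus."""
    ...  # placeholder for the pre-training loop
    return model

def rlhf_finetuning(model, feedback_data):
    """Refine behavior using human feedback (e.g., reward model + RL)."""
    ...  # placeholder for the RLHF loop
    return model

def standard_pipeline(model, corpus, feedback_data):
    # Current practice: pre-train first, then refine behavior.
    model = generative_pretraining(model, corpus)
    return rlhf_finetuning(model, feedback_data)

def proposed_pipeline(model, corpus, feedback_data):
    # The post's implied ordering: learn the "ethical prior" from
    # feedback-style data first, then do the rest of the training.
    model = rlhf_finetuning(model, feedback_data)
    return generative_pretraining(model, corpus)
```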
(Elsewhere, the post mentions the idea of using the learned model to fine-tune GPT, which appears to entirely abandon the goal of incorporating the information into a prior, and instead more or less restates RLHF.)
I agree that “learning the prior”, while contradictory on its face, in fact constitutes a valuable and non-vacuous direction of research. However, I think this proposal trivializes it by failing to recognize what makes such an approach different from simple object-level learning. It doesn’t make sense to learn-the-prior in cases where the same data could be used to directly train the system to similar or better effect, with fewer technical breakthroughs. The critical role played by learning-the-prior is learning how to update in response to data when no clear feedback signal is present to tell us which direction to update in. For example, humans are not always very good at articulating their preferences, so it’s not possible to directly train on the objective of satisfying human preferences, even given human feedback. Without further refinement of our methods, it makes sense to expect highly intelligent RLHF models in the future to reward-hack: to do things which achieve high human feedback without actually satisfying human preferences. It would make sense to propose learning-the-prior type solutions to this problem; but in order to do so, the prior must learn how to adjust for errors in human feedback, a problem the post does not even mention.
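To spell out the failure mode, here is a toy sketch of reward hacking. The actions and numbers are invented purely for illustration, not drawn from the post or from any real system.

```python
# Toy illustration of reward hacking: the optimization target is the
# feedback signal, not the underlying preferences it is meant to track.

def true_preference_value(action: str) -> float:
    """What humans actually want (not directly observable at scale)."""
    return {"honest_report": 1.0, "flattering_report": 0.2}[action]

def human_feedback(action: str) -> float:
    """Imperfect proxy: raters reward what looks good to them."""
    return {"honest_report": 0.7, "flattering_report": 0.9}[action]

actions = ["honest_report", "flattering_report"]

# A strong optimizer of the feedback signal picks the flattering report,
# even though it scores worse on the true preferences.
best_by_feedback = max(actions, key=human_feedback)
assert best_by_feedback == "flattering_report"
assert true_preference_value(best_by_feedback) < true_preference_value("honest_report")
```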
Another key aspect of priors not mentioned here is that they must evaluate models and assign them a score (a prior probability). The text does not flatly contradict this, but on my reading it seems entirely unaware of it. To pick one example of many:
For “respect for life”, gather situations exemplifying respectful/disrespectful actions towards human well-being.
Here, the author proposes training the prior by collecting example situations as training data, as if situations were the thing being scored.
In contrast, “learning an ethical prior” suggests learning how to score models (EG, artificial neural networks) by examining them and assigning each one a score (EG, a “respect for life” score). This is a challenging and important problem, but the post as written shows no awareness of it, much less a plausible proposal for solving it. The implicit plan appears to be to estimate traits such as respect-for-life by running a model on scenarios and checking its agreement with human judges, which eliminates exactly what would be useful about learning the prior as opposed to simple learning.
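To make the type-level distinction explicit, here is a minimal sketch, with stand-in types and a hypothetical function name; nothing here comes from the post itself.

```python
from typing import Callable

# Stand-in types; in practice a Model would be e.g. a network's weights.
Model = object
Situation = str

# What the post's training data naturally supports: a scorer of situations
# (or of a model's outputs on situations), judged against human labels.
SituationScorer = Callable[[Situation], float]

# What a learned prior has to be: a function that examines a model itself
# and assigns it a score, ultimately a prior probability over models.
EthicalPrior = Callable[[Model], float]

def respect_for_life_prior(model: Model) -> float:
    """Placeholder: how to inspect a model and return such a score is
    the hard problem the post never engages with."""
    raise NotImplementedError
```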