One sort of answer is that we often want the posterior, and we often have the likelihood. Slightly more refined: we often find the likelihood easier to estimate than the posterior, so Bayes’ Rule is useful.
Why so?
I think one reason is that we make it the “responsibility” of hypotheses to give their likelihood functions. After all, what is a hypothesis? It’s just a probability distribution (not a probability distribution that we necessarily endorse, but one which we are considering as a candidate). As a probability distribution, its job is to make predictions; that is, to give us probabilities for possible observations. These are the likelihoods.
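For a toy illustration (my example, not anything from the original discussion): take the hypothesis “this coin lands heads with probability 0.7.” Given the observation sequence HHT, that hypothesis assigns

$$P(\text{HHT} \mid \theta = 0.7) = 0.7 \times 0.7 \times 0.3 = 0.147.$$

That number is the likelihood, and the hypothesis hands it to us just by being a probability distribution over observations.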
We want the posterior because it tells us how much faith to place in the various hypotheses; that is, whether (and to what degree) we should trust the various probability distributions we were considering.
So, in some sense, we use Bayes’ Rule because we aren’t sure how to assign probabilities, but we can come up with several candidate options.
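Spelling out the rule itself (standard notation, nothing specific to this discussion): for candidate hypotheses $H_i$ and data $D$,

$$P(H_i \mid D) = \frac{P(D \mid H_i)\,P(H_i)}{\sum_j P(D \mid H_j)\,P(H_j)}.$$

Each candidate $H_i$ supplies its own likelihood $P(D \mid H_i)$; the prior $P(H_i)$ and the resulting posterior $P(H_i \mid D)$ are where our uncertainty about which candidate to trust lives.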
One weak counterexample to this story is regression, IE, curve-fitting. We can interpret regression in a Bayesian way easily enough. However, the curves don’t come with likelihoods baked in. They only tell us how to interpolate/extrapolate with point-estimates; they don’t give a full probability distribution. We’ve got to “soften” these predictions, layering probabilities on top, in order to apply the Bayesian way of thinking.
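Here’s a rough sketch of that “softening” move in Python. The curve fit itself only returns point estimates; the Gaussian noise model layered on top (and the particular noise estimate) are assumptions I’m adding, not something the regression supplies:

```python
import numpy as np

# Toy data: a noisy line.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.shape)

# Ordinary least-squares fit: the "bare" curve, point estimates only.
slope, intercept = np.polyfit(x, y, deg=1)

def point_prediction(x_new):
    """What regression gives us directly: a single number, no probabilities."""
    return slope * x_new + intercept

# "Softening": assume Gaussian observation noise around the curve, with the
# noise scale estimated from the residuals. Now the fitted curve yields an
# actual likelihood for any observation, so Bayesian machinery can apply.
residuals = y - point_prediction(x)
sigma = residuals.std(ddof=2)  # rough noise estimate; an added assumption

def log_likelihood(x_new, y_new):
    """log P(y_new | x_new, fitted curve, Gaussian noise assumption)."""
    mean = point_prediction(x_new)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y_new - mean) ** 2 / (2 * sigma**2)

print(point_prediction(5.0))      # point estimate
print(log_likelihood(5.0, 11.0))  # a log-density, only after softening
```

Which softening to use is a further modeling choice; different noise models turn the same curve into different hypotheses.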
Is the idea to train with high beta and then use lower beta post-training?
If so, how does this relate to reward hacking and value preservation? IE, where do V1 and V2 come from, if they aren’t the result of a further training step? If high beta is used during training (to achieve beta-coherence) and then low beta is used in production, then the choice between V1 and V2 must be made in production (since it is made with low beta), but then it seems like V1=V2.
If not, then when does the proposal suggest using high beta vs. low beta? If low beta is used during training, then how is it that V is coherent with respect to high beta instead?
Another concern I have is that if both beta values are within a range that can yield useful capabilities, it seems like the difference cannot be too great. IIUC, the postulated planning failure can only manifest if the reward-hacking relies heavily on a long string of near-optimal actions, which becomes improbable under increased temperature. But any capabilities which likewise rely on long strings of near-optimal actions will be hurt in the same way. (However, this concern is secondary to my main confusion.)
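To put rough numbers on that worry, under my own reading of beta (an inverse temperature in a Boltzmann/softmax policy, pi(a) proportional to exp(beta * Q(a)); the post may intend something else):

```python
import numpy as np

def prob_best_action(q_values, beta):
    """Probability of picking the argmax action under a Boltzmann policy
    pi(a) proportional to exp(beta * Q(a)); beta is inverse temperature
    (my assumption about what beta means here)."""
    q = np.asarray(q_values, dtype=float)
    probs = np.exp(beta * (q - q.max()))  # subtract max for numerical stability
    probs /= probs.sum()
    return probs[q.argmax()]

# Toy setup: 4 actions per step, the best beating the rest by a small margin,
# and a plan that needs 20 near-optimal steps in a row. All numbers illustrative.
q_per_step = [1.0, 0.8, 0.8, 0.8]
steps = 20

for beta in (10.0, 2.0):  # stand-ins for "high" and "low" beta
    p_step = prob_best_action(q_per_step, beta)
    print(f"beta={beta:>4}: per-step p={p_step:.3f}, "
          f"P(whole {steps}-step chain)={p_step ** steps:.2e}")
```

With these made-up numbers the 20-step chain goes from roughly 10^-3 at beta=10 to roughly 10^-10 at beta=2. My point is that a gap like that hits long careful plans in general, not just reward-hacking ones, so if both betas must support that kind of planning they can’t be far apart.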
Trained with what procedure, exactly?
(These parts made sense to me modulo my other questions/concerns/confusions.)