[A bunch of what I’m going to say is maybe obvious, but I’m uncertain what will and won’t be obvious, so I’m saying it anyway.]
E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
We can already pick a threshold for a classifier trained “normally” and vary this threshold based on the context. (Or using an ensemble of classifiers.)
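As a minimal sketch of this (the thresholds, contexts, and scores are all hypothetical, purely for illustration), varying the flagging threshold by context is just a lookup:

```python
import numpy as np

# Hypothetical context-dependent thresholds: flag at a much lower
# score in high-stakes contexts than in routine ones.
THRESHOLDS = {"high_stakes": 0.01, "routine": 0.2}

def flag_harmful(scores: np.ndarray, context: str) -> np.ndarray:
    """Flag examples whose harm-classifier score meets the threshold
    chosen for this context."""
    return scores >= THRESHOLDS[context]

scores = np.array([0.005, 0.05, 0.3])
high = flag_harmful(scores, "high_stakes")  # → [False, True, True]
routine = flag_harmful(scores, "routine")   # → [False, False, True]
```

With an ensemble, `scores` would just be the mean (or max, if you want to be conservative) of the members' scores before thresholding.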
Within distribution, we should expect trained classifiers to be well calibrated. And, using online training, it will likely be possible to get good calibration on average despite the distribution shifting over time.
Of course, calibration can fail in rare cases (e.g. very strange OOD inputs) or if we want to generalize to some new distribution where we can’t obtain good labels and thus would prefer to just hope for good generalization.
For the exact ELK case we were talking about, we wanted to depend on generalization in this way (assuming we’re on the same page). More precisely, we wanted to train a reporter on a clean/easy set where we trust the labels, but the data is much easier than the data we actually care about and thus is OOD.
So, we might run into issues where our reporter fails to be calibrated (or even to have signal) off distribution and thus our classifier thresholds might not be very meaningful. (Of course, we can still do things like pay attention to the top 1% most harmful actions based on our reporter.)
Thus, you might hope you can solve these calibration issues with Bayesian ML. But Bayesian ML also has no guarantees about avoiding rare failures and no guarantees of good generalization (guarantees here require imposing additional properties on the prior and the data). Further, as noted earlier, I think ensembling is basically just taking some number of samples of predictive models (hypotheses) from the posterior anyway.[1]
If we assume that ensembling is just taking some number of samples from the posterior, then Hoeffding’s inequality gives us relatively tight bounds (with no additional assumptions) on the probability that Bayesian ML would estimate.
For an ensemble of N models, these bounds won’t allow us to bound probabilities lower than 1/N, but we’ll probably run into other issues before these 1/N bounds are the main blocker. (As in, imagine the case where all of our ensemble reports a tiny probability of harm. We can bound the probability that Bayesian ML would report >1/N probability of harm. This is again by Hoeffding’s inequality. If we wanted to bound probabilities to <<1/N, then our ensemble won’t work, but I also expect that you run into other issues such as hypotheses like “say there will be harm if humans would have been confused” which ensure that Bayesian ML always indicates some probability of harm in the cases we need to handle when using generalization.)
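To make the Hoeffding argument concrete, here is a small sketch of the one-sided bound (the ensemble sizes and gaps below are illustrative, not from the original discussion):

```python
import math

def hoeffding_bound(n_models: int, gap: float) -> float:
    """One-sided Hoeffding bound: if the N ensemble members are i.i.d.
    samples from the posterior (with reports bounded in [0, 1]), this
    bounds the probability, over the draw of the ensemble, that the true
    posterior harm probability exceeds the ensemble mean by more than
    `gap`."""
    return math.exp(-2 * n_models * gap ** 2)

# E.g. with 100 members all reporting ~0 harm, the true posterior
# probability of harm exceeds 0.2 with probability at most:
bound = hoeffding_bound(100, 0.2)  # exp(-8) ≈ 3.4e-4
```

For a fixed gap, the bound tightens exponentially as the ensemble grows.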
(Separately, I think deep neural networks are themselves relatively well described as ensembles, such that the variance of samples from the posterior will be low in practice. This makes ensembling as an approximation to Bayesian ML even better.)
(Perhaps you (and other people who think Bayesian ML would do something useful) don’t buy this “training a model is just sampling from the posterior” argument? Or perhaps you buy this argument, but you think we care a lot about probabilities <<1/N and also that we can rule out bad generalizations which will always cause noise in probability estimates this low?)
I think the reason Bayesian ML isn’t that widely used is that it’s computationally intractable. So Bengio’s stuff would have to successfully make it competitive with other methods.
I think it’s probably tractable to apply Bayesian ML to learn moderately-small reporter heads on top of existing neural networks. And, in practice, I expect this probably gets you almost all of the advantages of doing Bayesian ML over the entire neural network.
For the simplest case, consider learning a linear probe on embeddings with Bayesian ML. This is totally computationally doable. (It’s just Bayesian logistic regression.)
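As a rough sketch of how cheap this is, here is a Laplace approximation to Bayesian logistic regression on toy “embeddings” in plain NumPy (the data, prior variance, and optimizer settings are all made up for illustration; a real version would use held-out model activations):

```python
import numpy as np

def bayesian_logreg_laplace(X, y, prior_var=1.0, steps=500, lr=0.5):
    """Laplace approximation to Bayesian logistic regression: find the
    MAP weights under a Gaussian prior, then approximate the posterior
    as a Gaussian centered there with covariance = inverse Hessian."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (p - y) + w / prior_var  # grad of negative log-posterior
        w -= lr * grad / n
    p = 1 / (1 + np.exp(-X @ w))
    # Hessian of the negative log-posterior at the MAP point.
    H = X.T @ (X * (p * (1 - p))[:, None]) + np.eye(d) / prior_var
    return w, np.linalg.inv(H)  # posterior mean and covariance

# Toy 2D "embeddings" whose labels depend (noisily) on the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)
w_map, cov = bayesian_logreg_laplace(X, y)
```

Sampling weight vectors from `N(w_map, cov)` then gives an ensemble of probes essentially for free, which can be averaged into posterior-predictive probabilities.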
(When viewed from this perspective, I think training a linear probe using the difference-in-means method gives something like a maximum a posteriori estimate under this model. And difference-in-means works well empirically and has other well-motivated reasons for working.)
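For reference, the difference-in-means probe itself is a one-liner; this toy example (made-up embeddings and labels) shows the direction it produces:

```python
import numpy as np

def diff_in_means_probe(X, y):
    """Probe direction = mean embedding of positive examples minus
    mean embedding of negative examples."""
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

# Toy embeddings: positives lie along the first axis, negatives along the second.
X = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
y = np.array([1, 1, 0, 0])
direction = diff_in_means_probe(X, y)  # → [2., -3.]
scores = X @ direction  # higher score = more "positive"
```

The resulting scores are then thresholded or calibrated like any other probe output.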
My understanding is that training tiny (e.g. 2 layers with n_embed=512) models (transformers, MLPs, whatever) with Bayesian ML is also doable, so we could just use such a tiny model as a reporter head and get most of the benefits of Bayesian ML.
I think the reason we don’t see Bayesian ML is that ensembles probably just work fine in practice.
If we imagine that our prior is the space of (reporter) model initializations, then Bayesian ML will aim to approximate updating toward (reporter) models that performed well in training. Let’s simplify and imagine that instead Bayesian ML is just approximating the distribution of initializations which get better training-distribution performance than some threshold. We’ll refer to this distribution as the posterior. I claim that training an ensemble of N (reporter) models from different initializations is a very good approximation of sampling N (reporter) models from this posterior. I think this is basically a prediction of SLT and most other theories of how neural networks learn.
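Concretely, the ensemble-as-posterior-samples procedure is just training the same reporter from N different random initializations and averaging the reports; a toy sketch (hypothetical data and hyperparameters):

```python
import numpy as np

def train_probe(X, y, seed, steps=300, lr=0.5):
    """Train one logistic-regression 'reporter' from a random
    initialization (the seed plays the role of the prior draw)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def ensemble_harm_probability(ensemble, x):
    """Treat the N trained reporters as ~N posterior samples and average
    their predicted harm probabilities."""
    return float(np.mean([1 / (1 + np.exp(-x @ w)) for w in ensemble]))

# Toy training data: "harm" is determined by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)
ensemble = [train_probe(X, y, seed) for seed in range(8)]
p_hi = ensemble_harm_probability(ensemble, np.array([3.0, 0.0, 0.0, 0.0]))
```

Under the claim above, the spread of the members’ reports on a given input approximates the posterior uncertainty there.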
IIRC, Adam Gleave tried this in the summer of 2021 with one of Chinchilla/Gopher while he was interning at DeepMind, and it did not improve on ensembling for the tasks he considered.