The AI scientist idea does not bypass ELK; it has to face the difficulty head-on. From the AI scientist blog post (emphasis mine):
One way to think of the AI Scientist is like a human scientist in the domain of pure physics, who never does any experiment. Such an AI reads a lot, in particular it knows about all the scientific literature and any other kind of observational data, including about the experiments performed by humans in the world. From this, it deduces potential theories that are consistent with all these observations and experimental results. The theories it generates could be broken down into digestible pieces comparable to scientific papers, and we may be able to constrain it to express its theories in a human-understandable language (which includes natural language, scientific jargon, mathematics and programming languages). Such papers could be extremely useful if they allow to push the boundaries of scientific knowledge, especially in directions that matter to us, like healthcare, climate change or the UN SDGs.
If you get an AI scientist like that, clearly you need to have solved ELK. And I don’t think this would be a natural byproduct of the Bayesian approach (big Bayesian NNs don’t produce artifacts more legible than regular NNs, right?).
I agree this agenda wants to solve ELK by making legible predictive models.
But even if it doesn’t succeed in that, it could bypass/solve ELK as it presents itself in the problem of evaluating whether a given plan has a harmful outcome, by training a posterior over harm evaluation models rather than a single one, and then acting conservatively with respect to that posterior. The part which seems like a natural byproduct of the Bayesian approach is “more systematically covering the space of plausible harm evaluation models”.
I guess I don’t understand how it would even be possible to be conservative in the right way if you don’t solve ELK: just because the network is Bayesian doesn’t mean it can’t be scheming, and thus be conservative in the right way on the training distribution but fail catastrophically later, right? How can the training process and the overseer distinguish between “the model is confident and correct that doing X is totally safe” (but humans don’t know why) and “the model tells you it’s confident that doing X is totally safe, but it’s actually false”? Where do you get the information that distinguishes “confident OOD prediction because it’s correct” from “the model is confidently lying OOD”? It feels like to solve this kind of issue, you ultimately have to rely on ELK (or abstain from making predictions in domains humans don’t understand, or ignore the possibility of scheming).
I’m skeptical this realistically improves much over doing normal ensembling (ensembling is basically just taking a small number of samples from the posterior anyway).
It could in principle be better, but this would require that estimating relatively low-measure predictions is key (predictions which wouldn’t come up in a reasonably sized ensemble). I would guess that if you pause actions whenever any low-measure hypothesis indicates harm, you’ll be paralyzed with inaction (because the hypothesis “predict harm whenever a human wouldn’t understand what’s going on” isn’t going to be that low measure).
If it did improve over normal ensembling, why wouldn’t Bayesian ML already be relatively widely used?
If you could tractably obtain and work with the posterior, I think that would be much more useful than a normal ensemble. E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
I think the reason Bayesian ML isn’t that widely used is that it’s intractable to do. So Bengio’s stuff would have to successfully make it competitive with other methods.
[A bunch of what I’m going to say is maybe obvious, but I’m uncertain what will and won’t be obvious, so I’m saying it anyway.]
E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
We can already pick a threshold for a classifier trained “normally” and vary this threshold based on the context. (Or using an ensemble of classifiers.)
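To be concrete, here’s a minimal sketch of the kind of context-dependent thresholding I mean (the classifier interface, context names, and thresholds are hypothetical placeholders, not anything from an existing setup):

```python
import numpy as np

# `classifiers` is a hypothetical list of callables, each mapping an input to a
# harm probability in [0, 1]; the context names and thresholds are illustrative.
CONTEXT_THRESHOLDS = {"low_stakes": 0.5, "high_stakes": 0.05}

def flag_for_review(classifiers, x, context):
    harm_probs = np.array([clf(x) for clf in classifiers])
    # Conservative aggregation: flag if the most worried ensemble member crosses
    # the (context-dependent) threshold.
    return harm_probs.max() > CONTEXT_THRESHOLDS[context]
```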
Within distribution, we should expect trained classifiers to be well calibrated. And, using online training, it will likely be possible to get good calibration on average despite the distribution shifting over time.
Of course, calibration can fail in rare cases (e.g. very strange OOD inputs) or if we want to generalize to some new distribution where we can’t obtain good labels and thus would prefer to just hope for good generalization.
For the exact ELK case we were talking about, we wanted to depend on generalization in this way (assuming we’re on the same page). More precisely, we wanted to train a reporter on a clean/easy set where we trust the labels, but the data is much easier than the data we actually care about and thus is OOD.
So, we might run into issues where our reporter fails to be calibrated (or even to have signal) off distribution and thus our classifier thresholds might not be very meaningful. (Of course, we can still do things like pay attention to the top 1% most harmful actions based on our reporter.)
Thus, you might hope you can solve these calibration issues with Bayesian ML. But Bayesian ML also has no guarantees about avoiding rare failures or about having good generalization (guarantees here require imposing additional properties on the prior and the data). Further, as noted earlier, I think ensembling is basically just taking some number of samples of predictive models (hypotheses) from the posterior anyway.[1]
If we assume that ensembling is just taking some number of samples from the posterior, then we can use Hoeffding’s inequality to get relatively tight bounds (with no additional assumptions) on the probability that Bayesian ML would estimate.
For an ensemble of N models, these bounds won’t allow us to bound probabilities lower than 1/N, but we’ll probably run into other issues before these 1/N bounds are the main blocker. (As in, imagine the case where all of our ensemble reports a tiny probability of harm. We can bound the probability that Bayesian ML would report >1/N probability of harm. This is again by Hoeffding’s inequality. If we wanted to bound probabilities to <<1/N, then our ensemble won’t work, but I also expect that you run into other issues, such as hypotheses like “say there will be harm if humans would have been confused” which ensure that Bayesian ML always indicates some probability of harm in the cases we need to handle when using generalization.)
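(To spell out the Hoeffding step with some illustrative numbers of my own, treating the N ensemble members as i.i.d. draws from the posterior, each outputting a harm probability in [0, 1]:)

```python
import numpy as np

# One-sided Hoeffding bound for i.i.d. samples in [0, 1]:
# P(posterior mean >= ensemble mean + t) <= exp(-2 * n * t^2).
def hoeffding_tail(n, t):
    return np.exp(-2 * n * t ** 2)

# E.g. if an ensemble of 100 reporters all output ~0 probability of harm, the chance
# (under the i.i.d.-samples picture) that the posterior-mean harm probability actually
# exceeds 0.15 is at most:
print(hoeffding_tail(100, 0.15))  # ~0.011
```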
(Separately, I think deep neural networks are themselves relatively well described as ensembles, such that the variance of samples from the posterior will be low in practice. This makes ensembling as an approximation to Bayesian ML even better.)
(Perhaps you (and other people who think Bayesian ML would do something useful) don’t buy this “training a model is just sampling from the posterior” argument? Or perhaps you buy this argument, but you think we care a lot about probabilities <<1/N and also that we can rule out bad generalizations which will always cause noise in probability estimates this low?)
I think the reason Bayesian ML isn’t that widely used is that it’s intractable to do. So Bengio’s stuff would have to successfully make it competitive with other methods.
I think it’s probably tractable to apply Bayesian ML to learn moderately-small reporter heads on top of existing neural networks. And, in practice, I expect this probably gets you almost all of the advantages of doing Bayesian ML over the entire neural network.
For the simplest case, consider learning a linear probe on embeddings with Bayesian ML. This is totally computationally doable. (It’s just Bayesian logistic regression.)
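Something like the following minimal sketch (a Laplace approximation to Bayesian logistic regression on frozen embeddings; the function names, prior, and hyperparameters are my illustrative choices, not a reference implementation) is what I have in mind:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(w, X, y, prior_var):
    # Bernoulli likelihood for labels y in {0, 1} plus an isotropic Gaussian prior on w.
    logits = X @ w
    nll = np.sum(np.logaddexp(0.0, -(2 * y - 1) * logits))
    return nll + 0.5 * np.sum(w ** 2) / prior_var

def laplace_fit(X, y, prior_var=1.0):
    # Find the MAP probe, then use the Hessian there as a Gaussian approximation
    # to the posterior over probe weights.
    d = X.shape[1]
    w_map = minimize(neg_log_posterior, np.zeros(d), args=(X, y, prior_var)).x
    p = 1.0 / (1.0 + np.exp(-(X @ w_map)))
    hessian = (X * (p * (1 - p))[:, None]).T @ X + np.eye(d) / prior_var
    return w_map, np.linalg.inv(hessian)

def posterior_harm_probs(X_new, w_map, cov, n_samples=1000, rng=None):
    # Sample probe weights from the approximate posterior, giving a distribution of
    # harm probabilities per input rather than a single point estimate.
    rng = np.random.default_rng(0) if rng is None else rng
    ws = rng.multivariate_normal(w_map, cov, size=n_samples)
    return 1.0 / (1.0 + np.exp(-(X_new @ ws.T)))  # shape: (n_new, n_samples)
```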
(When viewed from this perspective, I think that training a linear probe using the difference in means method will be something like the maximum likelihood estimator for this posterior. And, difference in means works well empirically and has other well motivated reasons for working.)
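(For reference, a tiny sketch of the difference-in-means probe; the midpoint bias is just one simple choice of threshold:)

```python
import numpy as np

def diff_in_means_probe(X, y):
    # Probe direction: mean embedding of positive examples minus mean of negatives.
    mu_pos, mu_neg = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    direction = mu_pos - mu_neg
    bias = -direction @ (mu_pos + mu_neg) / 2  # score(x) = direction @ x + bias
    return direction, bias
```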
My understanding is that training tiny (e.g. 2 layers with n_embed=512) models (transformers, MLPs, whatever) with Bayesian ML is also doable, so we could just use such a tiny model as a reporter head and get most of the benefits of Bayesian ML.
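Something like this rough Langevin-dynamics sketch in PyTorch is the kind of thing I mean (the architecture, step size, prior, and full-batch gradients are guesses/simplifications on my part, not a reference implementation):

```python
import torch
import torch.nn as nn

def langevin_posterior_samples(embeds, labels, n_samples=20, steps_per_sample=500,
                               lr=1e-4, prior_var=1.0, d_embed=512):
    # Tiny MLP reporter head on frozen embeddings; labels is a float tensor of 0./1.
    head = nn.Sequential(nn.Linear(d_embed, 256), nn.ReLU(), nn.Linear(256, 1))
    loss_fn = nn.BCEWithLogitsLoss(reduction="sum")
    samples = []
    for _ in range(n_samples):
        for _ in range(steps_per_sample):
            # Negative log posterior = NLL + Gaussian prior on the weights.
            loss = loss_fn(head(embeds).squeeze(-1), labels)
            loss = loss + sum((p ** 2).sum() for p in head.parameters()) / (2 * prior_var)
            head.zero_grad()
            loss.backward()
            with torch.no_grad():
                for p in head.parameters():
                    # Langevin update: gradient step plus Gaussian noise scaled to the step.
                    p -= lr * p.grad
                    p += torch.randn_like(p) * (2 * lr) ** 0.5
        samples.append([p.detach().clone() for p in head.parameters()])
    return samples
```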
I think the reason we don’t see Bayesian ML is that ensembles probably just work fine in practice.
If we imagine that our prior is the space of (reporter) model initializations, then Bayesian ML will aim to approximate updating toward (reporter) models that performed well in training. Let’s simplify and imagine that instead Bayesian ML is just approximating the distribution of initializations which get better training-distribution performance than some threshold. We’ll refer to this distribution as the posterior. I claim that training an ensemble of N (reporter) models from different initializations is a very good approximation of sampling N (reporter) models from the posterior. I think this is basically a prediction of SLT and most other theories of how neural networks learn.
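Concretely, the comparison I have in mind looks something like this (illustrative sizes and training setup; the point is just that the only difference between ensemble members is the random initialization):

```python
import torch
import torch.nn as nn

def train_reporter_ensemble(embeds, labels, n_models=10, d_embed=512, epochs=200):
    # Train N reporter heads on the same data, differing only in their random init,
    # and treat them as N approximate samples from the posterior described above.
    heads = []
    for seed in range(n_models):
        torch.manual_seed(seed)  # different initialization per ensemble member
        head = nn.Sequential(nn.Linear(d_embed, 256), nn.ReLU(), nn.Linear(256, 1))
        opt = torch.optim.Adam(head.parameters(), lr=1e-3)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(head(embeds).squeeze(-1), labels)
            loss.backward()
            opt.step()
        heads.append(head)
    return heads

def ensemble_harm_distribution(heads, embeds_new):
    # One approximate posterior sample of the harm probability per ensemble member.
    with torch.no_grad():
        return torch.stack([torch.sigmoid(h(embeds_new).squeeze(-1)) for h in heads])
```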
For the simplest case, consider learning a linear probe on embeddings with Bayesian ML. This is totally computationally doable. (It’s just Bayesian logistic regression.)
IIRC Adam Gleave tried this in summer of 2021 with one of Chinchilla/Gopher while he was interning at DeepMind, and this did not improve on ensembling for the tasks he considered.