It currently seems to me like almost all the interesting work is in the step where we need to know whether a hypothesis implies harm.
Putting this in language which makes the situation clearer to me, “we need to know whether a given predictive model predicts/‘thinks’ that a given action will result in bad outcomes (despite ourselves not knowing this and despite not having access to reliable short-term measurements)”.
This is a very general version of the ELK problem.[1]
As far as I can tell, the proposal for ensuring that we know whether a given predictive model ‘thinks’ that a given action will result in bad outcomes is to ensure that the predictive models are sufficiently human interpretable.
(This is broadly equivalent to imitative generalization. Note that various worst-case-theory counterexamples to using imitative generalization are in the ELK report along with other discussion.)
Given that so much of the action is in having human-interpretable predictive models (again, these would be called hypotheses in the language of this post), I’d be excited about work demonstrating this in hard cases where the most competitive approach would be to train a black-box predictor.
For instance, I’d be excited about work demonstrating the ability to produce human interpretable predictive models which allow humans to perform much better at next token prediction while also understanding what’s going on.
While tasks like next token prediction are relatively straightforward to set up, the most analogous cases would probably be cases where we want to train an AI agent, but the agent could tamper with our measurements[2] and humans struggle to understand if the measurements were tampered with based on looking at the actions of the AI agent.
I’m somewhat nervous that in practice, work on this agenda will almost entirely focus on the “reliable probabilistic inference” component even though this component doesn’t seem to be doing most of the work.
(If reliable probabilistic inference failed, but we had a methodology for finding some predictive hypotheses that humans understand, that would be a very valuable contribution. But probabilistic inference with asymptotic guarantees doesn’t clearly get us much on its own.)
[My thoughts here are partially based on reading some comments on this agenda I’ve seen elsewhere and a variety of other discussion with people, but I’m unsure if these individuals would like their names to appear in this comment, so I haven’t included acknowledgements. All mistakes are my own.]
It’s a very general version of ELK as it needs to handle all bad outcomes, in principle including bad outcomes which take a very long time to manifest. (Unlike the diamond in the vault example from the ELK report where we just care about an outcome that occurs on a short time horizon.)
Note that the linked work (by myself and collaborators) doesn’t have the property that humans would have a hard time knowing if the measurements were tampered with based on inspecting the action.
IIUC, I think that in addition to making predictive models more human interpretable, there’s another way this agenda aspires to get around the ELK problem.
Rather than having a range of predictive models/hypotheses but just a single notion of what constitutes a bad outcome, it wants to also learn a posterior over hypotheses for what constitutes a bad outcome, and then act conservatively with respect to that.
IIRC the ELK report is about trying to learn an auxiliary model which reports on whether the predictive model predicts a bad outcome, and we want it to learn the “right” notion of what constitutes a bad outcome, rather than a wrong one like “a human’s best guess would be that this outcome is bad”. I think Bengio’s proposal aspires to learn a range of plausible auxiliary models, and sufficiently cover that space that both the “right” notion and the wrong one are in there, and then if any of those models predict a bad outcome, call it a bad plan.
EDIT: from a quick look at the ELK report, this idea (“ensembling”) is mentioned under “How we’d approach ELK in practice”. Doing ensembles well seems like sort of the whole point of the AI scientist idea, so it’s plausible to me that this agenda could make progress on ELK even if it wasn’t specifically thinking about the problem.
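To make the “posterior over harm predicates, then act conservatively” idea concrete, here is a minimal sketch of the kind of decision rule being described; it assumes we already have sampled harm models with posterior weights, and all names here are hypothetical illustrations rather than anything from Bengio’s actual proposal:

```python
from typing import Callable, Sequence

# Hypothetical: each sampled hypothesis maps a plan to its predicted P(harm).
HarmModel = Callable[[object], float]

def plan_is_acceptable(
    plan,
    harm_models: Sequence[HarmModel],
    weights: Sequence[float],       # normalized posterior weights of the samples
    harm_threshold: float = 0.5,    # when a single hypothesis counts as "predicting harm"
    mass_threshold: float = 0.01,   # how much posterior mass of harm-predictors we tolerate
) -> bool:
    """Reject the plan if hypotheses carrying non-trivial posterior mass predict harm."""
    flagged_mass = sum(
        w for h, w in zip(harm_models, weights) if h(plan) > harm_threshold
    )
    return flagged_mass < mass_threshold
```

The thresholds are doing a lot of work here: set `mass_threshold` to zero and (as discussed below) hypotheses like “predict harm whenever a human wouldn’t understand what’s going on” plausibly leave you paralyzed.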
If the claim is that it’s easier to learn a covering set for a “true” harm predicate and then act conservatively wrt that set than to learn a single harm predicate, this is not a new approach. E.g. just from CHAI:
The Inverse Reward Design paper, which tries to straightforwardly implement the posterior P(True Reward | Human specification of Reward) and act conservatively wrt this outcome.
The Learning preferences from state of the world paper, which does this for P(True Reward | Initial state of the environment) and also acts conservatively wrt the outcome.
[A bunch of other papers which consider this for Reward | Another Source of Evidence, including randomly sampled rewards and human off-switch pressing. Also the CIRL paper, which proposes using uncertainty to directly solve the meta-problem of “the thing this uncertainty is for”.]
It’s discussed as a strategy in Stuart’s Human Compatible, though I don’t have a copy to reference the exact page number.
I also remember finding Rohin’s Reward Uncertainty to be a good summary of ~2018 CHAI thinking on this topic. There’s also a lot more academic work in this vein from other research groups/universities too.
The reason I’m not excited about this work is that (as Ryan and Fabien say) correctly specifying this distribution without solving ELK also seems really hard.
It’s clear that if you allow H to be “all possible harm predicates”, then an AI that acts conservatively wrt to this is safe, but it’s also going to be completely useless. Specifying the task of learning a good enough harm predicate distribution that both covers the “true” harms and also allows your AI to do things is quite hard, and subject to various kinds of terrible misspecification problems that seem not much easier to deal with than the case where you just try to learn a single harm predicate.
Solving this task (that is, solving the task spec of learning this harm predicate posterior) via probabilistic inference also seems really hard from the “Bayesian ML is hard” perspective.
Ironically, the state of the art for this when I left CHAI in 2021 was “ask the most capable model (an LLM) to estimate the uncertainty for you in one go” and “ensemble the point estimates of very few but quite capable models” (that is, ensembling, but with ensemble sizes in the single-digit range, e.g. 4). These seemed to outperform even the “learn a generative/contrastive model to get features, and then learn a Bayesian logistic regression on top of it” approaches. (Anyone who’s currently at CHAI should feel free to correct me here.)
I think? that the way Bengio’s approach tries to avoid these difficulties is by trying to solve Bayesian ML. I’m not super confident that he’ll do better than “finetune an LLM to help you do it”, which is presumably what we’d be doing anyways?
(That being said, my main objections are akin to the ontology misidentification problem in the ELK paper or Ryan’s comments above.)
The AI scientist idea does not bypass ELK, it has to face the difficulty head on. From the AI scientist blog post (emphasis mine):
One way to think of the AI Scientist is like a human scientist in the domain of pure physics, who never does any experiment. Such an AI reads a lot, in particular it knows about all the scientific litterature and any other kind of observational data, including about the experiments performed by humans in the world. From this, it deduces potential theories that are consistent with all these observations and experimental results. The theories it generates could be broken down into digestible pieces comparable to scientific papers, and we may be able to constrain it to express its theories in a human-understandable language (which includes natural language, scientific jargon, mathematics and programming languages). Such papers could be extremely useful if they allow to push the boundaries of scientific knowledge, especially in directions that matter to us, like healthcare, climate change or the UN SDGs.
If you get an AI scientist like that, clearly you need to have solved ELK. And I don’t think this would be a natural byproduct of the Bayesian approach (big Bayesian NNs don’t produce artifacts more legible than regular NNs, right?).
I agree this agenda wants to solve ELK by making legible predictive models.
But even if it doesn’t succeed in that, it could bypass/solve ELK as it presents itself in the problem of evaluating whether a given plan has a harmful outcome, by training a posterior over harm evaluation models rather than a single one, and then acting conservatively with respect to that posterior. The part which seems like a natural byproduct of the Bayesian approach is “more systematically covering the space of plausible harm evaluation models”.
I guess I don’t understand how it would even be possible to be conservative in the right way if you don’t solve ELK: just because the network is Bayesian doesn’t mean it can’t be scheming, and thus be conservative in the right way on the training distribution but fail catastrophically later, right? How can the training process and the overseer tell the difference between “the model is confident and correct that doing X is totally safe” (but humans don’t know why) and “the model tells you it’s confident that doing X is totally safe, but it’s actually false”? Where do you get the information that distinguishes “confident OOD prediction because it’s correct” from “the model is confidently lying OOD”? It feels like to solve this kind of issue, you ultimately have to rely on ELK (or abstain from making predictions in domains humans don’t understand, or ignore the possibility of scheming).
I’m skeptical this realistically improves much over doing normal ensembling (ensembling is basically just taking a small number of samples from the posterior anyway).
It could in principle be better, but this would require that estimating relatively low-measure predictions is key (predictions which wouldn’t come up in a reasonably sized ensemble). I would guess that if you pause actions whenever there is any low-measure hypothesis which indicates harm, you’ll be paralyzed with inaction (because the hypothesis “predict harm whenever a human wouldn’t understand what’s going on” isn’t going to be that low measure).
If it did improve over normal ensembling, why wouldn’t Bayesian ML already be relatively widely used?
If you could tractably obtain and work with the posterior, I think that would be much more useful than a normal ensemble. E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
I think the reason Bayesian ML isn’t that widely used is because it’s intractable to do. So Bengio’s stuff would have to successfully make it competitive with other methods.
[A bunch of what I’m going to say is maybe obvious, but I’m uncertain what will and won’t be obvious, so I’m saying it anyway.]
E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
We can already pick a threshold for a classifier trained “normally” and vary this threshold based on the context. (Or using an ensemble of classifiers.)
Within distribution, we should expect trained classifiers to be well calibrated. And, using online training, it will likely be possible to get good calibration on average despite the distribution shifting over time.
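For what it’s worth, the “well calibrated within distribution” claim is cheap to check empirically on held-out labeled data; here is a minimal sketch of a standard binned calibration measure (the binning scheme and names are my own, not something from this discussion):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins: int = 10) -> float:
    """Equal-width-bin expected calibration error for a binary classifier.

    `probs` are predicted P(harm); `labels` are 0/1 outcomes on held-out,
    in-distribution data. Low ECE means predicted probabilities roughly match
    observed frequencies on this distribution (it says nothing about OOD inputs).
    """
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece
```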
Of course, calibration can fail in rare cases (e.g. very strange OOD inputs) or if we want to generalize to some new distribution where we can’t obtain good labels and thus would prefer to just hope for good generalization.
For the exact ELK case we were talking about, we wanted to depend on generalization in this way (assuming we’re on the same page). More precisely, we wanted to train a reporter on a clean/easy set where we trust the labels, but the data is much easier than the data we actually care about and thus is OOD.
So, we might run into issues where our reporter fails to be calibrated (or even to have signal) off distribution and thus our classifier thresholds might not be very meaningful. (Of course, we can still do things like pay attention to the top 1% most harmful actions based on our reporter.)
Thus, you might hope you can solve these calibration issues with Bayesian ML. But, Bayesian ML also has no guarantees about avoiding rare failures or about having good generalization (guarantees here require imposing additional properties on the prior and the data). Further, as noted earlier, I think ensembling is basically just taking some number of samples of predictive models (hypotheses) from the posterior anyway.[1]
If we assume that ensembling is just taking some number of samples from the posterior, then we can get relatively tight bounds on the probability that Bayesian ML would estimate (with no additional assumptions) using Hoeffding’s inequality.
For an ensemble of N models, these bounds won’t allow us to bound probabilities lower than 1/N, but we’ll probably run into other issues before these 1/N bounds are the main blocker. (As in, imagine the case where all of our ensemble reports a tiny probability of harm. We can bound the probability that Bayesian ML would report >1/N probability of harm. This is again by Hoeffding’s inequality. If we wanted to bound probabilities to <<1/N, then our ensemble won’t work, but I also expect that you run into other issues, such as hypotheses like “say there will be harm if humans would have been confused” which ensure that Bayesian ML always indicates some probability of harm in the cases we need to handle when using generalization.)
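As a concrete version of the Hoeffding argument: treating each ensemble member’s reported harm probability as an i.i.d. [0, 1]-valued sample from the posterior (the assumption above), the bound is P(|ensemble mean - posterior mean| >= t) <= 2·exp(-2Nt²). A tiny sketch, with N and t chosen purely for illustration:

```python
import math

def hoeffding_bound(n_members: int, tolerance: float) -> float:
    """Upper bound on P(|ensemble mean - posterior mean| >= tolerance), treating each
    member's reported harm probability as an i.i.d. [0, 1]-valued sample from the posterior."""
    return 2.0 * math.exp(-2.0 * n_members * tolerance**2)

# E.g. with 20 ensemble members, the posterior-mean harm probability differs from
# the ensemble average by more than 0.3 with probability at most ~0.055.
print(hoeffding_bound(20, 0.3))
```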
(Separately, I think deep neural networks are themselves relatively well described as ensembles, such that the variance of samples from the posterior will be low in practice. This makes ensembling as an approximation to Bayesian ML even better.)
(Perhaps you (and other people who think Bayesian ML would do something useful) don’t buy this “training a model is just sampling from the posterior” argument? Or perhaps you buy this argument, but you think we care a lot about probabilities <<1/N and also that we can rule out bad generalizations which will always cause noise in probability estimates this low?)
I think the reason Bayesian ML isn’t that widely used is because it’s intractable to do. So Bengio’s stuff would have to successfully make it competitive with other methods.
I think it’s probably tractable to apply Bayesian ML to learn moderately-small reporter heads on top of existing neural networks. And, in practice, I expect this probably gets you almost all of the advantages of doing Bayesian ML over the entire neural network.
For the most simple case, consider learning a linear probe on embeddings with Bayesian ML. This is totally computationally doable. (It’s just Bayesian Logistic Regression).
(When viewed from this perspective, I think that training a linear probe using the difference in means method will be something like the maximum likelihood estimator for this posterior. And, difference in means works well empirically and has other well motivated reasons for working.)
My understanding is that training tiny (e.g. 2 layers with n_embed=512) models (transformers, MLPs, whatever) with Bayesian ML is also doable, so we could just use such a tiny model as a reporter head and get most of the benefits of Bayesian ML.
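To gesture at why this is computationally cheap: a Laplace-approximation version of Bayesian logistic regression on frozen embeddings only needs a MAP fit plus one d×d Hessian. A minimal sketch, assuming frozen embeddings `X` and binary harm labels `y` in {-1, +1}; the prior, names, and Monte Carlo predictive are my choices, not a description of anyone’s actual setup:

```python
import numpy as np
from scipy.optimize import minimize

def fit_laplace_logreg(X, y, prior_var: float = 1.0):
    """MAP fit of logistic regression with a Gaussian prior, plus a Laplace
    (Gaussian) approximation to the posterior over probe weights."""
    n, d = X.shape

    def neg_log_posterior(w):
        # log(1 + exp(-y * <w, x>)) summed over examples, numerically stable
        nll = np.sum(np.logaddexp(0.0, -y * (X @ w)))
        return nll + 0.5 * np.dot(w, w) / prior_var

    w_map = minimize(neg_log_posterior, np.zeros(d), method="L-BFGS-B").x

    # Hessian of the negative log posterior at the MAP = posterior precision
    p = 1.0 / (1.0 + np.exp(-(X @ w_map)))
    precision = (X * (p * (1.0 - p))[:, None]).T @ X + np.eye(d) / prior_var
    return w_map, np.linalg.inv(precision)

def posterior_predictive(x, w_map, cov, n_samples: int = 1000, seed: int = 0):
    """Monte Carlo estimate of P(harm | x) under the approximate posterior."""
    ws = np.random.default_rng(seed).multivariate_normal(w_map, cov, size=n_samples)
    return float((1.0 / (1.0 + np.exp(-(ws @ x)))).mean())
```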
I think the reason we don’t see Bayesian ML is that ensembles probably just work fine in practice.
If we imagine that our prior is the space of (reporter) model initializations, then Bayesian ML will aim to approximate updating toward (reporter) models that performed well in training. Let’s simplify and imagine that instead Bayesian ML is just approximating the distribution of initializations which get better training-distribution performance than some threshold. We’ll refer to this distribution as the posterior. I claim that training an ensemble of N (reporter) models from different initializations is a very good approximation of sampling N (reporter) models from the posterior. I think this is basically a prediction of SLT and most other theories of how neural networks learn.
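Under that view, the cheap approximation is just a deep ensemble of reporter heads that differ only in seed; a minimal sketch of what “N samples from the posterior” looks like in practice (the scikit-learn setup here is a hypothetical stand-in for whatever reporter architecture is actually used):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_reporter_ensemble(X, y, n_members: int = 8):
    """Train N small reporter heads on frozen embeddings, varying only the seed
    (initialization and data shuffling); treat them as rough posterior samples."""
    return [
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=seed).fit(X, y)
        for seed in range(n_members)
    ]

def member_harm_probs(ensemble, x) -> np.ndarray:
    """Per-member P(harm | x); the spread across members is the (rough) posterior spread."""
    x = np.asarray(x).reshape(1, -1)
    return np.array([m.predict_proba(x)[0, 1] for m in ensemble])
```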
For the most simple case, consider learning a linear probe on embeddings with Bayesian ML. This is totally computationally doable. (It’s just Bayesian Logistic Regression).
IIRC Adam Gleave tried this in summer of 2021 with one of Chinchilla/Gopher while he was interning at DeepMind, and this did not improve on ensembling for the tasks he considered.