At a high level the case for Bayesian statistics in alignment is that if you want to control engineering systems that are learned rather than designed, and if that learning means choosing parameters that have high probability with respect to some choice of dataset and model, then it makes sense to understand what the basic structure of that kind of Bayesian learning is.
[...]
I claim that this basic structure is not yet well-understood, that it is nonetheless possible to make fundamental progress on understanding it at both a theoretical and empirical level, and that this understanding will be useful for alignment.
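As a rough sketch of the setup being referenced here (the notation below is mine, not from the quoted post): given a model $p(x \mid w)$ with parameters $w$, a prior $\varphi(w)$, and a dataset $D_n = (x_1, \ldots, x_n)$, the Bayesian posterior over parameters is

$$p(w \mid D_n) \;=\; \frac{\varphi(w)\,\prod_{i=1}^{n} p(x_i \mid w)}{\int \varphi(w')\,\prod_{i=1}^{n} p(x_i \mid w')\,dw'}.$$

“Parameters that have high probability” are then draws from (or modes of) this posterior, and singular learning theory (SLT, discussed below) studies its asymptotic structure: schematically, the Bayes free energy $F_n = -\log Z_n$, with $Z_n$ the normalizing integral above, satisfies $F_n \approx n L_n(w_0) + \lambda \log n$, where $L_n(w_0)$ is the empirical loss at an optimal parameter $w_0$ and $\lambda$ is the learning coefficient (RLCT), which for singular models such as neural networks can be much smaller than half the parameter count.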
I think I start from a position which is more skeptical than yours about the value of improving understanding in general, and also more skeptical of working on things which are closer to fundamental science without clearer theories of impact. (Fundamental science as opposed to work with a clearer and more straightforward path into the plan for making AI go well.)
This probably explains a bunch of our difference in views. (And this disagreement is probably hard to dig into given that it depends on a bunch of relatively messy heuristics and various views about how progress in deep learning typically happens.)
I don’t think fundamental-science-style theories of change are an unreasonable thing to work on (particularly given the capacity for huge speed-ups from AI automation); I just seem to be more skeptical of this type of work than you appear to be.
I think that’s right, in the sense that this explains a large fraction of our difference in views.
I’m a mathematician, so I suppose in my cosmology we’ve already travelled 99% of the distance from the upper reaches of the theory stratosphere to the ground and the remaining distance doesn’t seem like such an obstacle, but it’s fair to say that the proof is in the pudding and the pudding has yet to arrive.
If SLT were to say nontrivial things about what instruction fine-tuning and RLHF are doing to models, and those things were verified in experiments, would that shift your skepticism?
I’ve been reading some of your other writing:

However, we think that absent substantial advances in science, we’re unlikely to develop approaches which substantially improve safety-in-practice beyond baseline methods (e.g., training with RLHF and applying coup probes) without the improvement being captured by black-box control evaluations. We might discuss and argue for this in more detail in a follow-up post.
Could you explain why you are skeptical that current baseline methods can be dramatically improved? It seems possible to me that the major shortcomings of instruction fine-tuning and RLHF (that they seem to make shallow changes to representations and computation) are not fundamental. Maybe this is naive, since I haven’t thought about it very hard, but from our point of view representations “mature” over development and become rather rigid; however, maybe there’s something like Yamanaka factors!
Even from the perspective of black-box control, as a practical matter one could extract more useful work if the thing in the box is more aligned, so it seems you would agree that fundamental advances in these baseline methods would be welcome.
Incidentally, I don’t really understand what you mean by “captured by black-box control evaluations”. Was there a follow-up?
(Oops, slow reply)

If SLT were to say nontrivial things about what instruction fine-tuning and RLHF are doing to models, and those things were verified in experiments, would that shift your skepticism?
If SLT results in interesting predictions in some cases, or is generally able to notably improve our ability to predict downstream behavior (even just within a subset of cases), that would be a large update for me. Seems like a reasonable goal to shoot for.
Could you explain why you are skeptical that current baseline methods can be dramatically improved?
Note the “absent substantial advances in science”! My perspective was[1] that it is pretty likely that, for white-box methods (methods that go beyond just training and black-box behavior), the simplest baseline approaches are pretty close to as good as you can get for reducing risk if you are just doing basic empirical monkeying. This holds at least for risks from close-to-worst-case scheming models; I think other threat models might be more amenable to complex white-box methods, as they are easier to study directly.
(Threat models other than close-to-worst-case scheming have the property that, if they exist, you should be able to measure them directly, and you can potentially measure how well the method worked on the model itself in a very analogous test bed; see discussion here for details. More minimally, it is easier to construct many empirical test beds if you aren’t worried that the model is directly trying to undermine your ability to study it.)
Why think this? The core reason is that it is hard to build good empirical test beds for techniques that use the internals of models in non-trivial ways. We don’t have the internals of a scheming model in front of us. We can certainly study behavioral interventions (black-box control), and we can try to make test beds which are as analogous as possible, but it seems hard to get them that analogous, for a variety of reasons. You could demonstrate properties of models that are robust enough to let you reliably predict a property of scheming models and take advantage of it. But now we’re probably talking about fundamental advances in science.
I don’t think I’ve really adequately defended this perspective here, but this is at least roughly why I think this.
Incidentally, I don’t really understand what you mean by “captured by black-box control evaluations”. Was there a follow-up?
Black-box control evaluations are evaluations where you basically assume your model is worst-case misaligned, you make conservative assumptions about inductive biases (e.g., you get nothing other than what the model is forced to do), and you try to see if the model is capable enough to cause problems even given your countermeasures.
Basically, in any place where you can’t establish a property with a capability evaluation, you make conservative/pessimal assumptions.
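Here is a minimal sketch of the shape of such an evaluation, under the assumptions just described; the function and variable names are hypothetical, not from any existing codebase:

```python
# Hypothetical sketch of a black-box control evaluation. The red team supplies a
# worst-case attack policy for the untrusted model, the blue team supplies
# countermeasures, and anything not pinned down by a capability evaluation is
# given a pessimal value.

from typing import Callable, List, Tuple

def control_evaluation(
    tasks: List[str],
    attack_policy: Callable[[str], str],                       # red team: worst-case model behavior
    countermeasures: Callable[[str, str], Tuple[bool, bool]],  # blue team: returns (caught, catastrophe)
) -> bool:
    """Return True if the worst-case model causes a catastrophe before being caught."""
    for task in tasks:
        action = attack_policy(task)  # conservative assumption: the model plays its best available attack
        caught, catastrophe = countermeasures(task, action)
        if caught:
            return False              # scheming surfaced before any harm: safeguards win this run
        if catastrophe:
            return True               # an unacceptable outcome slipped past the countermeasures
    return False                      # no catastrophe over this proxy deployment
```

In practice the work is in making the attack policy genuinely worst-case (red-team elicitation) and in estimating a probability of catastrophe over many sampled runs rather than a single pass, but the conservative-assumptions structure is the same.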
(I usually imagine some wiggle room, e.g., we need to study sample efficiency and exploration hacking so that we can argue about the effects of training and elicitation. We’d also want to answer some questions which are a bit tricky to answer with capability evaluations, but this is at least conceivable, e.g., how good is the AI at coming up with good strategies for subverting safeguards within a forward pass.)

[1] I’ve updated somewhat from this position, partially based on latent adversarial training and also just after thinking about it more.
First, I think this class of work is critical for deconfusion, which is in turn critical if we need a theory for far more powerful AI systems, rather than for very smart but still fundamentally human-level systems.
Secondly, concretely, it seems that very few other approaches to safety have the potential to provide enough fundamental understanding to allow us to make strong statements about models before they are fully trained. This seems like a critical issue if we are concerned about very strong models that could pose risks during testing, or possibly even during training. And as far as I’m aware, nothing in the interpretability and auditing spaces has a real claim to be able to make clear statements about those risks, other than perhaps to suggest interim testing during model training—which could work, if a huge amount of such work is done, but seems very unlikely to happen.
Edit to add: Given the votes on this, what specifically do people disagree with?
I don’t strongly disagree, but I do weakly disagree on some points, so I guess I’ll answer.
Re the first: if you buy into automated alignment work by human-level AGI, then trying to align ASI now seems less worth it. The strongest counterargument to this that I see is that “human-level AGI” is impossible to get with our current understanding, as it will be superhuman at some things and weirdly bad at others.
Re the second: the disagreements might be nitpicking on “few other approaches” vs. “few currently pursued approaches”. There are probably a bunch of things that would allow fundamental understanding if they panned out (various agent foundations agendas, provably safe AI agendas like davidad’s), though one can argue they won’t apply to deep learning or are less promising to explore than SLT.
In addition to the point that current models are already strongly superhuman in most ways, I think that if you buy the idea that we’ll be able to do automated alignment of ASI, you’ll still need some reliable approach to “manual” alignment of current systems. We’re already far past the point where we can robustly verify LLMs’ claims or reasoning outside of narrow domains like programming and math.
But on point two, I strongly agree that agent foundations and Davidad’s agendas are also worth pursuing. (And in a sane world, we would have tens or hundreds of millions of dollars in funding for each of these every year.) Instead, it looks like we have Davidad’s ARIA funding, Jaan Tallinn and the LTFF funding some agent foundations and SLT work, and that’s basically it. And MIRI abandoned agent foundations, while Open Phil, it seems, isn’t putting money or effort into them.