If SLT were to say nontrivial things about what instruction fine-tuning and RLHF are doing to models, and those things were verified in experiments, would that shift your skepticism?
If SLT results in interesting predictions in some case or was generally able to notably improve our ability to predict downstream behavior (including just within a subset of cases), that would be a large update from me. Seems like a reasonable goal to shoot for.
Could you explain why you are skeptical that current baseline methods can be dramatically improved?
Note the “absent substantial advances in science”! My perspective was[1] that it is pretty likely that for white box methods (methods that go beyond just training and black box behavior), the most simple and baseline approaches are reasonably likely to be pretty close to as good as you can get for reducing risk if you are just doing basic empirical monkeying. At least for risks from close-to-worst-case-scheming models, I think other threat models might be more amenable to complex white-box methods as they are easier to directly study.
(Threat models other than close-to-worst-case-scheming have the property that if they exist, you should be able to directly measure them and potentially you can measure how well the method worked on the model itself in a very analogous test bed; see discussion here for details. More minimally, it is easier to construct many empirical test beds if you aren’t worried that the model is directly trying to undermine your ability to study it.)
Why think this? The core reason is that is hard to build good empirical test beds for techniques using the internals of models in non-trivial ways. We don’t have the internals of a scheming model in front of us. We can certainly study behavioral interventions (black-box control) and we can try to make test beds which are as analogous as possible. But, it seems hard to get that analogous for a variety of reasons. You could demonstrate robust properties of models such that you can sufficiently robustly predict a property of scheming models to take advantage of. But, now we’re probably talking about fundamental advances in science.
I don’t think I’ve really adequately defended this perspective here, but this is at least roughly why I think this.
Incidentally, I don’t really understand what you mean by “captured by black-box control evaluations”. Was there a follow-up?
Black-box control evaluations are evaluations where you basically assume your model is worst case misaligned, you make conservative assumptions about inductive biases (e.g. you get nothing other than what the model is forced to do), and you try to see if the model is capable enough to cause problems even given your countermeasures.
Basically, in any place where you can’t establish a property with a capability evaluation, you make conservative/pessimal assumptions.
(I usually imagine some wiggle room, e.g., we need to study sample efficiency and exploration hacking so that we can argue about the effects of training and elicitation. We’d also want to answer some questions which are a bit tricky to answer with capability evaluations, but this is at least conceivable, e.g., how good is the AI at coming up with good strategies for subverting safeguards within a forward pass.)
(Oops, slow reply)
If SLT results in interesting predictions in some case or was generally able to notably improve our ability to predict downstream behavior (including just within a subset of cases), that would be a large update from me. Seems like a reasonable goal to shoot for.
Note the “absent substantial advances in science”! My perspective was[1] that it is pretty likely that for white box methods (methods that go beyond just training and black box behavior), the most simple and baseline approaches are reasonably likely to be pretty close to as good as you can get for reducing risk if you are just doing basic empirical monkeying. At least for risks from close-to-worst-case-scheming models, I think other threat models might be more amenable to complex white-box methods as they are easier to directly study.
(Threat models other than close-to-worst-case-scheming have the property that if they exist, you should be able to directly measure them and potentially you can measure how well the method worked on the model itself in a very analogous test bed; see discussion here for details. More minimally, it is easier to construct many empirical test beds if you aren’t worried that the model is directly trying to undermine your ability to study it.)
Why think this? The core reason is that is hard to build good empirical test beds for techniques using the internals of models in non-trivial ways. We don’t have the internals of a scheming model in front of us. We can certainly study behavioral interventions (black-box control) and we can try to make test beds which are as analogous as possible. But, it seems hard to get that analogous for a variety of reasons. You could demonstrate robust properties of models such that you can sufficiently robustly predict a property of scheming models to take advantage of. But, now we’re probably talking about fundamental advances in science.
I don’t think I’ve really adequately defended this perspective here, but this is at least roughly why I think this.
Black-box control evaluations are evaluations where you basically assume your model is worst case misaligned, you make conservative assumptions about inductive biases (e.g. you get nothing other than what the model is forced to do), and you try to see if the model is capable enough to cause problems even given your countermeasures.
Basically, in any place where you can’t establish a property with a capability evaluation, you make conservative/pessimal assumptions.
(I usually imagine some wiggle room, e.g., we need to study sample efficiency and exploration hacking so that we can argue about the effects of training and elicitation. We’d also want to answer some questions which are a bit tricky to answer with capability evaluations, but this is at least conceivable, e.g., how good is the AI at coming up with good strategies for subverting safeguards within a forward pass.)
I’ve updated somewhat from this position, partially based on latent adversarial training and also just after thinking about it more.