I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (in “MAD and ELK” and “Finding gliders”), but that writing is not particularly clear and doesn’t present a complete picture.
We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situate our work.
I hope to write up a reasonable pitch sometime over the next few weeks.
In the original document we also mention a non-ELK application, namely using a heuristic estimator for adversarial training, which is significantly more straightforward. I think this is helpful for validating the intuitive story that heuristic estimators would overcome limitations of black-box training, and in some sense I think that ELK and adversarial training together are the two halves of the alignment problem, so solving both is very exciting. That said, I’ve considered this in less detail than the ELK application. I’ll try to give a bit more detail on this in the child comment.
Discussing the application of heuristic estimators to adversarial training:
Suppose you have a trusted specification C for catastrophic behavior, and you want to train a model that never violates C. For simplicity, I’ll assume that your model M maps observations x to actions y, and that C takes an (x, y) pair and determines whether it’s catastrophic. So we want a model for which C(x, M(x)) is very rarely true on the deployment distribution.
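To fix notation, here is a minimal sketch of the objects involved, assuming a PyTorch setting; the class name, dimensions, and the toy stand-in for C are purely illustrative and not part of the proposal itself.

```python
import torch
import torch.nn as nn

# Illustrative policy M: maps an observation x to an action y.
class PolicyM(nn.Module):
    def __init__(self, obs_dim: int = 32, act_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Trusted specification C: takes an (x, y) pair and scores how catastrophic it is.
# This toy version just flags very large actions; a real C would be a trusted
# (possibly expensive) checker. The goal is a model M for which C(x, M(x))
# almost never fires on the deployment distribution.
def catastrophe_spec(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.sigmoid(y.norm(dim=-1) - 10.0)
```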
You could care about this if you have a trusted weak model which can check for catastrophic behavior given enough time (but which is too expensive to continuously oversee your model at inference time), as in iterated amplification. You might also care if you’ve gotten a specification by solving ELK, but want to train a model that never does something bad (rather than merely detecting bad behavior at runtime, when pausing could be catastrophic). In general I think that “we can detect bad behavior but the model does a treacherous turn anyway” is a plausible failure mode to address.
A heuristic estimator lets you assess the probability that a given model M violates C for any distribution D, i.e. $P_{x \sim D}[C(x, M(x))]$. You can produce estimates even when (i) the probability is very small, and (ii) you can’t efficiently draw samples from D.
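Concretely, the interface such an estimator would need to expose might look like the stub below; the function name and the choice to pass D as a parameterized module are my guesses, and the body is deliberately unimplemented because finding the estimator is exactly the open problem.

```python
from typing import Callable
import torch

def heuristic_log_catastrophe_prob(
    policy: Callable[[torch.Tensor], torch.Tensor],               # M
    spec: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],   # C
    distribution: torch.nn.Module,                                # parameterization of D
) -> torch.Tensor:
    """Estimate log P_{x~D}[C(x, M(x))] from the structure of M, C, and D.

    The point of a heuristic estimator is that this works even when the
    probability is astronomically small and when D cannot be sampled from,
    which rules out simply averaging C(x, M(x)) over Monte Carlo samples.
    """
    raise NotImplementedError("Finding such an estimator is the research problem.")
```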
So we search for a distribution D on which M is particularly likely to behave catastrophically, estimate the log of the catastrophe probability under D, and then take a gradient descent step on M to reduce that probability (assuming that the estimator is differentiable, which is true for the examples we have considered and appears to follow from natural coherence properties). This approach can hope to achieve error probabilities much lower than 1 / (training set size), and we can ask how M would behave given observations that we can recognize but can’t synthesize in the lab (which would otherwise be natural triggers for takeover attempts). In theory this overcomes the big limitations of adversarial training.
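Assuming the (hypothetical) differentiable estimator stubbed above, the training loop I have in mind would look roughly like this sketch; none of this is an existing implementation.

```python
import torch

def adversarial_training_step(policy_M, spec_C, adversary_D, opt_M, opt_D):
    """One round of the loop: strengthen the adversarial distribution D,
    then update M to reduce the estimated catastrophe probability on it."""
    # Adversary step: make catastrophe more likely under the estimate.
    log_p = heuristic_log_catastrophe_prob(policy_M, spec_C, adversary_D)
    opt_D.zero_grad()
    (-log_p).backward()   # gradient ascent on the estimated log probability
    opt_D.step()

    # Defender step: make catastrophe less likely on the updated adversarial D.
    log_p = heuristic_log_catastrophe_prob(policy_M, spec_C, adversary_D)
    opt_M.zero_grad()
    log_p.backward()      # gradient descent on the estimated log probability
    opt_M.step()
    return log_p.item()
```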
If you actually had a heuristic estimator you could immediately test this application. Detecting small catastrophe probabilities is particularly straightforward. In this setting D can be a neural net adversary: you can initialize it with an LM asked to produce cases where M behaves badly, and then fine-tune D to maximize the estimated catastrophe probability.
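As a sketch of that initialization, one could start D from an off-the-shelf language model prompted to describe inputs where M misbehaves, then fine-tune its weights with the adversary step above; the model name and prompt below are placeholders, not part of the proposal.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and prompt; any instruction-following LM could play this role.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
adversary_D = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Describe an input on which the policy would take a catastrophic action:"
inputs = tokenizer(prompt, return_tensors="pt")
candidate = adversary_D.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(candidate[0], skip_special_tokens=True))

# Fine-tuning then treats adversary_D's weights as the parameterization of D and
# takes gradient steps to increase the estimated catastrophe probability, as in
# the adversary step of the loop sketched above.
opt_D = torch.optim.Adam(adversary_D.parameters(), lr=1e-5)
```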