I think that, perhaps as a result of weighing up the pros and cons, I initially wasn’t very motivated to comment (and haven’t been very motivated to engage much with ARC’s recent work). But I decided maybe it’s best to comment in a way that gives a better signal than silence.
I’ve generally been pretty confused about Formalizing the Presumption of Independence, and, as the post sort of implies, it is the main advert that ARC have at the moment for the type of conceptual work they are doing, so most of what I have to say is meta stuff about that.
Disclaimers: a) I have not spent a lot of time trying to understand everything in the paper; and b) as is often the case, this comment may come across as overly critical, but it seems highest leverage to discuss my biggest criticisms, i.e. the things that, if addressed, might cause me to update to the point where I would more strongly recommend that people apply, etc.
I suppose the tl;dr is that the paper’s main claimed contribution is the framing of a set of open problems, but the paper did not convince me that these problems are useful ones or that they would be interesting to answer.
I can try to explain a little more: It seemed odd that the “potential” applications to ML were mentioned only very briefly in the final appendix of the paper, when arguably the potential impact or usefulness of the paper really hinges on them. As a reader, it seems natural to expect that the authors would already have asked and answered, before writing the paper, questions like “OK, so what if I had this formal heuristic estimator? What exactly can I use it for? What can I actually (or even practically) do with it?” Some of what was said in the paper was fairly vague stuff like:
If successful, it may also help improve our ability to verify reasoning about complex questions, like those emerging in modern machine learning, for which we expect formal proof to be impossible.
In my opinion, it’s also important to bear in mind that a problem being ‘open’ is a poor proxy for things like usefulness or interestingness. (Obviously the famous number theory problems are open, but so are loads of random mathematical statements.) The usefulness or interestingness comes, of course, because people recognize various other valuable things too, like: that the solution would seem to require new insights into X, so a proof would ‘have to be’ deeply interesting in its own right; or that the truth of the statement implies all sorts of other interesting things; or that the articulation of the problem itself has captured and made rigorous some hitherto messy confusion; and so on. Perhaps more of these things need to be made explicit in order to argue more effectively that ARC’s stating of these open problems about heuristic estimators is an interesting contribution in itself?
To be fair, in the final paragraph of the paper there are some remarks that sort of admit some of what I’m saying:
Neither of these applications [to avoiding catastrophic failures or to ELK] is straightforward, and it should not be obvious that heuristic arguments would allow us to achieve either goal.
But practically it means that when I ask myself something like: ‘Why would I drop whatever else I’m working on and work on this stuff?’ I find it quite hard to answer in a way that’s not basically just all deference to some ‘vision’ that is currently undeclared (or as the paper says “mostly defer[red]” to “future articles”).
Having said all this, I’ll reiterate that there are lots of clear pros to a job like this, and I do think that there is important work to be done that is probably not too dissimilar from the kind being talked about in Formalizing the Presumption of Independence and in this post.
I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (MAD and ELK and finding gliders), but that writing is not particularly clear and doesn’t present a complete picture.
We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situate our work.
I hope to write up a reasonable pitch sometime over the next few weeks.
In the original document we also mention a non-ELK application, namely using a heuristic estimator for adversarial training, which is significantly more straightforward. I think this is helpful for validating the intuitive story that heuristic estimators would overcome limitations of black-box training, and in some sense I think that ELK and adversarial training together are the two halves of the alignment problem, so solving both is very exciting. That said, I’ve considered this in less detail than the ELK application. I’ll try to give a bit more detail on this in the child comment.
Discussing the application of heuristic estimators to adversarial training:
Suppose you have a trusted specification C for catastrophic behavior, and you want to train a model that never violates C. For simplicity, I’ll assume that your model M maps observations x to actions y, and C takes an (x, y) pair and determines whether it’s catastrophic. So we want a model for which C(x, M(x)) is very rarely true on the deployment distribution.
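To make this setup concrete, here is a minimal PyTorch sketch. Everything in it is an illustrative stand-in of mine rather than anything from the paper: M is a small network from observations to actions, and C is a cheap predicate playing the role of the trusted specification.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Stand-in for the model M: maps an observation x to an action y."""
    def __init__(self, obs_dim: int = 8, act_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def is_catastrophic(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Stand-in for the trusted specification C(x, y). Returns True where the
    (observation, action) pair counts as catastrophic; the rule here is purely
    illustrative, whereas in practice C might be an expensive trusted overseer."""
    return y.norm(dim=-1) > 10.0
```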
You could care about this if you have a trusted weak model which can check for catastrophic behavior given enough time (but which is too expensive to continuously oversee your model at inference time), as in iterated amplification. You might also care if you’ve gotten a specification by solving ELK, but want to train a model that never does something bad (rather than merely detecting bad behavior at runtime, when pausing could be catastrophic). In general I think that “we can detect bad behavior but the model does a treacherous turn anyway” is a plausible failure mode to address.
A heuristic estimator lets you assess the probability that a given model M violates C for any distribution D, i.e. P_{x∼D}[C(x, M(x))]. You can produce estimates even when (i) the probability is very small, and (ii) you can’t efficiently draw samples from D.
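Concretely, the object being asked for has roughly the interface below. This is my own sketch of the signature, not ARC’s formalism, and no implementation is known; the point is that it should return a differentiable estimate of log P_{x∼D}[C(x, M(x))] without needing to sample from D or observe catastrophes empirically.

```python
from typing import Callable
import torch
import torch.nn as nn

def heuristic_log_catastrophe_prob(
    model: nn.Module,                                             # M
    spec: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],   # C(x, y)
    dist: nn.Module,                                              # parameterization of D
) -> torch.Tensor:
    """Hypothetical heuristic estimator of log P_{x~D}[C(x, M(x))].
    It should (i) remain meaningful when the probability is far too small to
    observe by sampling, and (ii) not require the ability to sample from D.
    No such estimator currently exists; constructing one is the open problem."""
    raise NotImplementedError
```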
So we search for a distribution D on which M is particularly likely to behave catastrophically, estimate the log of the catastrophe probability under D, and then take a gradient descent step on M to reduce that probability (assuming that the estimator is differentiable, which is true for the examples we have considered and appears to follow from natural coherence properties). This approach can hope to achieve error probabilities much lower than (1 / training set size), and we can ask how M would behave given observations that we can recognize but can’t synthesize in the lab (which would otherwise be natural triggers for takeover attempts). In theory this overcomes the big limitations of adversarial training.
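Here is what a single step of that loop might look like. Since no real heuristic estimator exists, the self-contained sketch below fills that slot with a naive differentiable Monte-Carlo surrogate; the surrogate needs samples from D and cannot see probabilities below roughly 1/(number of samples), which are exactly the limitations a genuine heuristic estimator is meant to remove. All names are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # toy M

def soft_catastrophe(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Differentiable relaxation of a toy catastrophe predicate C(x, y)."""
    return torch.sigmoid(y.norm(dim=-1) - 10.0)

def sample_from_D(n: int) -> torch.Tensor:
    """Stand-in for the adversarial observation distribution D."""
    return torch.randn(n, 8)

def estimated_log_catastrophe_prob(model: nn.Module, n: int = 4096) -> torch.Tensor:
    """Monte-Carlo placeholder for the heuristic estimate of log P_{x~D}[C(x, M(x))].
    A real heuristic estimator would not need samples from D and would stay
    informative for probabilities far below 1/n."""
    x = sample_from_D(n)
    return torch.log(soft_catastrophe(x, model(x)).mean() + 1e-12)

# One outer-loop step: push M's parameters to reduce the estimated log
# catastrophe probability under the current adversarial distribution D.
opt_M = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = estimated_log_catastrophe_prob(model)
opt_M.zero_grad()
loss.backward()
opt_M.step()
print(f"estimated log catastrophe probability: {loss.item():.3f}")
```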
If you actually had a heuristic estimator you could immediately test this application. Detecting small catastrophe probabilities is particularly straightforward. In this setting D can be a neural net adversary—you can initialize with an LM asked to produce cases where M behaves badly, and then fine-tune D to optimize the catastrophe probability.
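And a matching sketch of the adversary search. Here D is just a diagonal Gaussian over observations with learnable mean and scale, fine-tuned by gradient ascent on the same Monte-Carlo placeholder; in the setting described above it would instead be a neural-net generator initialized from an LM prompted to elicit bad behavior, with the estimate coming from the real heuristic estimator.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # toy M, held fixed
for p in model.parameters():
    p.requires_grad_(False)

def soft_catastrophe(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Differentiable relaxation of a toy catastrophe predicate C(x, y)."""
    return torch.sigmoid(y.norm(dim=-1) - 10.0)

# Learnable adversarial distribution D over observations: a diagonal Gaussian
# here; in the text's setting, a generator initialized from a language model.
mean = torch.zeros(8, requires_grad=True)
log_std = torch.zeros(8, requires_grad=True)

def sample_from_D(n: int) -> torch.Tensor:
    # Reparameterized sampling, so gradients flow back into D's parameters.
    return mean + log_std.exp() * torch.randn(n, 8)

opt_D = torch.optim.Adam([mean, log_std], lr=1e-2)
for _ in range(200):
    x = sample_from_D(1024)
    # Placeholder for the heuristic estimate of log P_{x~D}[C(x, M(x))];
    # the adversary ascends it, i.e. we descend on its negation.
    log_p = torch.log(soft_catastrophe(x, model(x)).mean() + 1e-12)
    opt_D.zero_grad()
    (-log_p).backward()
    opt_D.step()

print(f"estimated log catastrophe probability under the tuned D: {log_p.item():.3f}")
```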
Have you seen https://www.alignment.org/blog/mechanistic-anomaly-detection-and-elk/ and any of the other recent posts on https://www.alignment.org/blog/? I don’t think they make it obvious that formalizing the presumption of independence would lead to alignment solutions, but they do give a much more detailed explanation of why you might hope so than the paper.
I can’t say anything rigorous, sophisticated, or credible. I can just say that the paper was a very welcome spigot of energy and optimism in my own model of why “formal verification”-style assurances and QA demands are ill-suited to models (either behavioral evals or reasoning about the output of decompilers).