While I agree that it would be great to optimize worst-case performance, all of these techniques feel quite difficult to do scalably and with guarantees. With adversarial training, you need to find all of the ways that an agent could fail, whereas catastrophe could happen if the agent stumbles into any one of these failure modes. It seems plausible to me that with sufficient additional information given to the adversary we can meet this standard, but it seems very hard to knowably meet this standard, i.e. to have a strong argument that we will find all of the potential issues.
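To make the shape of the difficulty concrete, here is a minimal sketch of the kind of adversarial training loop I have in mind (purely illustrative; `adversary`, `is_catastrophic`, and `train_step` are hypothetical placeholders, not anything from the post). The worst-case guarantee rests entirely on the adversary's search covering every catastrophic input; nothing in the loop itself ensures that.

```python
# Minimal sketch of an adversarial training loop (hypothetical API).
# The worst-case guarantee depends entirely on the adversary finding
# every catastrophic input; nothing in the loop itself ensures that.

def adversarial_training(model, adversary, is_catastrophic, train_step, n_rounds=1000):
    """Repeatedly ask the adversary for inputs that break the model,
    and train the model not to fail on them."""
    for _ in range(n_rounds):
        # The adversary searches for an input on which the model fails.
        x = adversary.propose_failure(model)
        if x is None:
            # The adversary found nothing -- but this only shows the
            # *adversary* failed to find a failure, not that none exists.
            break
        if is_catastrophic(model, x):
            model = train_step(model, x)  # patch this particular failure
    return model
```

The loop can only patch the failures the adversary actually surfaces, which is why how much additional information the adversary is given matters so much.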
With verification, the specification problem seems like a deal-breaker unless it is combined with other methods: a central point of AI safety is that we can't write down a good specification for what we want. If we instead only use verification to propagate worst-case guarantees from one model to another (as the post suggests), then it seems possible in theory, but very expensive in practice: most verification techniques assume unlimited fast access to the specification, whereas our specification is very expensive to query. Of course, not much research has focused on this setting, so we can expect large gains; nonetheless, you do need to somehow extrapolate the specification to all possible inputs, which seems hard to do with limited access to the specification.
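As a toy illustration of the access problem (the function names and the query budget here are my own, purely hypothetical), suppose the specification is itself an expensive process we can only afford to query a limited number of times; then any verification procedure that works by checking inputs directly leaves almost everything unchecked:

```python
# Toy illustration (hypothetical): verifying a fast model against an
# expensive specification under a query budget. Exhaustive checking is
# impossible once queries to `spec` are costly.

def verify_against_expensive_spec(model, spec, inputs, query_budget):
    """Check model(x) == spec(x) on as many inputs as the budget allows.
    Anything beyond the budget is unverified extrapolation."""
    checked, queries = [], 0
    for x in inputs:
        if queries >= query_budget:
            break            # remaining inputs are never checked
        if model(x) != spec(x):
            return False, x  # found a concrete violation
        queries += 1
        checked.append(x)
    # No violation found on the checked inputs; any guarantee on the
    # unchecked inputs has to come from somewhere else (e.g. structure
    # of the model), which is the hard part discussed above.
    return True, checked
```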
Transparency also seems like it provides additional safety rather than making any guarantees, since we probably can't get a guarantee that our transparency mechanisms will show us all possible failure modes in a way that we can understand. The argument that we can only focus on the training data makes the assumption that the AI system is not going to generalize well outside of the training dataset. While I'm sympathetic to this assumption (we really don't have good methods for generalization, and there are strong reasons to expect generalization to be near-impossible), it isn't one that I'm confident about, especially when we're talking about general intelligence.
Of course, I’m still excited for more research to be done on these topics, since they do seem to cut out some additional failure modes. But I get the sense that you’re looking to have a semi-formal strong argument that we will have good worst-case performance, and I don’t see the reasons for optimism about that.
most verification techniques assume unlimited fast access to the specification
Making the specification faster than the model doesn't really help you. In this case the specification is somewhat more expensive than the model itself, but as far as I can tell that should just make verification somewhat more expensive.
That’s fair. I suspect I was assuming a really fast model. Are you imagining that both the model and the specification are quite fast, or that both are relatively slow?
I think my intuition is something like: the number of queries to the model / specification required to obtain worst-case guarantees is orders of magnitude more than the number of queries needed to train the model, and this ratio gets worse the more complex your environment is. (Though if the training distribution is large enough that it “covers everything the environment can do”, the intuition becomes weaker, mostly because this makes the cost of training much higher.)
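For a made-up back-of-envelope version of this intuition (all numbers are invented; the only point is how the ratio scales):

```python
# Back-of-envelope illustration of the intuition above.
# All numbers are made up; the point is only how the ratio scales.

n_train = 10**6          # queries needed to train the model (illustrative)

# A toy "environment": binary inputs of dimension d. Exhaustively
# certifying worst-case behavior would need one query per input.
for d in (20, 40, 60):
    n_worst_case = 2**d
    print(f"d={d}: worst-case queries / training queries = "
          f"{n_worst_case / n_train:.1e}")

# d=20: ~1.0e+00, d=40: ~1.1e+06, d=60: ~1.2e+12 -- the ratio gets
# worse as the environment gets more complex, unless the verifier can
# exploit structure instead of brute-force querying.
```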
the number of queries to the model / specification required to obtain worst-case guarantees is orders of magnitude more than the number of queries needed to train the model, and this ratio gets worse the more complex your environment is
Not clear to me whether this is true in general—if the property you are specifying is in some sense “easy” to satisfy (e.g. it holds of a random model, or holds for some model near any given model), and the behavior you are training is “hard” (e.g. requires almost all of the model’s capacity), then it seems possible that verification won’t add too much.
Yeah, I agree it is not always true and so isn’t a clear obstruction.
The argument that we can only focus on the training data makes the assumption that the AI system is not going to generalize well outside of the training dataset.
I’m not intending to make this assumption. The claim is: parts of your model that exhibit intelligence need to do something on the training distribution, because “optimize to perform well on the training distribution” is the only mechanism that makes the model intelligent.
That makes sense, and rereading the post, the transparency section is clearer now, thanks! If I had to guess what gave me the wrong impression before, it would be this part:
its behavior can only be intelligent when it is exercised on the training distribution
I suspect when I read this, I thought it implied “when it is not on the training distribution, its behavior cannot be intelligent”.
I also had trouble understanding that sub-clause. Maybe we read it in our head with the wrong emphasis:
Meaning: The agent gets inputs that are within the training distribution. ↔ The agent behaves intelligently.
But I guess it’s supposed to be:
Meaning: A behaviour is intelligent. ↔ The behaviour was exercised during training on the training distribution.