The three directions of reliability/robustness, reward learning, and amplification seem great, though robustness seems particularly hard to achieve. While there is ongoing work on adversarial training, interpretability, and verification, even if all of the problems that researchers currently work on were magically solved, I don’t have a story for how that would lead to robustness of (say) an agent trained by iterated amplification.
I am more conflicted about the desiderata. They seem very difficult to satisfy, and they don’t seem strictly necessary for good outcomes. The underlying view here is that we should aim for something we know is sufficient to achieve good outcomes, and only weaken our requirements if we find a fundamental obstruction. My main issue with this view is that even if the requirements really are impossible to satisfy, it seems very hard to know this, so we may spend a lot of time trying to satisfy them, with most of that work ending up useless. For example, I can imagine that we spend several years trying to achieve robustness in order to get a secure AI system, and it turns out to be impossible to achieve robustness in a way where we know the system is robust, yet in practice any AI system we train is sufficiently robust that it never fails catastrophically. In this world, we keep trying to achieve robustness, never find a fundamental obstruction, but also never succeed at creating a knowably secure AI system.
Another way of phrasing this is that I am pessimistic about the prospects of conceptual thinking, which seems to be the main way we could find a fundamental obstruction. (Theory and empirical experiments can build intuitions about what is and isn’t hard, but given the complexities of the real world, it seems unlikely that either would give us the sort of crystallized knowledge that you’re aiming for.) Phrased this way, I put less credence in this opinion, because there are a few examples of conceptual thinking being very important, though not that many.