I think theoretical work on AI safety has multiple different benefits, but I prefer a slightly different categorization. I like categorizing in terms of the sort of safety guarantees we can get, on a spectrum from “stronger but harder to get” to “weaker but easier to get”. Specifically, the reasonable goals for such research IMO are as follows.
Plan A is having (i) a mathematical formalization of alignment, (ii) a specific practical algorithm, and (iii) a proof that this algorithm is aligned, or at least a solid base of theoretical and empirical evidence, similar to the situation in cryptography. This more or less corresponds to World 2.
Plan B is having (i) a mathematical formalization of alignment, (ii) a specific practical algorithm, (iii) a specific impractical but provably aligned algorithm, and (iv) informal and empirical arguments suggesting that the former algorithm is as aligned as the latter. As an analogy, consider Q-learning (an impractical algorithm with provable convergence guarantees) and deep Q-learning (a practical algorithm with no currently known convergence guarantees, designed by analogy to the former). This sort of still corresponds to World 2 but not quite.
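To make the Q-learning half of the analogy concrete, here is a minimal sketch of tabular Q-learning on a toy problem. The 5-state chain environment, the constants, and the names `step` and `q_learning` are all assumptions made up for illustration; nothing here comes from the discussion itself. The table `q` is the part that deep Q-learning replaces with a neural network, which is roughly where the known convergence argument stops applying.

```python
import random

# Hypothetical toy environment: a chain of states 0..4, where 4 is terminal.
N_STATES = 5
ACTIONS = (0, 1)   # 0 = move left, 1 = move right
GAMMA = 0.9        # discount factor (illustrative choice)
ALPHA = 0.1        # learning rate (illustrative choice)


def step(state, action):
    """Deterministic transition; reward 1 only on reaching the terminal state."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=2000, seed=0):
    """Tabular Q-learning with a uniformly random behavior policy.

    Q-learning is off-policy, so the greedy policy can be learned
    while acting randomly, as long as every state-action pair keeps
    being visited.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            # Bootstrapped target: no future value past a terminal state.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
    return q


q_table = q_learning()
# Greedy action in each non-terminal state, read off the learned table.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

In a table this small every state-action pair is visited over and over, which is the kind of condition the classical convergence results rely on. That condition is exactly what becomes unrealistic at scale, hence "impractical but provable" versus "practical but unproven".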
Plan C is having enough theory to at least have rigorous models of all possible failure modes, and theory-inspired informal and empirical arguments why a certain algorithm avoids them. As an analogy, concepts such as VC dimension and Rademacher complexity allow us to be more precise in our reasoning about underfitting and overfitting, even if we don’t know how to compute them in practical scenarios. This corresponds to World 1, I guess?
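As an illustration of the kind of statement meant here, this is the standard textbook form of a Rademacher-complexity generalization bound. The notation is an assumption chosen for illustration, and the bound assumes i.i.d. samples and a loss $\ell$ bounded in $[0,1]$. With probability at least $1-\delta$ over a sample $S$ of size $n$, every hypothesis $h \in \mathcal{H}$ satisfies:

$$
R(h) \;\le\; \hat{R}_S(h) \;+\; 2\,\mathfrak{R}_n(\ell \circ \mathcal{H}) \;+\; \sqrt{\frac{\ln(1/\delta)}{2n}}
$$

Here $R(h)$ is the true risk, $\hat{R}_S(h)$ the empirical risk on $S$, and $\mathfrak{R}_n(\ell \circ \mathcal{H})$ the Rademacher complexity of the loss class. A large $\hat{R}_S(h)$ is the underfitting regime, while a small $\hat{R}_S(h)$ paired with a large complexity term is the overfitting regime. The complexity term is usually not computable for practical models, yet the decomposition still tells us which quantity a given failure would have to come from.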
In a sane civilization the solution would be not building AGI until we can implement Plan A. In the real civilization, we should go with the best plan that will be ready by the time competing projects become too dangerous to ignore.
World 3 seems too ambitious to me, since analyzing arbitrary code is almost always intractable, and for nontrivial behavioral properties it is undecidable in general (Rice’s theorem). You would need at least some constraints on how your agent is designed.
I think that the plans you lay out are all directly talking about the AI system we eventually build, and as a result I’m more optimistic about them (and your work, as it’s easy to see how it makes progress towards these plans) relative to HRAD.
In contrast, as far as I can tell, HRAD work does not directly contribute to any of these plans, and instead the case seems to rely on something more indirect where a better understanding of reasoning will later help us execute on one of these plans. It’s this indirection that makes me worried.
Well, HRAD certainly has relations to my own research programme. Embedded agency seems important since human values are probably “embedded” to some extent; counterfactuals are important for translating knowledge from the user’s subjective vantage point to the AI’s subjective vantage point; and reflection is important if it’s required for high capability (as Turing RL suggests). I do agree that having a high-level plan for solving the problem is important to focus the research in the right directions.
I can barely see how this is possible if we’re talking about alignment to humans, even with a hypothetical formal theory of embedded agency. Do you imagine human values are cleanly represented and extractable, and that we can (potentially very indirectly) reference those values formally? Do you mean something else by “formalization of alignment” that doesn’t involve formal descriptions of human minds?
For examples of what a formalization of alignment could look like, see this and this.