Without having thought about this much, it seems to me that the control/alignment problem depends on the terminal goals we provide the AGI rather than on the substrate and algorithms it runs to reach AGI-level intelligence.
Ooh, good questions!

I agree that if an AGI has terminal goals, they should be terminal goals that yield good results for the future. (And don’t ask me what those are!) So that is indeed one aspect of the control/alignment problem. However, I’m not aware of any fleshed-out plan for AGI where you get to just write down its terminal goals, or indeed where the AGI necessarily even has terminal goals in the first place. (For example, what language would the terminal goals be written in?)
Instead, the first step would be to figure out how to build the system such that it winds up having the goals that you want it to have. That part is called the “inner alignment problem”; it is still an open problem, and I would argue that it’s a different open problem for each possible AGI algorithm architecture—since different algorithms can acquire / develop goals via different processes. (See here for excessive detail on that.)
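To make that distinction concrete, here is a minimal toy sketch (bandit-style, with entirely hypothetical names and rewards): the part we actually get to write down is a reward function, whereas the system’s effective “goals” end up living in whatever values it learns during training, and nothing in the code forces the two to match in situations the training never covered.

```python
# Toy sketch (hypothetical setup, not anyone's actual proposal): the designer
# writes down a reward function, but behavior is driven by learned value
# estimates -- the place where the agent's de facto "goals" actually live.
import random

def designer_reward(state, action):
    # This is the only part we get to write down explicitly.
    return 1.0 if (state, action) == ("button_visible", "press") else 0.0

learned_values = {}  # (state, action) -> value estimate, shaped by training

def act(state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    return max(actions, key=lambda a: learned_values.get((state, a), 0.0))

def train_step(state, actions, lr=0.5):
    a = act(state, actions)
    r = designer_reward(state, a)
    old = learned_values.get((state, a), 0.0)
    learned_values[(state, a)] = old + lr * (r - old)  # bandit-style update

for _ in range(500):
    train_step("button_visible", ["press", "ignore"])

# Whatever ends up in learned_values is what steers behavior from here on out:
print(learned_values)
```

The inner alignment problem is, roughly, how to get the learned object to reliably line up with the intended one, including far off-distribution.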
Can you provide some examples of what discoveries would indicate that this is an AGI route that is very dangerous or safe?
Sure!
One possible path to a (plausibly) good AGI future is to try to build AGIs that have a suite of moral intuitions similar to the one humans have. We want this to work really well, even in weird out-of-distribution situations (the future will be weird!), so we should ideally try to make this happen by using underlying algorithms similar to the ones humans use. Then they’ll have the same inductive biases, etc. This would especially involve human social instincts. I don’t really understand how human social instincts work (I do have vague ideas I’m excited to explore further! But I don’t actually know). It may turn out that the algorithms supporting those instincts rely in a deep and inextricable way on having a human body and interacting with humans in a human world. That would be evidence that this potential path is not going to work. Likewise, maybe we’ll find out that you fundamentally can’t have, say, a sympathy instinct without also having, say, jealousy and self-interest. Again, that would be evidence that this potential path is not as promising as it initially sounded. Conversely, we could figure out how the instincts work, see no barriers to putting them into an AGI, see how to set them up to be non-subvertible (such that the AGI cannot dehumanize / turn off sympathy), etc. Then (after a lot more work) we could form an argument that this is a good path forward, and start advocating for everyone to go down this development path rather than other potential paths to AGI.
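As a cartoon of what “non-subvertible” could mean architecturally, here is a toy sketch (hypothetical names and numbers, and it obviously dodges the hard part, namely making this robust when the planner is much smarter than the scoring code): the sympathy-flavored term is computed by fixed code outside the learned planner, so “turn off sympathy” just isn’t an available action.

```python
# Cartoon sketch of a hard-coded steering signal that the learned planner
# cannot edit. Hypothetical names; not a real proposal, just the architecture.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: nothing downstream can overwrite the weights
class SteeringSubsystem:
    sympathy_weight: float = 1.0

    def score(self, predicted_outcome):
        # Fixed code: task progress minus a penalty for predicted harm to others.
        return (predicted_outcome["task_progress"]
                - self.sympathy_weight * predicted_outcome["harm_to_others"])

def choose_action(candidate_actions, world_model, steering):
    # The learned parts (world_model, action proposals) predict outcomes;
    # the fixed steering subsystem decides which outcome counts as "good".
    return max(candidate_actions, key=lambda a: steering.score(world_model(a)))

# Toy learned world model: action -> predicted outcome
world_model = lambda a: {"task_progress": a["gain"], "harm_to_others": a["harm"]}
actions = [{"name": "ruthless", "gain": 10, "harm": 8},
           {"name": "considerate", "gain": 7, "harm": 0}]
print(choose_action(actions, world_model, SteeringSubsystem())["name"])  # considerate
```

A real planner could of course still search for plans that game the fixed scorer, which is part of why I said “after a lot more work”.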
If we find a specific, nice, verifiable plan to keep neocortex-like AGIs permanently under human control (and not themselves suffering, if you think that AGI suffering is possible & important), then that’s evidence that neocortex-like AGIs are a good route. Such a plan would almost certainly be algorithm-specific. See my random example here—in this example, the idea of an AGI acting conservatively sounds nice and all, but the only way to really assess the idea (and its strengths, weaknesses, and edge cases) is to make assumptions about what the AGI algorithm will be like. (I don’t pretend that that particular blog post is all that great, but I bet that with 600 more blog posts like that, maybe we would start getting somewhere...) If we can’t find such a plan, can we find a proof that no such plan exists? Either way would be very valuable. This is parallel to what Paul Christiano and colleagues at OpenAI & Ought have been trying to do, but their assumption is that AGI algorithms will be vaguely similar to today’s deep neural networks, rather than my assumption that AGI algorithms will be vaguely similar to a brain’s. (To be clear, I have nothing against assuming that AGI algorithms will be vaguely similar to today’s DNNs. That is definitely one possibility. I think both these research efforts should be happening in parallel.)
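To gesture at why “acting conservatively” only becomes assessable once you pin down the algorithm, here is one toy operationalization (hypothetical, and not what that blog post actually proposes): act autonomously only on options where an ensemble of learned value models agrees, and defer to a human otherwise. Whether something like this is safe, useless, or easily subverted depends entirely on the details of the underlying learning algorithm.

```python
# Toy operationalization of "act conservatively" (hypothetical, illustration only):
# act only where an ensemble of value models agrees; otherwise ask a human.
import statistics

def conservative_choice(candidate_actions, value_ensemble, max_disagreement=0.1):
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        scores = [value(action) for value in value_ensemble]
        if statistics.pstdev(scores) > max_disagreement:
            continue  # models disagree -> treat this option as off-limits
        if statistics.mean(scores) > best_score:
            best_action, best_score = action, statistics.mean(scores)
    return best_action if best_action is not None else "ASK_HUMAN"

# Usage with three slightly different toy value functions:
ensemble = [lambda a: a["benefit"] - 1.0 * a["weirdness"],
            lambda a: a["benefit"] - 1.2 * a["weirdness"],
            lambda a: a["benefit"] - 0.8 * a["weirdness"]]
actions = [{"name": "mundane", "benefit": 5, "weirdness": 0.1},
           {"name": "galaxy-brained", "benefit": 50, "weirdness": 9.0}]
print(conservative_choice(actions, ensemble)["name"])  # -> "mundane"
```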
I could go on. But anyway, does that help? If I said something confusing or jargon-y, just ask, I don’t know your background. :-)
Thanks a lot for your detailed reply and sorry for my slow response (I had to take some exams!).
Regarding terminal goals, the only compelling one I have come across is coherent extrapolated volition, as outlined in Superintelligence. But how to even program this into code is of course problematic, and I haven’t followed the literature closely since then for rebuttals or better ideas.
I enjoyed your piece on Steered Optimizers, and I think it has given me examples of how algorithmic design and inductive biases can play a part in how controllable our system is. This also brings to mind this piece, which I suspect you may really enjoy: https://www.gwern.net/Backstop.
I am quite a believer in fast-takeoff scenarios, so I am unsure to what extent we can control a full AGI, but until it reaches criticality, the tools we have to test and control it will indeed be crucial.
One concern I have that you might be able to address is that evolution did not optimize for interpretability! While DNNs are certainly quite black-box, they remain more interpretable than the brain. I assign some prior probability to the same relative interpretability holding between DNNs and neocortex-based AGI.
Another concern is with the human morals that you mentioned. This should certainly be investigated further, but I don’t think almost any human has an internally consistent set of morals. In addition, I think that the morals we have were selected for by the selfish gene, and even if we could re-simulate them through an evolution-like process, we would get the good with the bad. https://slatestarcodex.com/2019/06/04/book-review-the-secret-of-our-success/ and a few other evolutionary biology books have shaped my thinking on this.
Thanks for the gwern link!

Regarding terminal goals, the only compelling one I have come across is coherent extrapolated volition, as outlined in Superintelligence. But how to even program this into code is of course problematic, and I haven’t followed the literature closely since then for rebuttals or better ideas.
I think the most popular alternatives to CEV are: “Do what I, the programmer, want you to do”, argued most prominently by Paul Christiano (cf. “Approval-directed agents”); variations on that (Stuart Russell’s book talks about showing a person different ways their future could unfold and having them pick their favorite); task-limited AGI (“just do this one specific thing without causing general mayhem”), which I believe Eliezer was advocating we solve before trying to make a CEV maximizer; and lots of ideas for systems that don’t look like agents with goals (e.g. CAIS). A lot of these “kick the can down the road” and don’t try to answer big questions about the future, on the theory that future people with AGI helpers will be in a better position to figure out subsequent steps forward.
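For the first of those, here is a one-function sketch of the core idea as I understand it (my paraphrase, with hypothetical names, not Christiano’s actual formalism): instead of maximizing some long-horizon goal, the agent picks whichever action a (model of the) overseer would rate most highly.

```python
# One-function sketch of the approval-directed idea (my paraphrase; hypothetical
# names): pick the action the modeled overseer would approve of most.

def approval_directed_step(candidate_actions, predicted_approval):
    """predicted_approval: action -> estimated overseer rating in [0, 1]."""
    return max(candidate_actions, key=predicted_approval)

# Toy usage with a stand-in for a learned approval model:
approval_model = {"summarize the report": 0.9,
                  "seize more compute 'to be helpful'": 0.01,
                  "do nothing": 0.4}
print(approval_directed_step(list(approval_model), approval_model.get))
# -> "summarize the report"
```

Note how this kicks the can down the road in exactly the sense above: all the judgment lives in the overseer and the model of the overseer, not in a written-down goal about the future.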
evolution did not optimize for interpretability!
Sure, and neither did Yann LeCun. I don’t know whether a DNN would be more or less interpretable than a neocortex with the same information content. I think we desperately need a clearer vision for what “interpretability tools” would look like in both cases, such that they would scale all the way to AGI. I (currently) see no way around having interpretability be a big part of the solution.
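To be concrete about what I mean by “interpretability tools”, here is one present-day example in miniature: a linear probe that tests whether a human-legible concept is decodable from a network’s internal activations (toy data, hypothetical names). Whether anything in this genre scales to an AGI-level system, DNN-like or neocortex-like, is exactly the open question.

```python
# Miniature example of a present-day "interpretability tool": a linear probe
# testing whether a concept is linearly decodable from internal activations.
# Toy data and hypothetical names; real versions probe real networks' layers.
import numpy as np

def fit_linear_probe(activations, labels, l2=1e-3):
    X = np.hstack([activations, np.ones((len(activations), 1))])  # add bias column
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ labels)

def probe_accuracy(w, activations, labels):
    X = np.hstack([activations, np.ones((len(activations), 1))])
    return float(((X @ w > 0.5) == labels).mean())

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 16))            # stand-in for hidden-layer activations
concept = (acts[:, 3] + 0.1 * rng.normal(size=200) > 0)  # concept "leaks" into unit 3
w = fit_linear_probe(acts, concept.astype(float))
print(probe_accuracy(w, acts, concept))      # close to 1.0: the concept is decodable
```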
I don’t think almost any human has an internally consistent set of morals
Strong agree. I do think we have a suite of social instincts which are largely common between people and hard-coded by evolution. But the instincts don’t add up to an internally consistent framework of morality.
even if we could re-simulate them through an evolution-like process, we would get the good with the bad.
I’m generally not assuming that we will run search processes that parallel what evolution did. I mean, maybe, but I just don’t think it’s that likely, and it’s not the scenario I’m trying to think through. I think people are very good at figuring out algorithms based on their desired input-output relations and then coding them up, whereas evolution-like searches over learning algorithms are ridiculously computationally expensive and have little precedent. (E.g., we invented ConvNets; we didn’t discover ConvNets by an evolutionary search.) Evolution has put learning algorithms into the neocortex and cerebellum and amygdala, and I think humans will figure out what these learning algorithms are and directly write code implementing them. Evolution has put non-learning algorithms into the brainstem, and I suspect that the social instincts are in this category, and I suspect that if we make AGI with (some) human-like social instincts, it would be by people writing code that implements a subset of those algorithms or something similar. I think the algorithms are not understood right now, and may well not be by the time we get AGI, and I think that’s a bad thing, closing off an option.
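Just to gesture at the “ridiculously computationally expensive” point, here is a back-of-envelope comparison with completely made-up illustrative numbers: an evolution-like outer search pays for a full training run per candidate, so the cost multiplies.

```python
# Back-of-envelope with made-up illustrative numbers (not a real estimate):
# an evolution-like outer search pays for one full training run per candidate.
candidates_evaluated = 10_000        # hypothetical: population size x generations
gpu_hours_per_training_run = 5_000   # hypothetical cost of training one candidate
hand_design_debug_runs = 20          # hypothetical tuning runs for a hand-written algorithm

search_cost = candidates_evaluated * gpu_hours_per_training_run
direct_cost = hand_design_debug_runs * gpu_hours_per_training_run
print(f"evolution-like search: {search_cost:>12,} GPU-hours")
print(f"direct design:         {direct_cost:>12,} GPU-hours "
      f"(~{search_cost // direct_cost}x cheaper)")
```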