Thanks a lot for your detailed reply and sorry for my slow response (I had to take some exams!).
Regarding terminal goals, the only compelling one I have come across is coherent extrapolated volition (CEV), as outlined in Superintelligence. But how to actually program this into code is of course problematic, and I haven’t followed the literature closely since, so I may have missed rebuttals or better ideas.
I enjoyed your piece on Steered Optimizers, and I think it has given me examples of how algorithmic design and inductive biases can play a part in how controllable our system is. It also brings to mind this piece, which I suspect you may really enjoy: https://www.gwern.net/Backstop.
I am quite a believer in fast-takeoff scenarios, so I am unsure to what extent we can control a full AGI; but until it reaches criticality, the tools we have to test and control it will indeed be crucial.
One concern I have that you might be able to address: evolution did not optimize for interpretability! While DNNs are certainly quite black-box, they remain more interpretable than the brain. I assign some prior probability to the same relative interpretability holding between DNNs and neocortex-based AGI.
Another concern is with the human morals that you mentioned. This should certainly be investigated further, but I don’t think almost any human has an internally consistent set of morals. In addition, I think that the morals we have were selected for by the selfish gene, and even if we could re-simulate them through an evolution-like process we would get the good with the bad. https://slatestarcodex.com/2019/06/04/book-review-the-secret-of-our-success/ and a few other evolutionary-biology books have shaped my thinking on this.
> Regarding terminal goals the only compelling one I have come across is coherent extrapolated volition as outlined in Superintelligence. But how to even program this into code is of course problematic and I haven’t followed the literature closely since for rebuttals or better ideas.
I think the most popular alternatives to CEV are:

- “Do what I, the programmer, want you to do”, argued most prominently by Paul Christiano (cf. “Approval-directed agents”);
- variations on that (Stuart Russell’s book talks about showing a person different ways that their future could unfold and having them pick their favorite);
- task-limited AGI (“just do this one specific thing without causing general mayhem”), which I believe Eliezer was advocating we solve before trying to make a CEV maximizer;
- lots of ideas for systems that don’t look like agents with goals at all (e.g. CAIS).

A lot of these “kick the can down the road” and don’t try to answer big questions about the future, on the theory that future people with AGI helpers will be in a better position to figure out subsequent steps forward.
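As a side note, the “approval-directed” idea has a very simple skeleton. Here is a toy sketch (my own illustration, with a made-up `approval_model`, not anything from Christiano’s writing): instead of maximizing a utility function over world-states, the agent just picks whichever action its overseer model rates highest.

```python
# Toy approval-directed agent: choose the action the overseer approves of
# most, rather than the action that maximizes some world-utility.

def approval_directed_choice(actions, approval_model):
    """Return the action that the (hypothetical) overseer rates highest."""
    return max(actions, key=approval_model)

# A stand-in overseer model: approves of helpful, low-impact actions.
# These names and scores are invented purely for illustration.
approval = {"do_nothing": 0.2, "answer_question": 0.9, "seize_resources": -1.0}.get

best = approval_directed_choice(["do_nothing", "answer_question", "seize_resources"], approval)
print(best)  # answer_question
```

The point of the design is that the optimization pressure targets the overseer’s judgment of each local action, not a far-future world-state.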
> evolution did not optimize for interpretability!
Sure, and neither did Yann LeCun. I don’t know whether a DNN would be more or less interpretable than a neocortex with the same information content. I think we desperately need a clearer vision of what “interpretability tools” would look like in both cases, such that they would scale all the way to AGI. I (currently) see no way around having interpretability be a big part of the solution.
> I don’t think almost any human has an internally consistent set of morals
Strong agree. I do think we have a suite of social instincts which are largely common between people and hard-coded by evolution. But the instincts don’t add up to an internally consistent framework of morality.
> even if we could re-simulate them through an evolution-like process we would get the good with the bad.
I’m generally not assuming that we will run search processes that parallel what evolution did. I mean, maybe, but I don’t think it’s that likely, and it’s not the scenario I’m trying to think through. People are very good at figuring out algorithms based on their desired input-output relations and then coding them up, whereas evolution-like searches over learning algorithms are ridiculously computationally expensive and have little precedent. (E.g., we invented ConvNets; we didn’t discover ConvNets by an evolutionary search.)

Evolution has put learning algorithms into the neocortex, cerebellum, and amygdala, and I think humans will figure out what these learning algorithms are and directly write code implementing them. Evolution has put non-learning algorithms into the brainstem, and I suspect that the social instincts are in this category. If we make AGI with (some) human-like social instincts, I suspect it would be by people writing code that implements a subset of those algorithms or something similar. Those algorithms are not understood right now, and may well not be by the time we get AGI; I think that’s a bad thing, closing off an option.
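To illustrate why an evolution-like search over learning algorithms is so expensive: every candidate in the outer search has to be evaluated by running a full inner training loop, so the costs multiply. Here is a toy sketch (all numbers and the fitness function are made up for illustration; a real inner run would be an entire training job, not one arithmetic expression):

```python
import random

random.seed(0)  # deterministic for illustration

def inner_training_run(learning_rate, steps=1000):
    """Stand-in for a full training run with one candidate setting.

    Made-up fitness: pretend performance peaks at learning_rate = 0.1.
    """
    return -abs(learning_rate - 0.1) * steps

def evolve(generations=20, population=50):
    # Outer evolutionary search: generations * population inner runs total.
    pool = [random.uniform(0.0, 1.0) for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=inner_training_run, reverse=True)  # rank by fitness
        survivors = pool[: population // 2]              # keep the best half
        # Refill the pool with slightly mutated copies of the survivors.
        pool = survivors + [lr + random.gauss(0.0, 0.01) for lr in survivors]
    return max(pool, key=inner_training_run)

best_lr = evolve()  # should land near the (made-up) optimum of 0.1
```

Here the outer loop only tunes one scalar; searching over whole learning-algorithm designs multiplies both the search space and the per-candidate cost, which is the contrast with directly writing down the algorithm you want.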
Thanks for the gwern link!