Some thoughts on my journey in particular:
When I joined AI safety in late 2017 (having read approximately nothing in the field), I thought of the problem as “construct a utility function for an AI system to optimize”, with a key challenge being the fragility of value. In hindsight this was clearly wrong.
The Value Learning sequence was in large part a result of my journey away from the utility function framing.
That being said, I suspect I continued to think that fragility-of-value type issues were a significant problem, probably until around mid-2019 (see next point).
(I did continue some projects more motivated from a fragility-of-value perspective, partly out of a heuristic of actually finishing things I start, and partly because I needed to write a PhD thesis.)
Early on, I thought of generalization as a key issue for deep learning and expected that vanilla deep learning would not lead to AGI for this reason. Again, in hindsight this was clearly wrong.
I was extremely surprised by OpenAI Five in June 2018 (not just that it worked, but also the ridiculous simplicity of the methods, in particular the lack of any hierarchical RL) and had to rethink my views.
It took me a while to work through that (at least months; e.g. you can see me continuing to be skeptical of deep learning in this Dec 2018 post).
I think I ended up close to my current views around early-to-mid-2019, e.g. I still broadly agree with the things I said in this August 2019 conversation (though I’ll note I was using “mesa optimizer” differently than it is used today; I agree with what I meant in that conversation, but I’d phrase it differently now).
I think by this point I was probably less worried about fragility of value. E.g. in that conversation I say a bunch of stuff that implies it’s less of a problem, most notably that AI systems will likely learn features similar to the ones humans use just from gradient descent, for reasons that LW would now call “natural abstractions”.
Note that this comment is presenting the situation as a lot cleaner than it actually was. I would bet there were many ways in which I was irrational / inconsistent; probably there were times when I would have said verbally that fragility of value wasn’t a big deal while still defending research projects centered on it from some other perspective, etc.
Some thoughts on how to update based on past things I wrote:
I don’t think I’ve ever thought of myself as largely agreeing with LW: my relationship to LW has usually been “wow, they seem to be getting some obvious stuff wrong” (e.g. I was persuaded of slow takeoff basically when Paul’s post and AI Impacts’ post came out in Feb 2018, and the Value Learning sequence in late 2018 was primarily a response to my perception that LW was way too anchored on the “construct a utility function” framing).
I think you don’t want to update too hard on the things that were said in blog posts addressed to an ML audience, or in papers that were submitted to conferences. Especially for the papers, there’s just a lot you couldn’t say about why you were doing the work, because peer reviewers would object (e.g. I heard second hand of a particularly egregious review to the effect of: “this work is technically solid, but the motivation is AGI safety; I don’t believe in AGI, so this paper should be rejected”).