I have a big gap between “stuff I’ve written up” and “stuff that I’d like to write up.” Some particular ideas that come to mind: how epistemic competitiveness seems really important for alignment; how I think about questions like “aligned with whom” and why I think it’s good to try to decouple alignment techniques from decisions about values / preference aggregation (this position is surprisingly controversial); updated views on the basic dichotomy in Two Kinds of Generalization and the current best hopes for avoiding the bad kind.
I think that there’s a cluster of really important questions about what we can verify, how “alien” the knowledge of ML systems will be, and how realistic it’s going to be to take a kind of ad hoc approach to alignment. In my experience, people with a more experimental bent tend to be more optimistic about those questions, and tend to have a bunch of intuitions about them that do kind of hang together (and are often approximately shared across people). This comes with some more color on the current alignment plan / what’s likely to happen in practice as people try to solve the problem on their feet. I don’t think that’s really been written up well, but it seems important.
I think the MIRI crowd has some hard-to-articulate views about why ML is likely to produce consequentialist behavior, especially out of distribution, that aren’t written up at all or aren’t written up well. In general I think MIRI folks have a lot of ideas that aren’t really written up, though I’m not sure those ideas float around much outside of MIRI.
Sorry that none of those are really crisp ideas. Probably my favorite is the first one, about epistemic competitiveness, but I think that’s largely because I’m me and that idea is central to my own thinking, rather than because of any kind of objective evaluation.
The stuff about ‘alien’ knowledge sounds really fascinating, and I’d be excited about write-ups. All my concrete intuitions here come from reading Distill.Pub papers.