I agree that (4) needs to be taken seriously, as (1) and (2) are hard to succeed at without making a lot of progress on (4), and (3) is just a catch-all for every other approach. It is also the hardest, as it probably requires breaking a lot of new ground, so people tend to work on what appears solvable. I thought some people were working on it, though, no? There is also a chance of proving that “an actual grounded definition of human preferences” is impossible in a self-consistent way, and we would have to figure out what to do in that case. The latter feels like a real possibility to me.
My impression continues to be that (4) is neglected. Stuart has been the most prolific person I can think of working on this question, and it’s a fast-falling power-law distribution after that: I have done some work myself, and beyond that not much comes to mind that addresses (4) in a technical manner that might lead to solutions useful for AI safety.
I have no doubt others have done things (Alexey has thought about this, and maybe published some of it?), and others could probably forget my work or Stuart’s as easily as I’ve forgotten theirs, because we don’t have a lot of momentum on this problem right now to keep it fresh in our minds. Or so is my impression of things now. I’ve had some good conversations with folks, and a few seem excited about working on (4) and seem qualified to do it, but no one but Stuart has yet produced very much published work on it.
(Yes, there is Eliezer’s work on CEV, but that is more a placeholder and wishful thinking than anything serious, and it has probably accidentally been the biggest bottleneck to work on (4), because so many people I talk to say things like “oh, we can just do CEV and be done with this, so let’s worry about the real problems”.)
I agree there is a risk that it is an impossible problem, and I actually think that risk is quite high, in that we may not be able to adequately aggregate human preferences in ways that result in something coherent. In that case I view safety and alignment as being more about avoiding catastrophe and cutting down the aligned-AI solution space to remove the things that clearly don’t work, rather than building towards things that clearly do. I hope I’m being too pessimistic.
In my experience, people mostly haven’t had the view of “we can just do CEV, it’ll be fine” and instead have had the view of “before we figure out what our preferences are, which is an inherently political and messy question, let’s figure out how to load any preferences at all.”
It seems like there needs to be some interplay here: “what we can load” informs “what shape we should force our preferences into”, and “what shape our preferences actually are” informs “what loading needs to be capable of to count as aligned”.
I wouldn’t say it’s neglected, just that people are busy laying foundations and that it’s probably too early to tackle the problem directly. In particular, grounding the preferences of real-world agents is an obvious application for any potential theory of embedded agency. (At least the way I think about it, grounding models and preferences is the main problem of embedded agency.)
It’s not obvious to me why this ought to be the case. Could you elaborate?
Even if we succeeded at (1), it would be hard to know that we’d succeeded without progress on (4). If we’re using one or more proxies, we don’t have a way to talk about how accurate they are without (4): we can’t evaluate how closely the proxies match the thing they’re supposed to proxy without grounding that thing.
For (2), if we want to talk about “low-impact” or anything like it, then we need a grounding of what kind of impact we care about—and that question falls under (4). If we forget about some kind of impact that humans actually do care about, then we’re in trouble.
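To make that dependence concrete, here is a toy sketch (purely illustrative; `Outcome`, `proxy`, and `true_preferences` are hypothetical stand-ins, not anything from the agenda): any attempt to score a proxy’s accuracy takes the grounded preference function as an input, so without (4) the score simply can’t be computed.

```python
from typing import Callable, Iterable

# Hypothetical stand-in: treat an "outcome" as an opaque label.
Outcome = str

def proxy_error(
    proxy: Callable[[Outcome], float],
    true_preferences: Callable[[Outcome], float],  # exactly what (4) would have to supply
    outcomes: Iterable[Outcome],
) -> float:
    """Mean absolute gap between a proxy and the grounded preferences.

    Note the signature: without a grounded `true_preferences`, this
    comparison cannot be evaluated at all, however good the proxy is.
    """
    outcomes = list(outcomes)
    return sum(abs(proxy(o) - true_preferences(o)) for o in outcomes) / len(outcomes)
```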
Yep ^_^ I make those points in the research agenda (section 3).
Exactly. You explained it better than I could :)
I’m also curious why this should be so.
I also continue to disagree with Stuart that low impact, in particular, is intractable without learning human values.
To be precise: I argue low impact is intractable without learning a subset of human values; the full set is not needed.
Thanks for clarifying! I haven’t brought this up on your research agenda because I’d prefer to have the discussion during an upcoming sequence of mine, and it felt unfair to comment on your agenda with just “I disagree, but I won’t elaborate right now”.