Rohin Shah comments on AI Alignment Open Thread August 2019

Rohin Shah 7 Aug 2019 19:05 UTC
LW: 6 AF: 4
AF
This sounds like the normative claim, not the empirical one, given that you said “what we want is...”
- David Scott Krueger (formerly: capybaralet) 9 Aug 2019 16:13 UTC
  LW: 1 AF: 1
  AF Parent
  Yep, good catch ;)
  I *do* put a non-trivial weight on models where the empirical claim is true, and not just out of epistemic humility. But overall, I’m epistemically humble enough these days to think it’s not reasonable to say “nearly inevitable” if you integrate out epistemic uncertainty.
  But maybe it’s enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?
  Or are you just trying to see if anyone can defeat the epistemic humility “trump card”?
  - Rohin Shah 9 Aug 2019 22:14 UTC
    LW: 2 AF: 1
    AF Parent
    Or are you just trying to see if anyone can defeat the epistemic humility “trump card”?
    Partly (I’m surprised by how confident people generally seem to be, but that could just be a misinterpretation of their position), but also on my inside view the empirical claim is not true and I wanted to see if there were convincing arguments for it.
    But maybe it’s enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?
    Yeah, I’d be interested in your answers anyway.
    - David Scott Krueger (formerly: capybaralet) 15 Aug 2019 4:13 UTC
      LW: 1 AF: 1
      AF Parent
      I’m not sure I have much more than the standard MIRI-style arguments about convergent rationality and fragility of human values, at least nothing is jumping to mind ATM. I do think we probably disagree about how strong those arguments are. I’m actually more interested in hearing your take on those lines of argument than saying mine ATM :P
      - Rohin Shah 15 Aug 2019 19:57 UTC
        LW: 6 AF: 3
        AF Parent
        Re: convergent rationality, I don’t buy it (specifically the “convergent” part).
        Re: fragility of human values, I do buy the notion of a broad basin of corrigibility, which presumably is less fragile.
        But really my answer is “there are lots of ways you can get confidence in a thing that are not proofs”. I think the strongest argument against is “when you have an adversary optimizing against you, nothing short of proofs can give you confidence”, which seems to be somewhat true in security. But then I think there are ways that you can get confidence in “the AI system will not adversarially optimize against me” using techniques that are not proofs.
        (Note the alternative to proofs is not trial and error. I don’t use trial and error to successfully board a flight, but I also don’t have a proof that my strategy is going to cause me to successfully board a flight.)
        David Scott Krueger (formerly: capybaralet) 21 Aug 2019 16:18 UTC
        LW: 1 AF: 1
        AF Parent
        But really my answer is “there are lots of ways you can get confidence in a thing that are not proofs”.
        Totally agree; it’s an under-appreciated point!
        Here’s my counter-argument: we have no idea what epistemological principles explain this empirical observation. Therefor we don’t actually know that the confidence we achieve in these ways is justified. So we may just be wrong to be confident in our ability to successfully board flights (etc.)
        The epistemic/aleatory distinction is relevant here. Taking an expectation over both kinds of uncertainty, we can achieve a high level of subjective confidence in such things / via such means. However, we may be badly mistaken, and thus still extremely likely objectively speaking to be wrong.
        This also probably explains a lot of the disagreement, since different people probably just have very different prior beliefs about how likely this kind of informal reasoning is to give us true beliefs about advanced AI systems.
        I’m personally quite uncertain about that question, ATM. I tend to think we can get pretty far with this kind of informal reasoning in the “early days” of (proto-)AGI development, but we become increasingly likely to fuck up as we start having to deal with vastly super-human intelligences. And would like to see more work in epistemology aimed at addressing this (and other Xrisk-relevant concerns, e.g. what principles of “social epistemology” would allow the human community to effectively manage collective knowledge that is far beyond what any individual can grasp? I’d argue we’re in the process of failing catastrophically at that)