I’m not sure I have much more than the standard MIRI-style arguments about convergent rationality and fragility of human values, at least nothing is jumping to mind ATM. I do think we probably disagree about how strong those arguments are. I’m actually more interested in hearing your take on those lines of argument than saying mine ATM :P
Re: convergent rationality, I don’t buy it (specifically the “convergent” part).
Re: fragility of human values, I do buy the notion of a broad basin of corrigibility, which presumably is less fragile.
But really my answer is “there are lots of ways you can get confidence in a thing that are not proofs”. I think the strongest argument against is “when you have an adversary optimizing against you, nothing short of proofs can give you confidence”, which seems to be somewhat true in security. But then I think there are ways that you can get confidence in “the AI system will not adversarially optimize against me” using techniques that are not proofs.
(Note the alternative to proofs is not trial and error. I don’t use trial and error to successfully board a flight, but I also don’t have a proof that my strategy is going to cause me to successfully board a flight.)
But really my answer is “there are lots of ways you can get confidence in a thing that are not proofs”.
Totally agree; it’s an under-appreciated point!
Here’s my counter-argument: we have no idea what epistemological principles explain the empirical observation that these non-proof methods work. Therefore we don’t actually know that the confidence we achieve in these ways is justified. So we may just be wrong to be confident in our ability to successfully board flights (etc.)
The epistemic/aleatory distinction is relevant here. Taking an expectation over both kinds of uncertainty, we can achieve a high level of subjective confidence in such things, via such means. However, we may be badly mistaken, and thus, objectively speaking, still be extremely likely to be wrong.
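To make that concrete, here is a minimal sketch of the point (my own formalization, not something from the discussion above): let $\theta$ stand for the unknown "true epistemology" (the epistemic uncertainty) and let $P(\text{fail} \mid \theta)$ be the chance of failure given $\theta$ (the aleatory uncertainty). Subjective confidence comes from the mixture

$$P_{\text{subj}}(\text{fail}) = \int P(\text{fail} \mid \theta)\, p(\theta)\, d\theta,$$

which can be small even if $P(\text{fail} \mid \theta^*)$ is large for the actual $\theta^*$, so long as our prior $p$ puts little mass on $\theta^*$. That is the precise sense in which we can be subjectively confident yet objectively very likely to be wrong.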
This probably explains a lot of the disagreement, too: different people just have very different prior beliefs about how likely this kind of informal reasoning is to give us true beliefs about advanced AI systems.
I’m personally quite uncertain about that question, ATM. I tend to think we can get pretty far with this kind of informal reasoning in the “early days” of (proto-)AGI development, but we become increasingly likely to fuck up as we start having to deal with vastly super-human intelligences. I’d also like to see more work in epistemology aimed at addressing this (and other x-risk-relevant concerns, e.g. what principles of “social epistemology” would allow the human community to effectively manage collective knowledge that is far beyond what any individual can grasp? I’d argue we’re in the process of failing catastrophically at that).