I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it.
I guess my point was that any argument for confidence will likely be subject to the kinds of problems I listed, and I don’t see a realistic plan on Paul’s (or anyone else’s) part to deal with them.
Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s?
It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level. Without that, I don’t see how we can reason about a system that can encounter philosophical arguments, and make strong conclusions about whether it’s able to process them correctly. This seems intuitively obvious to me, but I don’t totally rule out that there is some sort of counterintuitive approach that could somehow work out.
Ah, gotcha. I’ll think about those points—I don’t have a good response. (Actually adding “think about”+(link to this discussion) to my todo list.)
It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.
Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?
Interesting question. I guess it depends on the form that the metaphilosophical knowledge arrives in, but it’s currently hard to see what that could be. I can only think of two possibilities, and neither seems highly plausible.
1. It comes as a set of instructions that humans (or emulations/models of humans) can use to safely and correctly process philosophical arguments, along with justifications for why those instructions are safe/correct. Kind of like a detailed design for meta-execution along with theory/evidence for why it works. But natural language is fuzzy and imprecise, and humans are full of unknown security holes, so it’s hard to see how such instructions could possibly make us safe/correct, or what kind of information could possibly make us confident of that.
2. It comes as a set of algorithms for reasoning about philosophical problems in a formal language, along with instructions/algorithms for how to translate natural-language philosophical problems/arguments into this formal language, and justifications for why these are all safe/correct. But this kind of result seems very far from any of our current knowledge bases, and it doesn’t seem compatible with any of the current trends in AI design (including things like deep learning, decision-theory-based ideas, and Paul’s kinds of designs).
So I’m not very optimistic that a metaphilosophical approach will succeed either. If it ultimately does, it seems like maybe there will have to be some future insights whose form I can’t foresee. (Edit: Either that or a lot of time free from arms-race pressure to develop the necessary knowledge base and compatible AI design for 2.)