(ETA: This comment may not make much sense unless the reader is familiar with section 2.2 of AI safety via debate.)
At the meta level, given an Opt that we have to use as a black box, it seems like:
1. Building an unaligned AI corresponds to P or at most NP
2. Verifying that a flaw in a proposed aligned AI design actually is a flaw corresponds to P
3. Checking whether an AI design is aligned (or has an alignment flaw) corresponds to NP or co-NP
4. Designing an actually aligned AI corresponds to Σ2P
5. Determining that aligned AI is impossible corresponds to Π2P
6. Determining whether there is an aligned AI that comes with a clear argument for it being aligned corresponds to NP
7. Determining whether there is a clear argument for aligned AI being impossible corresponds to NP
Does that seem right? For the impossibility part you’re proposing to do 7 but since the actual problem is closer to 5, it could easily be the case that aligned AI is impossible but there is no clear argument for it. (I.e., there’s no short argument that can convince you and you have to do the Π2P computation instead.) So I would think that if 6 is false then that actually is (probably) bad news.
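To pin down what I mean by those correspondences, here is a toy brute-force sketch of the quantifier structure of items 2–5 (my own illustration, not something from the debate paper; the designs, inputs, and is_flaw check are all hypothetical placeholders, and the point is only how many exists/forall alternations each task needs, not how hard each individual check is):

```python
from itertools import product

# Toy, brute-force semantics for items 2-5 above (my own sketch, not from the
# debate paper).  "Designs" and "inputs" are just 4-bit strings, and is_flaw
# stands in for the efficient check in item 2.

DESIGNS = [''.join(bits) for bits in product('01', repeat=4)]
INPUTS = DESIGNS

def is_flaw(design, x):
    # item 2 (P): hypothetical efficient check that x exhibits a flaw in design
    return design[:2] == x[:2]          # arbitrary toy rule

def has_flaw(design):
    # item 3 (NP / co-NP): one existential quantifier over inputs
    return any(is_flaw(design, x) for x in INPUTS)

def aligned_design_exists():
    # item 4 (Sigma_2^P): exists a design such that, for all inputs, no flaw
    return any(not has_flaw(d) for d in DESIGNS)

def aligned_ai_impossible():
    # item 5 (Pi_2^P): for all designs, there exists an input exhibiting a flaw
    return all(has_flaw(d) for d in DESIGNS)

print(aligned_design_exists(), aligned_ai_impossible())
```

Items 6 and 7 drop back down to NP because the object being searched for is a short argument that a (human) verifier can check directly.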
Since you're one of the authors of the paper that introduced the idea of human analogues of computational complexity classes, which I've found to be really interesting and useful (see here for another place that I use it), I'm curious about your thoughts on the ways I've used it (e.g., am I misunderstanding the idea or misusing it) and whether you have any further thoughts about it yourself, such as what kinds of problems you expect to be outside the human equivalent of NP.
Determining whether aligned AI is impossible seems harder than determining whether there is any hope for a knowably-aligned AI.
I think that when a design problem is impossible, there is often an argument for why it's impossible. Certainly that's not obvious though, and you might just be out of luck. (That said, it's also not obvious that problems in NP are easier to solve than Σ2P; both contain problems you just can't solve, and so you are relying on extra structural facts in either case.)
I think that when a design problem is impossible, there is often an argument for why it’s impossible.
My knowledge/experience is limited, but I know that in cryptography we generally can't find arguments for why it's impossible to design an algorithm to break a cipher, and have to rely on the fact that lots of smart people have tried to attack a cipher and nobody has succeeded. (Sometimes we can have a “security proof” in the sense of a reduction from the security of a cipher to a problem like factoring, but then we're still relying on the fact that lots of smart people have tried to design faster factoring algorithms and nobody has succeeded.) Note that this is only a Π1P (i.e., co-NP) problem, whereas determining that aligned AI is impossible is in Π2P according to my analysis.
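To make the parenthetical concrete, here is a toy sketch (my own illustration, not real crypto) of the kind of reduction I mean: the classic argument that an adversary who can break Rabin encryption, i.e. recover square roots mod n = p*q, can be turned into a factoring algorithm. The rabin_adversary below is a hypothetical stand-in that cheats by knowing p and q, which is just what lets the reduction run end to end on toy-sized numbers:

```python
import math
import random

def sqrt_mod_prime(a, p):
    # For p ≡ 3 (mod 4), a^((p+1)/4) mod p is a square root of a whenever a is a square.
    return pow(a, (p + 1) // 4, p)

def rabin_adversary(c, p, q):
    # Hypothetical oracle that breaks Rabin encryption: returns *some* square
    # root of c mod n = p*q, assembled from the roots mod p and mod q via CRT.
    n = p * q
    rp, rq = sqrt_mod_prime(c, p), sqrt_mod_prime(c, q)
    return (rq + q * ((rp - rq) * pow(q, -1, p) % p)) % n

def factor_using_adversary(p, q, tries=200):
    # The reduction: square a random x, ask the adversary for a root y; about
    # half the time y is not ±x mod n, and then gcd(x - y, n) is a nontrivial factor.
    n = p * q
    for _ in range(tries):
        x = random.randrange(2, n)
        if math.gcd(x, n) != 1:
            continue
        y = rabin_adversary(pow(x, 2, n), p, q)
        if y not in (x, n - x):
            return math.gcd(abs(x - y), n)
    return None

p, q = 1019, 1031          # tiny primes, both ≡ 3 (mod 4), purely for illustration
print(factor_using_adversary(p, q), "divides", p * q)
```

The direction of the reduction is the important part: it says breaking the cipher is at least as hard as factoring, so whatever empirical confidence we have that factoring is hard transfers to the cipher, but we still have no argument that factoring itself is hard.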
That said, it’s also not obvious that problems in NP are easier to solve than Σ2P, both contain problems you just can’t solve and so you are relying on extra structural facts in either case.
I guess that may be true in a raw compute sense, but there's also the issue of how you convince people that you've solved a problem. Again in crypto (with the quantifier forms I have in mind sketched below),
NP problems (find a flaw in a cipher) get solved by individual researchers,
co-NP problems (determine a cipher is secure) are “solved” by people trying and failing to break a cipher,
Σ2P problems (design a fast cipher free from security flaws) get solved by government-sponsored contests that the whole field participates in, and
Π2P problems (determine that any cipher faster than X can’t be secure) are left unsolved. (The fact that people tried and failed to design a secure cipher faster than X is really weak evidence that it’s impossible and of course nobody knows how to prove anything like this.)
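For concreteness, the quantifier forms I have in mind for those four (my own shorthand, with breaks(A, C) meaning "attack A breaks cipher C" and cost(C) its running time) are roughly:

```latex
% Quantifier forms for the four cipher problems above (my own shorthand;
% breaks(A, C) = "attack A breaks cipher C", cost(C) = how fast C runs, X = the speed bound).
\begin{align*}
\text{NP:}    \quad & \exists A.\ \mathrm{breaks}(A, C) \quad \text{for a given cipher } C \\
\text{co-NP:} \quad & \forall A.\ \lnot\mathrm{breaks}(A, C) \quad \text{for a given cipher } C \\
\Sigma_2^P:   \quad & \exists C \,\big[\mathrm{cost}(C) \le X\big].\ \forall A.\ \lnot\mathrm{breaks}(A, C) \\
\Pi_2^P:      \quad & \forall C \,\big[\mathrm{cost}(C) < X\big].\ \exists A.\ \mathrm{breaks}(A, C)
\end{align*}
```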