(ETA: This comment may not make much sense unless the reader is familiar with section 2.2 of AI safety via debate.)
At the meta level, given an Opt that we have to use as a black box, it seems like:
1. Building an unaligned AI corresponds to P or at most NP
2. Verifying that a flaw in a proposed aligned AI design actually is a flaw corresponds to P
3. Checking whether an AI design is aligned (or has an alignment flaw) corresponds to NP or co-NP
4. Designing an actually aligned AI corresponds to Σ2P
5. Determining that aligned AI is impossible corresponds to Π2P
6. Determining whether there is an aligned AI that comes with a clear argument for it being aligned corresponds to NP
7. Determining whether there is a clear argument for aligned AI being impossible corresponds to NP
Does that seem right? For the impossibility part you’re proposing to do 7 but since the actual problem is closer to 5, it could easily be the case that aligned AI is impossible but there is no clear argument for it. (I.e., there’s no short argument that can convince you and you have to do the Π2P computation instead.) So I would think that if 6 is false then that actually is (probably) bad news.
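To pin down what I mean by those correspondences, here is a toy brute-force sketch of the quantifier structure of items 2–5 (my own illustration, not something from the debate paper; the designs, inputs, and is_flaw check are all hypothetical placeholders, and the point is only how many exists/forall alternations each task needs, not how hard each individual check is):

```python
from itertools import product

# Toy, brute-force semantics for items 2-5 above (my own sketch, not from the
# debate paper).  "Designs" and "inputs" are just 4-bit strings, and is_flaw
# stands in for the efficient check in item 2.

DESIGNS = [''.join(bits) for bits in product('01', repeat=4)]
INPUTS = DESIGNS

def is_flaw(design, x):
    # item 2 (P): hypothetical efficient check that x exhibits a flaw in design
    return design[:2] == x[:2]          # arbitrary toy rule

def has_flaw(design):
    # item 3 (NP / co-NP): one existential quantifier over inputs
    return any(is_flaw(design, x) for x in INPUTS)

def aligned_design_exists():
    # item 4 (Sigma_2^P): exists a design such that, for all inputs, no flaw
    return any(not has_flaw(d) for d in DESIGNS)

def aligned_ai_impossible():
    # item 5 (Pi_2^P): for all designs, there exists an input exhibiting a flaw
    return all(has_flaw(d) for d in DESIGNS)

print(aligned_design_exists(), aligned_ai_impossible())
```

Items 6 and 7 drop back down to NP because the object being searched for is a short argument that a (human) verifier can check directly.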
Since you're one of the authors of the paper that introduced the idea of human analogues of computational complexity classes, which I've found to be really interesting and useful (see here for another place that I use it), I'm curious about your thoughts on the ways I've used it (e.g., am I misunderstanding the idea or misusing it) and whether you have any further thoughts about it yourself, such as what kinds of problems you expect to be outside the human equivalent of NP.
Determining whether aligned AI is impossible seems harder than determining whether there is any hope for a knowably-aligned AI.
I think that when a design problem is impossible, there is often an argument for why it's impossible. Certainly that's not obvious though, and you might just be out of luck. (That said, it's also not obvious that problems in NP are easier to solve than Σ2P; both contain problems you just can't solve, and so you are relying on extra structural facts in either case.)
I think that when a design problem is impossible, there is often an argument for why it’s impossible.
My knowledge/experience is limited, but I know that in cryptography we generally can't find arguments for why it's impossible to design an algorithm to break a cipher, and have to rely on the fact that lots of smart people have tried to attack a cipher and nobody has succeeded. (Sometimes we can have a “security proof” in the sense of a reduction from the security of a cipher to a problem like factoring, but then we're still relying on the fact that lots of smart people have tried to design faster factoring algorithms and nobody has succeeded.) Note that this is only a Π1P (i.e., co-NP) problem, whereas determining that aligned AI is impossible is in Π2P according to my analysis.
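To make the parenthetical concrete, here is a toy sketch (my own illustration, not real crypto) of the kind of reduction I mean: the classic argument that an adversary who can break Rabin encryption, i.e. recover square roots mod n = p*q, can be turned into a factoring algorithm. The rabin_adversary below is a hypothetical stand-in that cheats by knowing p and q, which is just what lets the reduction run end to end on toy-sized numbers:

```python
import math
import random

def sqrt_mod_prime(a, p):
    # For p ≡ 3 (mod 4), a^((p+1)/4) mod p is a square root of a whenever a is a square.
    return pow(a, (p + 1) // 4, p)

def rabin_adversary(c, p, q):
    # Hypothetical oracle that breaks Rabin encryption: returns *some* square
    # root of c mod n = p*q, assembled from the roots mod p and mod q via CRT.
    n = p * q
    rp, rq = sqrt_mod_prime(c, p), sqrt_mod_prime(c, q)
    return (rq + q * ((rp - rq) * pow(q, -1, p) % p)) % n

def factor_using_adversary(p, q, tries=200):
    # The reduction: square a random x, ask the adversary for a root y; about
    # half the time y is not ±x mod n, and then gcd(x - y, n) is a nontrivial factor.
    n = p * q
    for _ in range(tries):
        x = random.randrange(2, n)
        if math.gcd(x, n) != 1:
            continue
        y = rabin_adversary(pow(x, 2, n), p, q)
        if y not in (x, n - x):
            return math.gcd(abs(x - y), n)
    return None

p, q = 1019, 1031          # tiny primes, both ≡ 3 (mod 4), purely for illustration
print(factor_using_adversary(p, q), "divides", p * q)
```

The direction of the reduction is the important part: it says breaking the cipher is at least as hard as factoring, so whatever empirical confidence we have that factoring is hard transfers to the cipher, but we still have no argument that factoring itself is hard.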
That said, it’s also not obvious that problems in NP are easier to solve than Σ2P, both contain problems you just can’t solve and so you are relying on extra structural facts in either case.
I guess that may be true in a raw compute sense, but there's also the issue of how you convince people that you've solved a problem. Again in crypto (with the quantifier forms I have in mind sketched below),
NP problems (find a flaw in a cipher) get solved by individual researchers,
co-NP problems (determine a cipher is secure) are “solved” by people trying and failing to break a cipher,
Σ2P problems (design a fast cipher free from security flaws) get solved by government-sponsored contests that the whole field participates in, and
Π2P problems (determine that any cipher faster than X can’t be secure) are left unsolved. (The fact that people tried and failed to design a secure cipher faster than X is really weak evidence that it’s impossible and of course nobody knows how to prove anything like this.)
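For concreteness, the quantifier forms I have in mind for those four (my own shorthand, with breaks(A, C) meaning "attack A breaks cipher C" and cost(C) its running time) are roughly:

```latex
% Quantifier forms for the four cipher problems above (my own shorthand;
% breaks(A, C) = "attack A breaks cipher C", cost(C) = how fast C runs, X = the speed bound).
\begin{align*}
\text{NP:}    \quad & \exists A.\ \mathrm{breaks}(A, C) \quad \text{for a given cipher } C \\
\text{co-NP:} \quad & \forall A.\ \lnot\mathrm{breaks}(A, C) \quad \text{for a given cipher } C \\
\Sigma_2^P:   \quad & \exists C \,\big[\mathrm{cost}(C) \le X\big].\ \forall A.\ \lnot\mathrm{breaks}(A, C) \\
\Pi_2^P:      \quad & \forall C \,\big[\mathrm{cost}(C) < X\big].\ \exists A.\ \mathrm{breaks}(A, C)
\end{align*}
```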