A good candidate is the sharp left turn: alignment techniques that work for sub-human and human-level AIs may well stop working once an AI starts becoming superhuman.
Verification of alignment plans is probably the biggest one, though. We can’t verify alignment proposals from superhuman AI, from human-level AI, or even from other humans before trying them out, and trying them out may well kill us. I think the best way forward is to hire millions of alignment researchers and hope one of them comes up with a plan that can be verified in a way we don’t yet know.
My personal example of something like this is the Minié ball + rifled musket. It’s an idea from the mid-1800s (after the bolt-action rifle!) that greatly increased the accuracy, range, and lethality of muskets. Yet despite the required ideas, like rifling, being around since about 1500, and millions of people working over centuries to improve firearms, this obvious and easily verifiable idea took roughly 300 years to discover. There are plenty of in-hindsight-obvious ideas in AI, and I think (hope?) there is something on the scale of the Minié ball for alignment. After all, fewer than 300 people have worked on it for only ~20 years, and for much less time than that within the current neural-network paradigm.
I’d note that even if we don’t fully trust the author of an alignment proposal, we can still verify it. For example, if it’s a mathematical proof of alignment, we can check the proof with automated proof verification and reject anything too complex to audit.
This may not be possible in reality, but it’s an example where we don’t really need to trust the proposer.
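As a toy illustration of the “don’t trust the proposer” point, here is what a machine-checkable claim looks like in Lean 4 (my example, not from the thread): the kernel either accepts the proof or rejects it, regardless of who wrote it.

```lean
-- The Lean kernel checks this proof mechanically; we don't have to
-- trust whoever submitted it, only the small, well-audited kernel.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- A bogus claim is simply rejected at check time:
-- theorem bogus : 2 + 2 = 5 := rfl   -- fails to type-check
```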
Formal verification for specific techniques may be possible, and is desirable.
Formal verification for an entire overall plan… Let’s suppose we wanted a formal proof of some basic sanity checks of the plan. For example: if the plan is followed, then, as of 2100, there will be at least 8 billion humans alive and at least as happy and free as they are today. I mean, forget “happy” and “free”—how can you even define a “human” in formal mathematical language? Are they defined as certain arrangements of subatomic particles? Such a definition would presumably be unmanageably long. And if “human” is not defined, then that leaves you open to ending up with, say, 8 billion dumb automatons programmed to repeat “I’m happy and free”.
You might try relying on some preexisting process to decide if something is a human. If it’s a real-world process, like polling a trusted group of humans or sending a query to a certain IP address, this is vulnerable to manipulation of the real world (coercing the humans, hacking the server). You might try giving it a neural net that’s trained to recognize humans—the neural net can be expressed as a precise mathematical object—but then you’re vulnerable to adversarial selection, and might end up with bizarre-looking inanimate objects that the net thinks are human. (Plus there’s the question of exactly how you take a real-world human and get something that’s fed into the neural net. If the input to the net is pixels, then how is the photo taken, and can that be manipulated?)
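To make “adversarial selection” concrete, here’s a minimal PyTorch sketch (the untrained toy net is a stand-in for a real human-recognizer; all sizes and names here are made up): gradient ascent on the net’s “human” score produces an input the net loves, with no guarantee it resembles a human.

```python
import torch

# Stand-in for a trained "human recognizer"; output index 0 = "human" logit.
net = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 64 * 64, 2),
).eval()

x = torch.rand(1, 3, 64, 64, requires_grad=True)  # start from noise
opt = torch.optim.Adam([x], lr=0.01)

for _ in range(500):
    opt.zero_grad()
    human_logit = net(x)[0, 0]
    (-human_logit).backward()  # ascend the "human" score
    opt.step()
    with torch.no_grad():
        x.clamp_(0.0, 1.0)  # keep pixels in a valid image range

# x now maximizes the net's "human" score, but nothing constrains it to
# look human; in practice such optima are usually structured noise.
```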
Keep one’s eyes open for opportunities, I guess, but it seems likely that the scope of formal verification will be extremely limited. I expect it would be most useful in computer security, where the conclusion, the “thing to be proven”, is a statement about objects that have precise definitions. Though even then, the proofs may well be too long and complex for human verification.
http://www.underhanded-c.org/ is a nice illustration of how serious misbehavior can be inserted into innocuous-looking, relatively short programs (i.e., mathematically precise statements).
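In the same spirit, a toy Python analogue (mine, not from the contest): code that appears to redact a secret from a log line but silently leaks it.

```python
def scrub(line: str, secret: str) -> str:
    # Looks like it masks the secret with asterisks, but the trailing
    # `0` is str.replace's count argument: replace at most zero
    # occurrences, i.e. do nothing. The secret passes through verbatim.
    return line.replace(secret, "*" * len(secret), 0)

print(scrub("login password=hunter2", "hunter2"))
# prints: login password=hunter2
```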
To start off, this was just an example of verification being easier than coming up with the plan. There may be other paradigms under which we can verify a plan, but the general trend seems to be that verification of pretty much anything is easier than generation.
Also to your points:
I think a start would be proving that the system truly follows some arbitrary goal. And I don’t think learning which arrangements of light patterns constitute a human is that hard for neural nets; this was done a decade ago.
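For instance, a sketch of off-the-shelf person recognition using torchvision’s pretrained COCO detector (a real API; the random tensor is a stand-in for an actual photo):

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)

model = fasterrcnn_resnet50_fpn(
    weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT
).eval()

img = torch.rand(3, 480, 640)  # stand-in for a real photo
with torch.no_grad():
    out = model([img])[0]

# COCO class 1 is "person"; keep only confident detections.
people = [
    box
    for box, label, score in zip(out["boxes"], out["labels"], out["scores"])
    if label.item() == 1 and score.item() > 0.8
]
print(f"found {len(people)} people")
```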
The goal could be defined as: a group of typical rational humans would eventually agree that the world state is good, after considering all evidence and given unlimited paper to write down their thoughts. This can obviously fail, but I think it’s distinct from obvious paperclipping failures. I can still see failure modes here, but I don’t see why they are certain failures.
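One way to gesture at this formally (my notation, not a worked-out proposal): let $\mathrm{Verdict}(P, w, t)$ be the judgment an idealized panel $P$ of typical rational humans reaches about world state $w$ after $t$ steps of deliberation with unlimited scratch paper. The goal predicate is then

$$G(w) := \Big[\lim_{t \to \infty} \mathrm{Verdict}(P, w, t) = \text{good}\Big],$$

which assumes the verdict converges; whether it does, and whether $P$ can be specified without already solving the original problem, are exactly the failure modes I mean.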
I also don’t agree about the bizarre-looking inanimate objects. At worst I think that’s an unknown unknown. For example, the “faciest” face, the image that maximally activates a face recognizer, looks like a human face.
In addition, the goal won’t have a precise definition, because human goals aren’t precise, but that doesn’t rule out our neural net understanding some form of the goal. I’m also fairly sure there are mathematical frameworks that can deal with imprecise goals; QACI, for example, attempts this.
I also don’t see what a C program has to do with this. Proving 2 + 2 = 4 in math means 2 + 2 = 4, whereas a computer program is a set of instructions that you merely believe will produce some desired result.