I think it's worth noting that even if we don't fully trust the author of an alignment proposal, we can still verify it. For example, if the proposal comes with a mathematical proof of alignment, we can check the proof's correctness with automated proof verification and reject anything that's too complex.
This may not be possible in reality but it’s an example where we don’t really need to trust the proposer.
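To make "verify without trusting the proposer" concrete, here is a minimal illustration in Lean (my example; the theorem name is mine). A proof checker accepts or rejects the proof term mechanically, regardless of who wrote it, and a gatekeeping policy could additionally reject proof terms above some size threshold:

```lean
-- A machine-checkable proof: the Lean kernel verifies this with no
-- trust in the author. A wrapper script could also reject submissions
-- whose proof terms exceed a chosen size limit.
theorem two_add_two : 2 + 2 = 4 := rfl
```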
Formal verification for specific techniques may be possible, and is desirable.
Formal verification for an entire overall plan… Let’s suppose we wanted a formal proof of some basic sanity checks of the plan. For example: if the plan is followed, then, as of 2100, there will be at least 8 billion humans alive and at least as happy and free as they are today. I mean, forget “happy” and “free”—how can you even define a “human” in formal mathematical language? Are they defined as certain arrangements of subatomic particles? Such a definition would presumably be unmanageably long. And if “human” is not defined, then that leaves you open to ending up with, say, 8 billion dumb automatons programmed to repeat “I’m happy and free”.
You might try relying on some preexisting process to decide if something is a human. If it’s a real-world process, like polling a trusted group of humans or sending a query to a certain IP address, this is vulnerable to manipulation of the real world (coercing the humans, hacking the server). You might try giving it a neural net that’s trained to recognize humans—the neural net can be expressed as a precise mathematical object—but then you’re vulnerable to adversarial selection, and might end up with bizarre-looking inanimate objects that the net thinks are human. (Plus there’s the question of exactly how you take a real-world human and get something that’s fed into the neural net. If the input to the net is pixels, then how is the photo taken, and can that be manipulated?)
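To make the adversarial-selection worry concrete, here is a minimal sketch (my construction, not from the comment above). `HumanNet` is a hypothetical stand-in classifier, untrained here; the point is only the optimization loop, which does gradient ascent on the input to maximize the "human" score and tends to land on inputs that look nothing like humans:

```python
# Sketch: adversarially selecting an input that a learned "human detector"
# scores highly, without the input resembling a human at all.
import torch
import torch.nn as nn

class HumanNet(nn.Module):  # hypothetical stand-in: human-vs-not classifier
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.body(x)  # logit for "this is a human"

net = HumanNet().eval()
x = torch.rand(1, 3, 64, 64, requires_grad=True)  # start from random noise

opt = torch.optim.Adam([x], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = -net(x).sum()  # maximize the "human" logit
    loss.backward()
    opt.step()
    with torch.no_grad():
        x.clamp_(0, 1)  # keep the input a valid image

print("final 'human' logit:", net(x).item())  # high score, arbitrary-looking input
```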
Keep one's eyes open for opportunities, I guess, but it seems likely that the scope of formal verification will be extremely limited. I expect it would be most useful in computer security, where the conclusion, the "thing to be proven", is a statement about objects that have precise definitions. Though even then, the proofs might be too long and complex for humans to check by hand.
http://www.underhanded-c.org/ is a nice illustration of how serious misbehavior can be inserted into innocuous-looking, relatively short snippets of programs (i.e. mathematically precise statements).
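In the same spirit, here is a toy analogue in Python rather than C (my example, not from the contest): a function that looks like an authorization check but silently approves everyone, because of a classic precedence slip:

```python
# Toy "underhanded" snippet: short, innocuous-looking, and wrong.
def is_authorized(user):
    # Intended: user == "alice" or user == "bob".
    # Actual: the expression evaluates to "bob" (truthy) for every other
    # user, so this function grants access to everyone.
    return user == "alice" or "bob"

print(is_authorized("mallory"))  # prints "bob", i.e. truthy: access granted
```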
To start off, this was just an example of verification being easier than generating the plan. There may be other paradigms under which we can verify a plan, but the general trend seems to be that, in almost any domain, verification is easier than generation.
Also to your points:
I think a start would be proving that the system truly follows some given goal. I don't think learning which arrangements of light patterns constitute a human is that hard for neural nets; this was demonstrated a decade ago.
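The claim that recognizing humans is a solved problem can be illustrated with an off-the-shelf detector. A minimal sketch using torchvision's pretrained Faster R-CNN (COCO class 1 is "person"; the random tensor is a stand-in for a real photo, and the weights are downloaded on first run):

```python
# Sketch: detecting people with a stock pretrained model.
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

image = torch.rand(3, 480, 640)  # stand-in for a real photo
with torch.no_grad():
    (pred,) = model([image])

person_scores = pred["scores"][pred["labels"] == 1]  # COCO label 1 = person
print("person detected:", bool((person_scores > 0.8).any()))
```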
The goal could be defined as: a group of typical rational humans would eventually agree that the world state is good, after considering all the evidence and given unlimited paper to write down their thoughts. This can obviously fail, but I think it's distinct from obvious paperclipping failures. I can still see failure modes here, but I don't see why they are certain failures.
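One hedged way to write that proposed goal down as a predicate (my notation, not anything from the thread):

```latex
% World state w is "good" iff an idealized panel H of rational humans,
% shown all the evidence E(w) and given unbounded deliberation time t,
% converges on approval.
\[
  \mathrm{Good}(w) \iff \lim_{t \to \infty} \mathrm{Judge}_H\bigl(E(w),\, t\bigr) = \mathrm{approve}
\]
```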
I also don't agree with the point about bizarre-looking inhuman objects. At worst I think it's an unknown unknown. For example, the "faciest" face, the input that maximally activates a face detector, looks like a human face.
In addition, the goal won't have a precise definition, because these concepts aren't precise, but we also can't rule out a neural net understanding some form of the goal. I'm also fairly sure there are mathematical frameworks that can deal with imprecise goals; QACI, for example, is attempting this.
I also don't see what a C program has to do with this. Proving 2+2=4 in math means 2+2=4, whereas a computer program is a set of instructions that you merely believe will produce some desired result.