Various Alignment Strategies (and how likely they are to work)
Note: the following essay is very much my opinion. Should you trust my opinion? Probably not too much. Instead, just record it as a data point of the form “this is what one person with a background in formal mathematics and cryptography, who has been doing machine learning on real-world problems for over a decade, thinks.” Depending on your opinion of the relevance of math and cryptography, and of the importance of using machine learning “in anger” (to solve real-world problems), that may or may not be a useful data point.
So, without further ado: A list of possible alignment strategies (and how likely they are to work)
Edit (05/05/2022): Added “Tool AIs” section, and polls.
Formal Mathematical Proof
This refers to a whole class of alignment strategies where you define (in a formal mathematical sense) a set of properties you would like an aligned AI to have, and then you mathematically prove that an AI architected in a certain way possesses these properties.
For example, you may want an AI with a stop button, so that humans can always turn it off if it goes rogue. Or you may want an AI that will never convert more than 1% of the Earth’s surface into computronium. So long as a property can be defined in a formal mathematical sense, you can imagine writing a formal proof that a certain type of system will never violate that property.
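As a toy illustration of what “formally defined” means here, consider a stop-button property stated and proven in Lean. This sketch is invented for the example, not taken from any real alignment proposal; the `AgentState` type and `step` function are made up:

```lean
-- Toy world: an agent is either doing work or halted. The names
-- `AgentState` and `step` are hypothetical, invented for this sketch.
inductive AgentState where
  | running (work : Nat)
  | halted

-- One step of the agent's execution, parameterized by whether the
-- stop button is currently pressed.
def step : Bool → AgentState → AgentState
  | _,     AgentState.halted    => AgentState.halted
  | true,  AgentState.running _ => AgentState.halted
  | false, AgentState.running w => AgentState.running (w + 1)

-- The stop-button property: whenever the button is pressed, the
-- very next step lands in `halted`, whatever the agent was doing.
theorem stop_button_works (s : AgentState) :
    step true s = AgentState.halted := by
  cases s <;> rfl
```

The proof is airtight, but notice what it is about: the function `step`, not the deployed system. Whether reality (hardware, training, operators) actually behaves like `step` is an assumption, and that is where the trouble starts.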
How likely is this to work?
Not at all. It won’t work.
There is an aphorism in the field of cryptography: Any cryptographic system formally proven to be secure… isn’t.
The problem is that when attempting to formally define a system, you will make assumptions, and sooner or later one of those assumptions will turn out to be wrong. The one-time pad turns out to be a two-time pad. Black boxes turn out to have side channels. That kind of thing. Formal proofs never ever work out in the real world. The exception that proves the rule is, of course, P=NP: all cryptographic systems (other than the one-time pad) rely on the assumption that P!=NP, and that assumption is famously unproven.
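To make the one-time-pad failure concrete, here is a minimal sketch in Python (the messages are made up). The scheme is provably secure, but the proof assumes the pad is used once; reuse it and the pad cancels out, handing an eavesdropper the XOR of the two plaintexts:

```python
import os

def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

pad = os.urandom(16)                    # the "one-time" pad
c1 = xor(b"attack at dawn!!", pad)      # first use: fine
c2 = xor(b"retreat at once!", pad)      # second use: the fatal mistake

# The eavesdropper never sees the pad, yet XORing the two
# ciphertexts cancels it out completely:
leak = xor(c1, c2)
assert leak == xor(b"attack at dawn!!", b"retreat at once!")
```

From the plaintext XOR, standard crib-dragging recovers both messages. The proof of perfect secrecy was correct; one of its assumptions just quietly stopped holding.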
There is an additional problem: competition. All of the fancy formal-proof stuff tends to make computers much slower. For example, fully homomorphic encryption is millions of times slower than just computing on raw data. So if two people are trying to build an AI and one of them is relying on formal proofs, the other person is going to finish first, and with a much more powerful AI to boot.
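For a rough sense of that overhead, here is an illustrative micro-benchmark. This is not real FHE (no noise management, no bootstrapping, and the modulus below is a made-up stand-in, not a generated key); it is a Paillier-style additively homomorphic addition, which is the cheap end of computing under encryption:

```python
import random
import timeit

# Made-up 2048-bit modulus, standing in for a real public key.
n = (1 << 2048) - 159
n2, g = n * n, n + 1

def enc(m: int) -> int:
    # Paillier-style encryption: g^m * r^n mod n^2. The r^n term is
    # a ~2048-bit modular exponentiation; that is where the time goes.
    r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def homomorphic_add(m1: int, m2: int) -> int:
    # Addition under encryption = multiplying ciphertexts mod n^2.
    return (enc(m1) * enc(m2)) % n2

a, b = 42, 58
t_plain = timeit.timeit(lambda: a + b, number=100_000) / 100_000
t_homo = timeit.timeit(lambda: homomorphic_add(a, b), number=5) / 5
print(f"per-addition slowdown: ~{t_homo / t_plain:,.0f}x")
```

On ordinary hardware the printed ratio should land in the millions, and genuine FHE schemes, which also support multiplication and need periodic bootstrapping, pay considerably more than this toy does.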
Poll