> I expect the first alignment solution you can actually deploy in real life, in the unlikely event we get a solution at all, looks like 98% “don’t think about all these topics that we do not absolutely need and are adjacent to the capability to easily invent very dangerous outputs” and 2% “actually think about this dangerous topic but please don’t come up with a strategy inside it that kills us”.
Some ways that it’s hard to make a mind not think about certain things:
1. Entanglement.
1.1. Things are entangled with other things.
--Things are causally entangled. X causes Y, Y causes X, Z causes X and Y, X and Y cause Z and you’ve conditioned on Z; and chains of these.
--Things are constitutively entangled. A computer is a computer and is also usually wires, so thinking about computers entails thinking about wires, and vice versa.
--Things are telically entangled; X serves the purpose Y or vice versa, X and Y serve the same purpose, X and Y are served by the same Z.
--Things are logically entangled; the way X works may be like the way Y works, so understanding X also constitutes understanding Y to some extent.
1.2. When you think about X, and X is entangled with Y, you also think about Y.
2. Meta-level transfer. If you think about X, part of how you do that is by inventing and refining ways to think in general; even if you aren’t thinking about Y, this makes it much easier to later think about Y. (This is maybe just an extreme form of logical entanglement, but feels less “about the things themselves”.)
3. The need for pointers. As they say: Don’t think about pink elephants. To specify what it is that the AI is supposed to not think about, you have to say *something*; how do you point at the forbidden thoughts, in a way that’s robust across all contexts and all possible conceptual factorings of the domain, without already spinning up instances of the forbidden thoughts?
-------
Some ML experiments someone could run:
1. Train a NN A to recognize images, but at the same time train its weights so that its activations can’t be used to distinguish Xs from Ys (where X and Y are categories in the recognition task, say). That is, you train a NN B that, given an image x, takes A’s activations on x as input and tries to predict whether x is an X or a Y; then you update the weights of A along the gradient that decreases B’s accuracy at that task (freezing the weights of B for this step). When does this converge at all? Can you get it to converge so that A is SOTA on all inputs that aren’t Xs or Ys, and assigns Xs and Ys to X or Y randomly? A rough sketch of this setup follows below.
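A minimal PyTorch sketch of this setup. The architectures, the class indices `X_CLASS`/`Y_CLASS`, and the weighting `lam` are all placeholder assumptions, not a tuned implementation:

```python
# Sketch of experiment 1 (assumptions: 10-class image task, small conv nets,
# hypothetical class indices X_CLASS / Y_CLASS).
import torch
import torch.nn as nn
import torch.nn.functional as F

X_CLASS, Y_CLASS = 3, 5  # any two classes from the recognition task

class RecognizerA(nn.Module):
    """A: the recognizer whose activations we want scrubbed of X-vs-Y information."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):
        acts = self.features(x)          # the activations B gets to look at
        return self.head(acts), acts

A_net = RecognizerA()
B_net = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))  # X-vs-Y adversary
opt_A = torch.optim.Adam(A_net.parameters(), lr=1e-3)
opt_B = torch.optim.Adam(B_net.parameters(), lr=1e-3)

def train_step(images, labels, lam=1.0):
    logits, acts = A_net(images)
    task_loss = F.cross_entropy(logits, labels)
    xy_mask = (labels == X_CLASS) | (labels == Y_CLASS)

    adv_loss = torch.zeros((), device=images.device)
    if xy_mask.any():
        xy_target = (labels[xy_mask] == Y_CLASS).long()

        # (i) train B to read X-vs-Y off A's activations; A is detached here
        opt_B.zero_grad()
        F.cross_entropy(B_net(acts[xy_mask].detach()), xy_target).backward()
        opt_B.step()

        # (ii) re-run B with its weights frozen, so only A receives this gradient
        for p in B_net.parameters():
            p.requires_grad_(False)
        adv_loss = F.cross_entropy(B_net(acts[xy_mask]), xy_target)
        for p in B_net.parameters():
            p.requires_grad_(True)

    # train A on the task while *increasing* B's loss (the minus sign)
    opt_A.zero_grad()
    (task_loss - lam * adv_loss).backward()
    opt_A.step()
    return task_loss.item(), adv_loss.item()
```

The question is then whether this converges with A at baseline accuracy on the non-X/Y classes while B stays at chance on held-out Xs and Ys.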
2. Similar setup, but now B is just an ordinary X vs. Y classifier, and we train A so that its activations can’t be used to predict any of B’s activations.* Does A end up being able to distinguish Xs from Ys anyway? (Probably yes, and this speaks to the pointer problem: just saying “don’t think about stuff like such-and-so” (e.g. the activations of B) isn’t yet enough to actually not think about it.) A sketch follows the footnote below.
*Say, with a linear map, or whatever. Well, maybe we want to exclude the last layer of B or something, since that’s close to just training A to not be able to recognize X vs. Y.
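A sketch of this variant, reusing `RecognizerA`, `X_CLASS`, and `Y_CLASS` from the sketch above. Here B is assumed to be pretrained on X vs. Y and then frozen, and the adversary is a linear map from A’s activations to B’s penultimate activations (excluding B’s last layer, per the footnote). All names and sizes are assumptions:

```python
# Sketch of experiment 2; reuses RecognizerA, X_CLASS, Y_CLASS from the sketch above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifierB(nn.Module):
    """B: an ordinary X-vs-Y classifier, pretrained separately and then frozen."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 2)       # last layer, excluded per the footnote

    def forward(self, x):
        acts = self.features(x)            # the intermediate activations we target
        return self.head(acts), acts

A_net2 = RecognizerA()
B_net2 = ClassifierB()                     # assume pretrained on X vs. Y, then frozen:
B_net2.eval()
for p in B_net2.parameters():
    p.requires_grad_(False)

probe = nn.Linear(64, 32)                  # the linear map from A's acts to B's acts
opt_A2 = torch.optim.Adam(A_net2.parameters(), lr=1e-3)
opt_probe = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_step2(images, labels, lam=1.0):
    logits, a_acts = A_net2(images)
    with torch.no_grad():
        _, b_acts = B_net2(images)

    # (i) fit the linear probe: predict B's activations from A's (A detached)
    opt_probe.zero_grad()
    F.mse_loss(probe(a_acts.detach()), b_acts).backward()
    opt_probe.step()

    # (ii) train A on the full task while making the (frozen) probe's job harder
    for p in probe.parameters():
        p.requires_grad_(False)
    probe_loss = F.mse_loss(probe(a_acts), b_acts)
    for p in probe.parameters():
        p.requires_grad_(True)

    opt_A2.zero_grad()
    (F.cross_entropy(logits, labels) - lam * probe_loss).backward()
    opt_A2.step()

# Afterwards: freeze A_net2 and fit a fresh linear X-vs-Y probe on its activations,
# to test whether the X/Y distinction survived this indirect anti-training.
```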
3. Train A to recognize all the images, except train it (in some way) to not be able to distinguish Xs from Ys. Now see how much additional fine-tuning is needed to train this trained A to predict Xs and Ys (now without the anti-training), as sketched below. Entanglement predicts that not much further training is needed.
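A sketch of the measurement, assuming an A already anti-trained as in experiment 1 or 2; the learning rate, accuracy target, and convergence check are placeholders:

```python
# Sketch of experiment 3; reuses X_CLASS / Y_CLASS and an anti-trained A from above.
import copy
import torch
import torch.nn.functional as F

def steps_to_recover_xy(anti_trained_A, loader, target_acc=0.9, max_steps=5000):
    """Fine-tune a copy of the anti-trained A on plain X/Y batches (no adversary)
    and return how many gradient steps it takes to reach target_acc on X vs. Y."""
    model = copy.deepcopy(anti_trained_A)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    steps = 0
    for images, labels in loader:
        xy_mask = (labels == X_CLASS) | (labels == Y_CLASS)
        if not xy_mask.any():
            continue
        logits, _ = model(images[xy_mask])
        loss = F.cross_entropy(logits, labels[xy_mask])
        opt.zero_grad(); loss.backward(); opt.step()
        steps += 1
        acc = (logits.argmax(dim=1) == labels[xy_mask]).float().mean().item()
        if acc >= target_acc:              # crude: use a held-out running average in practice
            return steps
        if steps >= max_steps:
            return None                    # never recovered within the budget
    return None

# Compare steps_to_recover_xy(anti_trained_A, ...) against the same measurement on an A
# without the anti-training; entanglement predicts the anti-trained A needs few further
# steps, because the structure needed to tell Xs from Ys is still mostly there.
```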