It seems to me that the surprising simplicity of current-generation ML algorithms is a big part of the problem.
As a thought experiment: suppose you had a human brain, with the sort of debug access you’d have with a neural net; i.e., you could see all the connections, edge weights, and firings, and had a decent multiple of the compute the brain has. Could you extract something like a verbal inner monologue, a text stream that was strongly predictive of that human’s plans? I don’t think it would be trivial, but my guess is that you could. It wouldn’t hold up against a meditator optimizing against you, but it would be a solid starting point.
Could you do the same thing to GPT-3? No; you can’t get language out of it that predicts its plans, because it doesn’t have plans. Could you do the same thing to AlphaZero? No, you can’t get language out of it that predicts its plans, because it doesn’t use language.
This analogy makes me think neural net transparency might not be as doomed as the early results would suggest; they aren’t finding human-legible low-dimensional representations of things because those representations aren’t present (GPT-3) or have nothing human-legible to match up to (AlphaZero).
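To put the thought experiment in ML terms: with full debug access to an artificial network, the analogous move is to train a probe on logged activations and ask how well it predicts subsequent behavior. Below is a minimal, purely illustrative sketch in Python/NumPy; the activations, labels, and shapes are all made-up stand-ins, and a real attempt would use recordings from an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "debug access": n logged snapshots of a d-dimensional hidden state,
# each paired with one of k coarse labels for what the agent did next.
n, d, k = 5000, 512, 8
activations = rng.normal(size=(n, d))        # stand-in for recorded hidden states
next_action = rng.integers(0, k, size=n)     # stand-in for observed behavior

# A linear ridge-regression "probe" from hidden state to one-hot action targets.
targets = np.eye(k)[next_action]
lam = 1e-2
W = np.linalg.solve(activations.T @ activations + lam * np.eye(d),
                    activations.T @ targets)  # (d, k) readout weights

# On random data this sits at chance (~1/k); the interesting question is how far
# above chance a probe gets on real activations.
pred = (activations @ W).argmax(axis=1)
print("probe accuracy:", (pred == next_action).mean())
```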
In a human mind, a lot of cognition is happening in diffuse illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I’d feel a lot better about where ML was going if some of the steps in their cognition looked like low-bandwidth token streams, rather than giant vectors. This isn’t by itself sufficient for alignment, of course, but it’d make the problem look a lot more tractable.
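As one purely hypothetical illustration of what it could mean architecturally for a step of cognition to pass through a low-bandwidth token stream: force one stage’s giant vector through a small discrete vocabulary before the next stage sees it, and log the chosen token. The sketch below is not any existing system; every name, size, and weight matrix is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 256, 64   # hypothetical sizes

W_enc = 0.02 * rng.normal(size=(d_model, vocab_size))        # giant vector -> token logits
token_embed = 0.02 * rng.normal(size=(vocab_size, d_model))  # token -> vector for the next stage

def bottleneck_step(hidden_state: np.ndarray) -> tuple[int, np.ndarray]:
    """Quantize a high-dimensional state down to one discrete token, then re-embed it.

    Downstream computation only sees the token's embedding, so logging the token id
    gives a (very) low-bandwidth, human-inspectable trace of what this step passed on.
    """
    logits = hidden_state @ W_enc
    token_id = int(logits.argmax())   # hard, discrete choice
    return token_id, token_embed[token_id]

# Toy usage: push a random "thought vector" through the bottleneck and log the token.
h = rng.normal(size=d_model)
tok, h_next = bottleneck_step(h)
print("logged token id:", tok)
```

A real design would presumably pass many tokens per step and train through the discretization (straight-through estimators or similar), but the point of the sketch is the logging property, not the training details.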
I’m not sure whether humans having an inner monologue that looks like the language we trained on and predicts our future behavior is an incidental fact about humans, or a convergent property of intelligent systems that get most of their information from language, or a convergent property of all intelligent systems, or something that would require deliberate architecture choices to make happen, or something that we won’t be able to make happen even with deliberate architecture choices. From my current state of knowledge, none of these would surprise me much.
In a human mind, a lot of cognition is happening in diffuse illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I’d feel a lot better about where ML was going if some of the steps in their cognition looked like low-bandwidth token streams, rather than giant vectors.
I have an optional internal monologue, and programming or playing strategy games is usually a non-verbal exercise.
I’m sure you could in principle (though not as described!) map neuron firings to a strongly predictive text stream regardless, but I don’t think that text stream would be me. And the same intuition says it would be possible for MuZero; this is about the expressiveness of text rather than about monologue being a key component of cognition or identity. Conversely, I would expect this to go terribly wrong when the tails come apart, because we’re talking about correlates rather than causal structures, with all the usual problems.
I don’t think the verbal/pre-verbal stream of consciousness that describes our behavior to ourselves is identical with ourselves. But I do think our brain exploits it to exert feedback on its unconscious behavior, and that’s a large part of how our morality works. So maybe this is still relevant for AI safety.
Wish granted!