So, I do think there’s an asymmetry here, in that I mostly expect “avoid deception” is a less natural category than “invent nanotech”, and is correspondingly a (much) smaller target in goal-space—one (much) harder to hit using only crude tools like reward functions based on simple observables.
Has it been quantitatively argued somewhere at all why such naturalness matters? Like, it’s conceivable that “avoid deception” is harder to train, but why so much harder that we can’t overcome this with training data bias or something? Because it does work in humans. And “invent nanotech” or “write poetry” are also small targets and training works for them.
Has it been quantitatively argued somewhere at all why such naturalness matters?
I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast—it’s literally the powerset of the set of things under consideration, which itself is no slouch in terms of size—and so picking out a particular abstraction from that space, using a non-combinatorially vast amount of training data, is going to be impossible for all but a superminority of “privileged” abstractions.
In this frame, misgeneralization is what happens when your (non-combinatorially vast) training data fails to specify a particular concept, and you end up learning an alternative abstraction that is consistent with the data, but doesn’t generalize as expected. This is why naturalness matters: because the more “natural” a category or abstraction is, the more likely it is to be one of those privileged abstractions that can be learned from a relatively small amount of data.
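(If it helps, here’s a toy numerical sketch of that combinatorial point, in Python; the features and labels are made up purely for illustration. A “concept” over n binary features is any subset of the 2^n possible inputs, so there are 2^(2^n) concepts in total, and a small labeled sample leaves an enormous version space of concepts that all fit the data:)

```python
from itertools import product

n = 4                                      # binary features
inputs = list(product([0, 1], repeat=n))   # 2**4 = 16 possible inputs

# A "concept" (abstraction) is any subset of the input space, i.e. one
# yes/no bit per input, so there are 2**16 = 65536 concepts over a
# mere 4 features; the count is doubly exponential in n.
num_concepts = 2 ** len(inputs)

# Suppose the training data labels only 5 of the 16 inputs
# (arbitrary labels, purely for illustration).
labeled = {inputs[i]: i % 2 for i in range(5)}

# Every concept that agrees on the 5 labeled inputs is consistent with
# the data; each unlabeled input can go either way.
consistent = 2 ** (len(inputs) - len(labeled))

print(num_concepts, consistent)  # 65536 concepts, 2048 still consistent
```

The same arithmetic at realistic scales is what makes “pick out the intended abstraction from data alone” hopeless for all but privileged concepts: each labeled bit only halves the version space, and there are combinatorially many bits to pin down.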
Of course, that doesn’t establish that “deceptive behavior” is an unnatural category per se—but I would argue that our inability to pin down a precise definition of deceptive behavior, along with the complexity and context-dependency of the concept, suggests that it may not be one of those privileged, natural abstractions. In other words, learning to avoid deceptive behavior might require a lot more data and nuanced understanding than learning more natural categories—and unfortunately, neither of those seem (to me) to be very easily achievable!
(See also: previous comment. :-P)
Like, it’s conceivable that “avoid deception” is harder to train, but why so much harder that we can’t overcome this with training data bias or something?
Having read my above response, it should (hopefully) be fairly predictable what I’m going to say here. The bluntest version of my response might take the form of a pair of questions: whence the training data? And whence the bias?
It’s all well and good to speak abstractly of “inductive bias”, “training data bias”, and whatnot, but part of the reason I keep needling for concreteness is that, whenever I try to tell myself a story about a version of events where things go well—where you feed the AI a merely ordinarily huge amount of training data, and not a combinatorially huge amount—I find myself unable to construct a plausible story that doesn’t involve some extremely strong assumptions about the structure of the problem.
The best I can manage, in practice, is to imagine a reward function with simplistic deception-predicates hooked up to negative rewards, which basically zaps the system every time it thinks a thought matching one (or more) of the predicates. But as I noted in my previous comment(s), all this approach seems likely to achieve is instilling a set of “flinch-like” reflexes into the system—and I think such reflexes are unlikely to unfold into any kind of reflectively stable (“ego-syntonic”, in Steven’s terms) desire to avoid deceptive/manipulative behavior.
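(For concreteness, here is roughly the kind of thing I’m imagining; the predicate list and penalty size are entirely hypothetical, a caricature rather than anyone’s actual proposal:)

```python
# A caricature of "deception-predicates hooked up to negative reward".
# The pattern list and penalty size are hypothetical, chosen only to
# show the failure mode, not taken from any real system.
DECEPTION_PATTERNS = [
    "they don't need to know",
    "hide this from",
    "make them believe",
]

def shaped_reward(base_reward: float, thought: str) -> float:
    """Zap the system whenever a surface-level predicate matches."""
    text = thought.lower()
    penalty = sum(10.0 for p in DECEPTION_PATTERNS if p in text)
    return base_reward - penalty

# The predicates fire on surface form, not on intent: an innocuous
# sentence that mentions hiding gets zapped, while a genuinely
# deceptive plan phrased differently sails through unpunished.
print(shaped_reward(1.0, "We should hide this from no one"))       # -9.0
print(shaped_reward(1.0, "Report only the favorable benchmarks"))  # 1.0
```

All this seems likely to train, as far as I can tell, is a flinch away from the literal patterns; the concept “deception” is nowhere in the gradient.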
Because it does work in humans.
Yeah, I mostly think this is because humans come with the attendant biases “built in” to their prior. (But also: even with this, humans don’t reliably avoid deceiving other humans!)
And “invent nanotech” or “write poetry” are also small targets and training works for them.
Well, notably not “invent nanotech” (not yet, anyway :-P). And as for “write poetry”, it’s worth noting that this capability seems to have arisen as a consequence of a much more general training task (“predict the next token”), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.
(Situating “avoid deception” as part of a larger task, meanwhile, seems like a harder ask.)
I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast
Hence my point about poetry—the combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. That poetry, for which we also don’t have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact, or that training lets us traverse it faster.
Yeah, I mostly think this is because humans come with the attendant biases “built in” to their prior.
These biases are quite robust to perturbations, so they can’t be too precise. And genes are not long enough to encode something too unnatural. And we have billions of examples to help us reverse engineer it. And we already have a working architecture that is similar in some ways.
It’s all well and good to speak abstractly of “inductive bias”, “training data bias”, and whatnot, but part of the reason I keep needling for concreteness is that, whenever I try to tell myself a story about a version of events where things go well—where you feed the AI a merely ordinarily huge amount of training data, and not a combinatorially huge amount—I find myself unable to construct a plausible story that doesn’t involve some extremely strong assumptions about the structure of the problem.
Why is an AI caring about diamondoid-shelled bacteria plausible? You can say pretty much the same things about how the AI would learn reflexes to not think about bad consequences, or something. Misgeneralization of capabilities actually happens in practice. But if you assume previous training could teach that, why not assume 10 times more honesty training happened before the point where the AI, thinking about the translating technique, got to “well, how am I going to explain this to the operators?”? Otherwise you’re just moving your assumption about combinatorial differences from intuition to the concrete example, and then what’s the point?
Hence my point about poetry—the combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. That poetry, for which we also don’t have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact, or that training lets us traverse it faster.
There is (on my model) a large disanalogy between writing poetry and avoiding deception—part of which I pointed out in the penultimate paragraph of my previous comment. See below:
And as for “write poetry”, it’s worth noting that this capability seems to have arisen as a consequence of a much more general training task (“predict the next token”), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.
AFAICT, this basically refutes the “combinatorial argument” for poetry being difficult to specify (while not doing the same for something like “deception”), since poetry is in fact not specified anywhere in the system’s explicit objective. (Meanwhile, the corresponding strategy for “deception”—wrapping it up in some outer objective—suffers from the issue of that outer objective being similarly hard to specify. In other words: part of the issue with the deception concept is not only that it’s a small target, but that it has a strange shape, which even prevents us from neatly defining a “convex hull” guaranteed to enclose it.)
However, perhaps the more relevantly disanalogous aspect (the part that I think more or less sinks the remainder of your argument) is that poetry is not something where getting it slightly wrong kills us. Even if it were the case that poetry is an “anti-natural” concept (in whatever sense you mean by that), all that implies is that, e.g., we might observe two different systems producing slightly different category boundaries—i.e. maybe there’s a “poem” out there consisting largely of what looks like unmetered prose, which one system classifies as “poetry” and the other doesn’t (or, plausibly, the same system gives different answers when sampled multiple times). This difference in edge case assessment doesn’t (mis)generalize to any kind of dangerous behavior, however, because poetry was never about reality (which is why, you’ll notice, even humans often disagree on what constitutes poetry).
This doesn’t mean that the system can’t write very poetic-sounding things in the meantime; it absolutely can. Also: a system trained on descriptions of deceptive behavior can, when prompted to generate examples of deceptive behavior, come up with perfectly admissible examples of such. The central core of the concept is shared across many possible generalizations of that concept; it’s the edge cases where differences start showing up. But—so long as the central core is there—a misgeneralization about poetry is barely a “misgeneralization” at all, so much as it is one more opinion in a sea of already-quite-different opinions about what constitutes “true poetry”. A “different opinion” about what constitutes deception, on the other hand, is quite likely to turn into some quite nasty behaviors as the system grows in capability—the edge cases there matter quite a bit more!
(Actually, the argument I just gave can be viewed as a concrete shadow of the “convex hull” argument I gave initially; what it’s basically saying is that learning “poetry” is like drawing a hypersphere around some sort of convex polytope, whereas learning about “deception” is like trying to do the same for an extremely spiky shape, with tendrils extending all over the place. You might capture most of the shape’s volume, but the parts of it you don’t capture matter!)
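(A crude numerical version of that picture, with toy 2-D “concepts” invented purely for the sketch, and the convex cover approximated by a bounding box: the question is how much of a simple convex cover is “edge case”, i.e. disagrees with the concept it’s supposed to capture.)

```python
# Toy 2-D sketch. Both "concepts" and the convex cover (a bounding
# box over [-1, 1]^2) are invented purely for illustration.

def disk(x, y):
    """A compact, convex-ish concept: the unit disk."""
    return x * x + y * y <= 1.0

def plus_sign(x, y):
    """A "spiky" concept: two thin crossed bars."""
    return (abs(x) <= 0.15 and abs(y) <= 1.0) or \
           (abs(y) <= 0.15 and abs(x) <= 1.0)

def cover_error(member, n=200):
    """Fraction of the bounding-box cover that the cover claims but
    the concept does not actually contain (grid approximation)."""
    pts = [(-1 + 2 * i / n, -1 + 2 * j / n)
           for i in range(n + 1) for j in range(n + 1)]
    inside = sum(1 for x, y in pts if member(x, y))
    return 1 - inside / len(pts)

print(f"disk: {cover_error(disk):.2f}")       # ~0.22 of the cover is wrong
print(f"plus: {cover_error(plus_sign):.2f}")  # ~0.72 of the cover is wrong
```

The simple cover disagrees with the convex-ish concept on roughly a fifth of its volume, but with the spiky concept on nearly three quarters of it; and for deception, unlike poetry, the disagreement region is exactly where the nasty surprises live.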
These biases are quite robust to perturbations, so they can’t be too precise. And genes are not long enough to encode something too unnatural. And we have billions of examples to help us reverse engineer it. And we already have a working architecture that is similar in some ways.
I’m not really able to extract a broader point out of this paragraph, sorry. These sentences don’t seem very related to each other? Mostly, I think I just want to take each sentence individually and see what comes out of it.
“These biases are quite robust to perturbations, so they can’t be too precise.” I don’t think there’s good evidence for this either way; humans are basically all trained “on-distribution”, so to speak. We don’t have observations for what happens in the case of “large” perturbations (that don’t immediately lead to death or otherwise life-impairing cognitive malfunction). Also, even on-distribution, I don’t know that I’d describe the resulting behavior as “robust”—see below.
“And genes are not long enough to encode something too unnatural.” Sure—which is why genes don’t encode things like “don’t deceive others”; instead, they encode proxy emotions like empathy and social reciprocation—which in turn break all the time, for all sorts of reasons. Doesn’t seem like a good model to emulate!
“And we have billions of examples to help us reverse engineer it.” Billions of examples of what? Reverse engineer what? Again, in the vein of my previous requests: I’d like to see some concreteness, here. There’s a lot of work you’re hiding inside of those abstract-sounding phrases.
“And we already have a working architecture that is similar in some ways.” I think I straightforwardly don’t know what this is referring to, sorry. Could you give an example or three?
On the whole, my response to this part of your comment is probably best described as “mildly bemused”, with maybe a side helping of “gently skeptical”.
Why is an AI caring about diamondoid-shelled bacteria plausible? You can say pretty much the same things about how the AI would learn reflexes to not think about bad consequences, or something. Misgeneralization of capabilities actually happens in practice. But if you assume previous training could teach that, why not assume 10 times more honesty training happened before the point where the AI, thinking about the translating technique, got to “well, how am I going to explain this to the operators?”? Otherwise you’re just moving your assumption about combinatorial differences from intuition to the concrete example, and then what’s the point?
I think (though I’m not certain) that what you’re trying to say here is that the same arguments I made for “deceiving the operators” being a hard thing to train out of a (sufficiently capable) system, double as arguments against the system acquiring any advanced capabilities (e.g. engineering diamondoid-shelled bacteria) at all. In which case: I… disagree? These two things—not being deceptive vs being good at engineering—seem like two very different targets with vastly different structures, and it doesn’t look to me like there’s any kind of thread connecting the two.
(I should note that this feels quite similar to the poetry analogy you made—which also looks to me like it simply presented another, unrelated task, and then declared by fiat that learning this task would have strong implications for learning the “avoid deception” task. I don’t think that’s a valid argument, at least without some more concrete reason for expecting these tasks to share relevant structure.)
As for “10 times more honesty training”, well: it’s not clear to me how that would work in practice. I’ve already argued that it’s not as simple as just giving the AI more examples of honesty or increasing the weight of honesty-related data points; you can give it all the data in the world, but if that data is all drawn from an impoverished distribution, it’s not going to help much. The main issue here isn’t the quantity of training data, but rather the structure of the training process and the kind of data the system needs in order to learn the concept of deception and the injunction against it in a way that doesn’t break as it grows in capability.
To use a rough analogy: you can’t teach someone to be fluent in a foreign language just by exposing them to ten times more examples of a single sentence. Similarly, simply giving an AI more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.
(And, just to state the obvious: while a superintelligence would be capable, at that point, of figuring out for itself what the humans were trying to do with those simplistic deception-predicates they fed it, by that point it would be significantly too late; that understanding would not factor into the AI’s decision-making process, as its drives would have already been shaped by its earlier training and generalization experiences. In other words, it’s not enough for the AI to understand human intentions after the fact; it needs to learn and internalize those intentions during its training process, so that they form the basis for its behavior as it becomes more capable.)
Anyway, since this comment has become quite long, here’s a short (ChatGPT-assisted) summary of the main points:
The combinatorial argument for poetry does not translate directly to the problem of avoiding deception. Poetry and deception are different concepts, with different structures and implications, and learning one doesn’t necessarily inform us about the difficulty of learning the other.
Misgeneralizations about poetry are not dangerous in the same way that misgeneralizations about deception might be. Poetry is a more subjective concept, and differences in edge case assessment do not lead to dangerous behavior. On the other hand, differing opinions on what constitutes deception can lead to harmful consequences as the system’s capabilities grow.
The issue with learning to avoid deception is not about the quantity of training data, but rather about the structure of the training process and the kind of data needed for the AI to learn and internalize the concept in a way that remains stable as it increases in capability.
Simply providing more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.