dxu comments on Deep Deceptiveness

dxu 24 Mar 2023 21:37 UTC
Hence my point about poetry—combinatorial argument would rule out ML working at all, because space of working things is smaller than space of all things. That poetry, for which we also don’t have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is an evidence that abstraction space is more relevantly compact or that training lets us traverse it faster.

There is (on my model) a large disanalogy between writing poetry and avoiding deception—part of which I pointed out in the penultimate paragraph of my previous comment. See below:

And as for “write poetry”, it’s worth noting that this capability seems to have arisen as a consequence of a much more general training task (“predict the next token”), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.

AFAICT, this basically refutes the “combinatorial argument” for poetry being difficult to specify (while not doing the same for something like “deception”), since poetry is in fact not specified anywhere in the system’s explicit objective. (Meanwhile, the corresponding strategy for “deception”—wrapping it up in some outer objective—suffers from the issue of that outer objective being similarly hard to specify. In other words: part of the issue with the deception concept is not only that it’s a small target, but that it has a strange shape, which even prevents us from neatly defining a “convex hull” guaranteed to enclose it.)

However, perhaps the more relevantly disanalogous aspect (the part that I think more or less sinks the remainder of your argument) is that poetry is not something where getting it slightly wrong kills us. Even if it were the case that poetry is an “anti-natural” concept (in whatever sense you want that to mean), all that says is e.g. we might observe two different systems producing slightly different category boundaries—i.e. maybe there’s a “poem” out there consisting largely of what looks like unmetered prose, which one system classifies as “poetry” and the other doesn’t (or, plausibly, the same system gives different answers when sampled multiple times). This difference in edge case assessment doesn’t (mis)generalize to any kind of dangerous behavior, however, because poetry was never about reality (which is why, you’ll notice, even humans often disagree on what constitutes poetry).

This doesn’t mean that the system can’t write very poetic-sounding things in the meantime; it absolutely can. Also: a system trained on descriptions of deceptive behavior can, when prompted to generate examples of deceptive behavior, come up with perfectly admissible examples of such. The central core of the concept is shared across many possible generalizations of that concept; it’s the edge cases where differences start showing up. But—so long as the central core is there—a misgeneralization about poetry is barely a “misgeneralization” at all, so much as it is one more opinion in a sea of already-quite-different opinions about what constitutes “true poetry”. A “different opinion” about what constitutes deception, on the other hand, is quite likely to turn into some quite nasty behaviors as the system grows in capability—the edge cases there matter quite a bit more!

(Actually, the argument I just gave can be viewed as a concrete shadow of the “convex hull” argument I gave initially; what it’s basically saying is that learning “poetry” is like drawing a hypersphere around some sort of convex polytope, whereas learning about “deception” is like trying to do the same for an extremely spiky shape, with tendrils extending all over the place. You might capture most of the shape’s volume, but the parts of it you don’t capture matter!)
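A toy 2-D version of that picture can be sketched as follows. This is purely illustrative: each shape is described as a list of thin angular sectors around the origin, and the function names, the sector representation, and the specific numbers (8 tendrils, tips at radius 10, a square of circumradius 1) are all assumptions made up for the sketch, not anything derived from an actual learned concept. It compares the radius of a disk that captures 95% of a shape's area with the radius needed to enclose all of it.

```python
import math

def captured_fraction(sectors, radius):
    """Fraction of the shape's area lying inside a disk of the given radius.

    `sectors` is a list of (angular_width, reach) pairs: across `angular_width`
    radians, the shape extends from the origin out to distance `reach`.
    """
    total = sum(0.5 * w * r * r for w, r in sectors)
    inside = sum(0.5 * w * min(r, radius) ** 2 for w, r in sectors)
    return inside / total

def radius_capturing(sectors, fraction):
    """Smallest disk radius (found by bisection) capturing `fraction` of the area."""
    lo, hi = 0.0, max(r for _, r in sectors)
    for _ in range(60):
        mid = (lo + hi) / 2
        if captured_fraction(sectors, mid) < fraction:
            lo = mid
        else:
            hi = mid
    return hi

def square(n=3600):
    """A square with circumradius 1, approximated by n thin sectors."""
    width = 2 * math.pi / n
    inradius = math.cos(math.pi / 4)
    sectors = []
    for i in range(n):
        theta = (i + 0.5) * width
        # Angle from the nearest edge normal (normals at 0, 90, 180, 270 degrees).
        offset = (theta + math.pi / 4) % (math.pi / 2) - math.pi / 4
        sectors.append((width, inradius / math.cos(offset)))
    return sectors

def spiky(spikes=8, half_width=1e-4, tip=10.0):
    """A unit disk with a few very thin tendrils reaching out to `tip`."""
    spike_width = 2 * half_width
    body_width = 2 * math.pi - spikes * spike_width
    return [(body_width, 1.0)] + [(spike_width, tip)] * spikes

for name, shape in [("square", square()), ("spiky", spiky())]:
    r95 = radius_capturing(shape, 0.95)
    r_all = max(r for _, r in shape)
    print(f"{name}: 95% of area within r={r95:.3f}, 100% needs r={r_all:.3f}")
```

Under these assumptions, a disk of radius just under 1 already captures 95% of the spiky shape's area even though its tips sit at radius 10, whereas for the square the 95% radius is within roughly 15% of the radius that encloses everything. The numbers mean nothing in themselves; the point is only that "most of the volume" and "all of the shape" can come apart badly for one kind of shape and barely at all for the other.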

These biases are quite robust to perturbations, so they can’t be too precise. And genes are not long enough to encode something too unnatural. And we have billions of examples to help us reverse engineer it. And we already have similar in some ways architecture working.

I’m not really able to extract a broader point out of this paragraph, sorry. These sentences don’t seem very related to each other? Mostly, I think I just want to take each sentence individually and see what comes out of it.
- “These biases are quite robust to perturbations, so they can’t be too precise.” I don’t think there’s good evidence for this either way; humans are basically all trained “on-distribution”, so to speak. We don’t have observations for what happens in the case of “large” perturbations (that don’t immediately lead to death or otherwise life-impairing cognitive malfunction). Also, even on-distribution, I don’t know that I’d describe the resulting behavior as “robust” (see below).
- “And genes are not long enough to encode something too unnatural.” Sure—which is why genes don’t encode things like “don’t deceive others”; instead, they encode proxy emotions like empathy and social reciprocation—which in turn break all the time, for all sorts of reasons. Doesn’t seem like a good model to emulate!
- “And we have billions of examples to help us reverse engineer it.” Billions of examples of what? Reverse engineer what? Again, in the vein of my previous requests: I’d like to see some concreteness, here. There’s a lot of work you’re hiding inside of those abstract-sounding phrases.
- “And we already have similar in some ways architecture working.” I think I straightforwardly don’t know what this is referring to, sorry. Could you give an example or three?
On the whole, my response to this part of your comment is probably best described as “mildly bemused”, with maybe a side helping of “gently skeptical”.

Why AI caring about diamondoid-shelled bacterium is plausible? You can say pretty much the same things about how AI would learn reflexes to not think about bad consequences, or something. Misgeneralization of capabilities actually happens in practice. But if you assume previous training could teach that, why not assume 10 times more honesty training before the time AI thought about translating technique got it thinking “well, how I’m going to explain this to operators?”. Otherwise you just moving your assumption about combinatorial differences from intuition to the concrete example and then what’s the point?

I think (though I’m not certain) that what you’re trying to say here is that the same arguments I made for “deceiving the operators” being a hard thing to train out of a (sufficiently capable) system, double as arguments against the system acquiring any advanced capabilities (e.g. engineering diamondoid-shelled bacteria) at all. In which case: I… disagree? These two things—not being deceptive vs being good at engineering—seem like two very different targets with vastly different structures, and it doesn’t look to me like there’s any kind of thread connecting the two.

(I should note that this feels quite similar to the poetry analogy you made—which also looks to me like it simply presented another, unrelated task, and then declared by fiat that learning this task would have strong implications for learning the “avoid deception” task. I don’t think that’s a valid argument, at least without some more concrete reason for expecting these tasks to share relevant structure.)

As for “10 times more honesty training”, well: it’s not clear to me how that would work in practice. I’ve already argued that it’s not as simple as just giving the AI more examples of honesty or increasing the weight of honesty-related data points; you can give it all the data in the world, but if that data is all drawn from an impoverished distribution, it’s not going to help much. The main issue here isn’t the quantity of training data, but rather the structure of the training process and the kind of data the system needs in order to learn the concept of deception and the injunction against it in a way that doesn’t break as it grows in capability.

To use a rough analogy: you can’t teach someone to be fluent in a foreign language just by exposing them to ten times more examples of a single sentence. Similarly, simply giving an AI more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.
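The "more data from an impoverished distribution" point can be made concrete with a deliberately tiny sketch. Everything here is a made-up assumption for illustration: the target concept (flag x whenever |x| > 1), the one-sided threshold learner, and the two sampling ranges. It is a cartoon of the failure mode, not a model of honesty training.

```python
import random

def true_label(x):
    """Hypothetical target concept: flagged whenever |x| exceeds 1."""
    return abs(x) > 1.0

def fit_threshold(xs, ys):
    """Fit a one-sided rule 'flag if x > t', choosing the best t among the samples."""
    best_t, best_correct = None, -1
    for t in sorted(xs):
        correct = sum((x > t) == y for x, y in zip(xs, ys))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def accuracy(t, xs):
    """Share of points in xs where the rule 'flag if x > t' matches the true concept."""
    return sum((x > t) == true_label(x) for x in xs) / len(xs)

rng = random.Random(0)
narrow_test = [rng.uniform(0.0, 2.0) for _ in range(5000)]    # same range as training
shifted_test = [rng.uniform(-2.0, 0.0) for _ in range(5000)]  # range never seen in training

results = {}
for n in (100, 1000):
    # Training data comes only from the narrow range [0, 2], however much of it there is.
    train = [rng.uniform(0.0, 2.0) for _ in range(n)]
    t = fit_threshold(train, [true_label(x) for x in train])
    results[n] = (accuracy(t, narrow_test), accuracy(t, shifted_test))
    print(f"n={n}: in-range accuracy={results[n][0]:.3f}, shifted accuracy={results[n][1]:.3f}")
```

In this toy setup, ten times more training data sharpens the threshold inside the range the data came from, but the shifted-range accuracy stays at roughly chance and is identical for both training sizes, because the learned rule never flags anything below zero. The extra examples were never the bottleneck; the range they were drawn from was.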

(And, just to state the obvious: while a superintelligence would be capable of figuring out for itself what the humans were trying to do with those simplistic deception-predicates they fed it, by that point it would be far too late; that understanding would not factor into the AI’s decision-making process, as its drives would have already been shaped by its earlier training and generalization experiences. In other words, it’s not enough for the AI to understand human intentions after the fact; it needs to learn and internalize those intentions during its training process, so that they form the basis for its behavior as it becomes more capable.)

Anyway, since this comment has become quite long, here’s a short (ChatGPT-assisted) summary of the main points:
1. The combinatorial argument for poetry does not translate directly to the problem of avoiding deception. Poetry and deception are different concepts, with different structures and implications, and learning one doesn’t necessarily inform us about the difficulty of learning the other.
2. Misgeneralizations about poetry are not dangerous in the same way that misgeneralizations about deception might be. Poetry is a more subjective concept, and differences in edge case assessment do not lead to dangerous behavior. On the other hand, differing opinions on what constitutes deception can lead to harmful consequences as the system’s capabilities grow.
3. The issue with learning to avoid deception is not about the quantity of training data, but rather about the structure of the training process and the kind of data needed for the AI to learn and internalize the concept in a way that remains stable as it increases in capability.
4. Simply providing more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.