Sure. Let’s assume an AI that uses model-based RL of a similar flavor as (I believe) is used in human brains.
Step 1: The thought “I am touching the hot stove” becomes aversive because it’s what I was thinking when I touched the hot stove, which caused pain etc. For details see my discussion of “credit assignment” here.
Step 2A: The thought “I desire to touch the hot stove” also becomes aversive because of its intimate connection to “I am touching the hot stove” from Step 1 above—i.e., in reality, if I desire to touch the hot stove, then it’s much more likely that I will in fact touch the hot stove, and my internal models are sophisticated enough to have picked up on that fact.
Mechanistically, this can happen in either the “forward direction” (when I think “I desire to touch the hot stove”, my brain’s internal models explore the consequences, and that weakly activates “I am touching the hot stove” neurons, which in turn trigger aversion), or in the “backward direction” (involving credit assignment & TD learning, see the diagram about habit-formation here).
Anyway, if the thought “I desire to touch the hot stove” indeed becomes aversive, that’s a kind of meta-preference—a desire not to have a desire.
Step 2B: Conversely and simultaneously, the thought “I desire to not touch the hot stove” becomes appealing for basically similar reasons, and that’s another meta-preference (desire to have a desire).
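To make the "backward direction" from Step 2A a bit more concrete, here is a minimal toy sketch using tabular TD(0). It is only an illustration under stated assumptions, not a claim about how the brain (or any particular AGI) implements this: the two-thought chain, the state names, the pain value of -1, the learning rate, and the discount factor are all made up for the example.

```python
# Toy TD(0) sketch (illustrative assumptions throughout):
# the thought "desire_to_touch" is reliably followed by "touching",
# and "touching" is followed by a painful outcome.
STATES = ["desire_to_touch", "touching", "end"]


def run_td0(episodes=200, alpha=0.1, gamma=0.9, pain=-1.0):
    """Return learned values for each thought after repeated episodes."""
    value = {state: 0.0 for state in STATES}
    # Each step is (current thought, reward received, next thought).
    trajectory = [
        ("desire_to_touch", 0.0, "touching"),  # the desire itself is not painful
        ("touching", pain, "end"),             # the pain arrives here
    ]
    for _ in range(episodes):
        for state, reward, next_state in trajectory:
            td_error = reward + gamma * value[next_state] - value[state]
            value[state] += alpha * td_error
    return value


values = run_td0()
print(values)
```

In this toy setup the thought that merely precedes the painful one ends up with a negative value of its own, slightly discounted relative to "touching", even though it was never directly paired with pain.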
Happy to discuss more details; see also §10.5.4 here.
Some clarifications:
I’m not listing every desire & meta-preference that occurs in this hot-stove story, just a small subset of them, and other desires (and meta-desires) that I didn’t mention could even be pushing in the opposite direction.
I’m not listing the only (or necessarily even primary) way that meta-preferences can arise, just one of the ways. In particular, in neurotypical humans, my hunch is that the strongest meta-preferences tend to arise (directly or indirectly) from social instincts, and I’m currently hazy on the mechanistic details of that.
If you tell me that you’d like to make an AGI with meta-preference X, and you ask me what procedure to follow such that this will definitely happen, my answer right now is basically “I don’t know, sorry”.
So, when I try to translate this line of thinking into the context of deception (or other instrumentally undesirable behaviors), I notice that I mostly can’t tell what “touching the hot stove” ends up corresponding to. This might seem like a nitpick, but I think it’s actually quite a crucial distinction: by substituting a complex phenomenon like deceptive (manipulative) behavior for a simpler (approximately atomic) action like “touching a hot stove”, I think your analogy has elided some important complexities that arise specifically in the context of deception (strategic operator-manipulation).
When it comes to deception (strategic operator-manipulation), the “hot stove” equivalent isn’t a single, easily identifiable action or event; instead, it’s a more abstract concept that manifests in various forms and contexts. In practice, I would initially expect the “hot stove” flinches the system experiences to correspond to whatever (simplistic) deception-predicates were included in its reward function. Starting from those flinches, and growing from them a reflectively coherent desire, strikes me as requiring a substantial amount of reflection—including reflection on what, exactly, those flinches are “pointing at” in the world. I expect any such reflection process to be significantly more complicated in the context of deception than in the case of a simple action like “touching a hot stove”.
In other words: on my model, the thing that you describe (i.e. ending up with a reflectively consistent and endorsed desire to avoid deception) must first route through the kind of path Nate describes in his story—one where the nascent AGI (i) notices various blocks on its thought processes, (ii) initializes and executes additional cognitive strategies to investigate those blocks, and (iii) comes to understand the underlying reasons for those blocks. Only once equipped with that understanding can the system (on my model) do any kind of “reflection” and arrive at an endorsed desire.
But when I lay things out like this, I notice that my intuition quite concretely expects that this process will not shake out in a safe way. I expect the system to notice the true fact that [whatever object-level goals it may have] are being impeded by the “hot stove” flinches, and that it may very well prioritize the satisfaction of its object-level goals over avoiding [the real generator of the flinches]. In other words, I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.
(Incidentally, my Nate model agrees quite strongly with the above, and considers it a strong reason why he views this kind of reflection as “inherently” dangerous.)
Based on what you wrote in your bullet points, I take it you don’t necessarily disagree with anything I just wrote (hence your talk of being “hazy on the mechanistic details” and “I don’t know, sorry” being your current answer to making an AGI with a certain meta-preference). It’s plausible to me that our primary disagreement here stems from my being substantially less optimistic about these details being solvable via “simple” methods.
it may very well prioritize the satisfaction of its object-level goals over avoiding [the real generator of the flinches]. In other words, I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.
Yeah, if we make an AGI that desires two things A&B that trade off against each other, then the desire for A would flow into a meta-preference to not desire B, and if that effect is strong enough then the AGI might self-modify to stop desiring B.
If you’re saying that this is a possible failure mode, yes I agree.
If you’re saying that this is an inevitable failure mode, that’s at least not obvious to me.
I don’t see why two desires that trade off against each other can’t possibly stay balanced in a reflectively-stable way. Happy to dive into details on that. For example, if a rational agent has utility function log(A)+log(B) (or sorta-equivalently, A×B), then the agent will probably split its time / investment between A & B, and that’s sorta an existence proof that you can have a reflectively-stable agent that “desires two different things”, so to speak. See also a bit of discussion here.
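As a rough numerical illustration of the log(A)+log(B) point, the sketch below does a brute-force search over how to divide a fixed unit budget between A and B. The unit budget, the grid search, and the linear utility used for contrast are assumptions added for the example; they are not part of the original argument.

```python
import math


def best_split(utility, steps=1000):
    """Return the fraction of a unit budget given to A that maximizes utility.

    Candidates are strictly between 0 and 1 so that log() stays defined.
    """
    candidates = [i / steps for i in range(1, steps)]
    return max(candidates, key=lambda a: utility(a, 1.0 - a))


# Utility log(A) + log(B): the optimum is an interior split.
log_split = best_split(lambda a, b: math.log(a) + math.log(b))

# A linear utility (hypothetical weights) for contrast: the optimum is
# pushed as far toward one side as the grid allows.
linear_split = best_split(lambda a, b: 2.0 * a + 1.0 * b)

print(log_split, linear_split)
```

Under these assumptions the log utility keeps both goods in play, whereas the linear one abandons B almost entirely, which is the sense in which the former is a (toy) example of stably "desiring two different things".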
I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.
I’m not sure where you’re getting the “more likely” from. I wonder if you’re sneaking in assumptions in your mental picture, like maybe an assumption that the deception events were only slightly aversive (annoying), or maybe an assumption that the nanotech thing is already cemented in as a very strong reflectively-endorsed (“ego-syntonic” in the human case) goal before any of the aversive deception events happen?
In my mind, it should be basically symmetric:
Pursuing a desire to be non-deceptive makes it harder to invent nanotech.
Pursuing a desire to invent nanotech makes it harder to be non-deceptive.
One of these can be at a disadvantage for contingent reasons—like which desire is stronger vs weaker, which desire appeared first vs second, etc. But I don’t immediately see why nanotech constitutionally has a systematic advantage over non-deception.
[deception is] a more abstract concept that manifests in various forms and context…
I go through an example with the complex messy concept of “human flourishing” in this post. But don’t expect to find any thoughtful elegant solution! :-P Here’s the relevant part:
Q: What about ontological crises / what Stuart Armstrong calls “Concept Extrapolation” / what Scott Alexander calls “the tails coming apart”? In other words, as the AGI learns more and/or considers out-of-distribution plans, it might come to find that the web-of-associations corresponding to the “human flourishing” concept is splitting apart. Then what does it do?
A: I talk about that much more in §14.4 here, but basically I don’t know. The plan here is to just hope for the best. More specifically: As the AGI learns new things about the world, and as the world itself changes, the “human flourishing” concept will stop pointing to a coherent “cluster in thingspace”, and the AGI will decide somehow or other what it cares about, in its new understanding of the world. According to the plan discussed in this blog post, we have no control over how that process will unfold and where it will end up. Hopefully somewhere good, but who knows?
Thanks again for responding! My response here is going to be out-of-order w.r.t. your comment, as I think the middle part here is actually the critical bit:
I’m not sure where you’re getting the “more likely” from. I wonder if you’re sneaking in assumptions in your mental picture, like maybe an assumption that the deception events were only slightly aversive (annoying), or maybe an assumption that the nanotech thing is already cemented in as a very strong reflectively-endorsed (“ego-syntonic” in the human case) goal before any of the aversive deception events happen?
In my mind, it should be basically symmetric:
Pursuing a desire to be non-deceptive makes it harder to invent nanotech.
Pursuing a desire to invent nanotech makes it harder to be non-deceptive.
One of these can be at a disadvantage for contingent reasons—like which desire is stronger vs weaker, which desire appeared first vs second, etc. But I don’t immediately see why nanotech constitutionally has a systematic advantage over non-deception.
So, I do think there’s an asymmetry here, in that I mostly expect “avoid deception” is a less natural category than “invent nanotech”, and is correspondingly a (much) smaller target in goal-space—one (much) harder to hit using only crude tools like reward functions based on simple observables. That claim is a little abstract for my taste, so here’s an attempt to convey a more concrete feel for the intuition behind it:
Early on during training (when the system can’t really be characterized as “trying” to do anything), I expect that a naive attempt at training against deception, while simultaneously training towards an object-level goal like “invent nanotech” (or, perhaps more concretely, “engage in iterative experimentation with the goal of synthesizing proteins well-suited for task X”), will involve a reward function that looks a whole lot more like an “invent nanotech” reward function, plus a bunch of deception-predicates that apply negative reward (“flinches”) to matching thoughts, than it will an “avoid deception” reward function, plus a bunch of “invent nanotech”-predicates that apply reward based on… I’m not even sure what the predicates in question would look like, actually.
I think this evinces a deep difference between “avoid deceptive behavior” and “invent nanotech”, whose True Name might be something like… the former is an injunction against a large category of possible behaviors, whereas the latter is an exhortation towards a concrete goal (while proposing few-to-no constraints on the path toward said goal). Insofar as I expect specifying a concrete goal to be easier than specifying a whole category of behaviors (especially when the category in question may not be at all “natural”), I think I likewise expect reward functions attempting to do both things at once to be much better at actually zooming in on something like “invent nanotech”, while being limited to doing “flinch-like” things for “don’t manipulate your operators”—which would, in practice, result in a reward function that looks basically like what I described above.
I think, with this explanation in hand, I feel better equipped to go back and address the first part of your comment:
Yeah, if we make an AGI that desires two things A&B that trade off against each other, then the desire for A would flow into a meta-preference to not desire B, and if that effect is strong enough then the AGI might self-modify to stop desiring B.
I mostly don’t think I want to describe an AGI trained to invent nanotech while avoiding deceptive/manipulative behavior as “an AGI that simultaneously [desires to invent nanotech] and [desires not to deceive its operators]”. Insofar as I expect an AGI trained that way to end up with “desires” we might characterize as “reflective, endorsed, and coherent”, I mostly don’t expect any “flinch-like” reflexes instilled during training to survive reflection and crystallize into anything at all.
I would instead say: a nascent AGI has no (reflective) desires to begin with, and as its cognition is shaped during training, it acquires various cognitive strategies in response to that training, some of which might be characterized as “strategic”, and others of which might be characterized as “reflexive”—and I expect the former to have a much better chance than the latter of making it into the AGI’s ultimate values.
More concretely, I continue to endorse this description (from my previous comment) of what I expect to happen to an AGI system working on assembling itself into a coherent agent:
Starting from those flinches, and growing from them a reflectively coherent desire, strikes me as requiring a substantial amount of reflection—including reflection on what, exactly, those flinches are “pointing at” in the world. I expect any such reflection process to be significantly more complicated in the context of deception than in the case of a simple action like “touching a hot stove”.
In other words: on my model, the thing that you describe (i.e. ending up with a reflectively consistent and endorsed desire to avoid deception) must first route through the kind of path Nate describes in his story—one where the nascent AGI (i) notices various blocks on its thought processes, (ii) initializes and executes additional cognitive strategies to investigate those blocks, and (iii) comes to understand the underlying reasons for those blocks. Only once equipped with that understanding can the system (on my model) do any kind of “reflection” and arrive at an endorsed desire.
That reflection process, on my model, is a difficult gauntlet to pass through (I actually think we observe this to some extent even for humans!), and many reflexive (flinch-like) behaviors don’t make it through the gauntlet at all. It’s for this reason that I think the plan you describe in that quoted Q&A is… maybe not totally doomed (though my Eliezer and Nate models certainly think so!), but still mostly doomed.
I mean, I’m not making a strong claim that we should punish an AGI for being deceptive and that will definitely indirectly lead to an AGI with an endorsed desire to be non-deceptive. There are a lot of things that can go wrong there. To pick one example, we’re also simultaneously punishing the AGI for “getting caught”. I hope we can come up with a better plan than that, e.g. a plan that “finds” the AGI’s self-concept using interpretability tools, and then intervenes on meta-preferences directly. I don’t have any plan for that, and it seems very hard for various reasons, but it’s one of the things I’ve been thinking about. (My “plan” here, such as it is, doesn’t really look like that.)
Hmm, reading between the lines, I wonder if your intuitions are sensing a more general asymmetry where rewards are generally more likely to lead to reflectively-endorsed preferences than punishments? If so, that seems pretty plausible to me, at least other things equal.
Mechanistically: If “nanotech” has positive valence, then “I am working on nanotech” would inherit some positive valence too (for details, see the paragraph that starts “slightly more detail”), as would brainstorming how to get nanotech, etc. Whereas if “deception” has negative valence (a.k.a. is aversive), then the very act of brainstorming whether something might be deceptive would itself be somewhat aversive, again for reasons mentioned here.
This is kinda related to confirmation bias. If the idea “my plan will fail” or “I’m wrong” is aversive, then “brainstorming how my plan might fail” or “brainstorming why I’m wrong” is somewhat aversive too. So people don’t do it. It’s just a deficiency of this kind of algorithm. It’s obviously not a fatal deficiency—at least some humans, sometimes, avoid confirmation bias. Basically, I think the trained model can learn a meta-heuristic that recognizes these situations (at least sometimes) and strongly votes to brainstorm anyway.
By the same token, I think it is true that the human brain RL algorithm has a default behavior of being less effective at avoiding punishments than seeking out rewards, because, again, brainstorming how to avoid punishments is aversive, and brainstorming how to get rewards is pleasant. (And the reflectively-endorsed-desire thing is a special case of that, or at least closely related.)
This deficiency in the algorithm might get magically patched over by a learned meta-heuristic, in which case maybe a set of punishments could lead to a reflectively-endorsed preference despite the odds stacked against it. We can also think about how to mitigate that problem by “rewarding the algorithm for acting virtuously” rather than punishing it for acting deceptively, or whatever.
(NB: I called that aspect of the brain algorithm a “deficiency” rather than “flaw” or “bug” because I don’t think it’s fixable without losing essential aspects of intelligence. I think the only way to get a “rational” AGI without confirmation bias etc. is to have the AGI read the Sequences, or rediscover the same ideas, or whatever, same as us humans, thus patching over all the algorithmic quirks with learned meta-heuristics. I think this is an area where I disagree with Nate & Eliezer.)
Yeah, thanks for engaging with me! You’ve definitely given me some food for thought, which I will probably go and chew on for a bit, instead of immediately replying with anything substantive. (The thing about rewards being more likely to lead to reflectively endorsed preferences feels interesting to me, but I don’t have fully put-together thoughts on that yet.)
So, I do think there’s an asymmetry here, in that I mostly expect “avoid deception” is a less natural category than “invent nanotech”, and is correspondingly a (much) smaller target in goal-space—one (much) harder to hit using only crude tools like reward functions based on simple observables.
Has it been quantitatively argued somewhere at all why such naturalness matters? Like, it’s conceivable that “avoid deception” is harder to train, but why so much harder that we can’t overcome this with training data bias or something? Because it does work in humans. And “invent nanotech” or “write poetry” are also small targets and training works for them.
Has it been quantitatively argued somewhere at all why such naturalness matters?
I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast—it’s literally the powerset of the set of things under consideration, which itself is no slouch in terms of size—and so picking out a particular abstraction from that space, using a non-combinatorially vast amount of training data, is going to be impossible for all but a superminority of “privileged” abstractions.
In this frame, misgeneralization is what happens when your (non-combinatorially vast) training data fails to specify a particular concept, and you end up learning an alternative abstraction that is consistent with the data, but doesn’t generalize as expected. This is why naturalness matters: because the more “natural” a category or abstraction is, the more likely it is to be one of those privileged abstractions that can be learned from a relatively small amount of data.
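The counting behind this argument can be checked directly on a tiny example. The sketch below is only an illustration: the ten-item universe and the three labeled examples are arbitrary choices, and treating every subset as an equally available "category" is exactly the assumption that the naturalness argument says a real learner does not satisfy.

```python
from itertools import combinations


def all_categories(items):
    """Every subset of `items`, i.e. every candidate category (the powerset)."""
    return [
        frozenset(subset)
        for size in range(len(items) + 1)
        for subset in combinations(items, size)
    ]


def consistent(categories, labels):
    """Keep the categories that agree with every labeled example.

    `labels` maps an item to True (in the category) or False (not in it).
    """
    return [
        category
        for category in categories
        if all((item in category) == is_member for item, is_member in labels.items())
    ]


items = list(range(10))
categories = all_categories(items)      # 2**10 candidate categories
labels = {0: True, 1: True, 2: False}   # three labeled examples
remaining = consistent(categories, labels)

print(len(categories), len(remaining))
```

Each labeled example only halves the set of candidates, so with n items and m labels there are still 2^(n-m) categories that fit the data; without some prior preference among them, the data alone does not pick out one.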
Of course, that doesn’t establish that “deceptive behavior” is an unnatural category per se—but I would argue that our inability to pin down a precise definition of deceptive behavior, along with the complexity and context-dependency of the concept, suggests that it may not be one of those privileged, natural abstractions. In other words, learning to avoid deceptive behavior might require a lot more data and nuanced understanding than learning more natural categories—and unfortunately, neither of those seem (to me) to be very easily achievable!
(See also: previous comment. :-P)
Like, it’s conceivable that “avoid deception” is harder to train, but why so much harder that we can’t overcome this with training data bias or something?
Having read my above response, it should (hopefully) be predictable enough what I’m going to say here. The bluntest version of my response might take the form of a pair of questions: whence the training data? And whence the bias?
It’s all well and good to speak abstractly of “inductive bias”, “training data bias”, and whatnot, but part of the reason I keep needling for concreteness is that, whenever I try to tell myself a story about a version of events where things go well—where you feed the AI a merely ordinarily huge amount of training data, and not a combinatorially huge amount—I find myself unable to construct a plausible story that doesn’t involve some extremely strong assumptions about the structure of the problem.
The best I can manage, in practice, is to imagine a reward function with simplistic deception-predicates hooked up to negative rewards, which basically zaps the system every time it thinks a thought matching one (or more) of the predicates. But as I noted in my previous comment(s), all this approach seems likely to achieve is instilling a set of “flinch-like” reflexes into the system—and I think such reflexes are unlikely to unfold into any kind of reflectively stable (“ego-syntonic”, in Steven’s terms) desire to avoid deceptive/manipulative behavior.
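For concreteness, the kind of reward function being imagined here might look roughly like the sketch below. Everything in it is hypothetical: the keyword-style predicates, the `task_progress` score, and the `zap` penalty are stand-ins invented for illustration, not a description of any real training setup.

```python
from typing import Callable, List

Predicate = Callable[[str], bool]

# Hypothetical, deliberately simplistic deception-predicates.
DECEPTION_PREDICATES: List[Predicate] = [
    lambda thought: "deceive" in thought.lower(),
    lambda thought: "lie to" in thought.lower(),
    lambda thought: "hide this from" in thought.lower(),
]


def reward(thoughts: List[str], task_progress: float, zap: float = 1.0) -> float:
    """Task reward minus a fixed penalty for each thought matching any predicate."""
    flinches = sum(
        1
        for thought in thoughts
        if any(predicate(thought) for predicate in DECEPTION_PREDICATES)
    )
    return task_progress - zap * flinches


flagged = reward(
    ["I could lie to the operators about the assay"], task_progress=0.5
)
unflagged = reward(
    ["I will phrase the report so nobody asks about the assay"], task_progress=0.5
)
print(flagged, unflagged)
```

The second thought has much the same intent as the first but matches none of the predicates, so it is not penalized. That is the sense in which such predicates track surface features of thoughts rather than deception itself.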
Because it does work in humans.
Yeah, I mostly think this is because humans come with the attendant biases “built in” to their prior. (But also: even with this, humans don’t reliably avoid deceiving other humans!)
And “invent nanotech” or “write poetry” are also small targets and training works for them.
Well, notably not “invent nanotech” (not yet, anyway :-P). And as for “write poetry”, it’s worth noting that this capability seems to have arisen as a consequence of a much more general training task (“predict the next token”), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.
(Situating “avoid deception” as part of a larger task, meanwhile, seems like a harder ask.)
I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast
Hence my point about poetry: the combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. That poetry, for which we also don’t have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact or that training lets us traverse it faster.
Yeah, I mostly think this is because humans come with the attendant biases “built in” to their prior.
These biases are quite robust to perturbations, so they can’t be too precise. And genes are not long enough to encode something too unnatural. And we have billions of examples to help us reverse engineer it. And we already have similar in some ways architecture working.
It’s all well and good to speak abstractly of “inductive bias”, “training data bias”, and whatnot, but part of the reason I keep needling for concreteness is that, whenever I try to tell myself a story about a version of events where things go well—where you feed the AI a merely ordinarily huge amount of training data, and not a combinatorially huge amount—I find myself unable to construct a plausible story that doesn’t involve some extremely strong assumptions about the structure of the problem.
Why AI caring about diamondoid-shelled bacterium is plausible? You can say pretty much the same things about how AI would learn reflexes to not think about bad consequences, or something. Misgeneralization of capabilities actually happens in practice. But if you assume previous training could teach that, why not assume 10 times more honesty training before the time AI thought about translating technique got it thinking “well, how I’m going to explain this to operators?”. Otherwise you just moving your assumption about combinatorial differences from intuition to the concrete example and then what’s the point?
Hence my point about poetry: the combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. That poetry, for which we also don’t have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact or that training lets us traverse it faster.
There is (on my model) a large disanalogy between writing poetry and avoiding deception—part of which I pointed out in the penultimate paragraph of my previous comment. See below:
And as for “write poetry”, it’s worth noting that this capability seems to have arisen as a consequence of a much more general training task (“predict the next token”), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.
AFAICT, this basically refutes the “combinatorial argument” for poetry being difficult to specify (while not doing the same for something like “deception”), since poetry is in fact not specified anywhere in the system’s explicit objective. (Meanwhile, the corresponding strategy for “deception”—wrapping it up in some outer objective—suffers from the issue of that outer objective being similarly hard to specify. In other words: part of the issue with the deception concept is not only that it’s a small target, but that it has a strange shape, which even prevents us from neatly defining a “convex hull” guaranteed to enclose it.)
However, perhaps the more relevantly disanalogous aspect (the part that I think more or less sinks the remainder of your argument) is that poetry is not something where getting it slightly wrong kills us. Even if it were the case that poetry is an “anti-natural” concept (in whatever sense you want that to mean), all that says is e.g. we might observe two different systems producing slightly different category boundaries—i.e. maybe there’s a “poem” out there consisting largely of what looks like unmetered prose, which one system classifies as “poetry” and the other doesn’t (or, plausibly, the same system gives different answers when sampled multiple times). This difference in edge case assessment doesn’t (mis)generalize to any kind of dangerous behavior, however, because poetry was never about reality (which is why, you’ll notice, even humans often disagree on what constitutes poetry).
This doesn’t mean that the system can’t write very poetic-sounding things in the meantime; it absolutely can. Also: a system trained on descriptions of deceptive behavior can, when prompted to generate examples of deceptive behavior, come up with perfectly admissible examples of such. The central core of the concept is shared across many possible generalizations of that concept; it’s the edge cases where differences start showing up. But—so long as the central core is there—a misgeneralization about poetry is barely a “misgeneralization” at all, so much as it is one more opinion in a sea of already-quite-different opinions about what constitutes “true poetry”. A “different opinion” about what constitutes deception, on the other hand, is quite likely to turn into some quite nasty behaviors as the system grows in capability—the edge cases there matter quite a bit more!
(Actually, the argument I just gave can be viewed as a concrete shadow of the “convex hull” argument I gave initially; what it’s basically saying is that learning “poetry” is like drawing a hypersphere around some sort of convex polytope, whereas learning about “deception” is like trying to do the same for an extremely spiky shape, with tendrils extending all over the place. You might capture most of the shape’s volume, but the parts of it you don’t capture matter!)
These biases are quite robust to perturbations, so they can’t be too precise. And genes are not long enough to encode something too unnatural. And we have billions of examples to help us reverse engineer it. And we already have similar in some ways architecture working.
I’m not really able to extract a broader point out of this paragraph, sorry. These sentences don’t seem very related to each other? Mostly, I think I just want to take each sentence individually and see what comes out of it.
“These biases are quite robust to perturbations, so they can’t be too precise.” I don’t think there’s good evidence for this either way; humans are basically all trained “on-distribution”, so to speak. We don’t have observations for what happens in the case of “large” perturbations (that don’t immediately lead to death or otherwise life-impairing cognitive malfunction). Also, even on-distribution, I don’t know that I’d describe the resulting behavior as “robust”; see below.
“And genes are not long enough to encode something too unnatural.” Sure—which is why genes don’t encode things like “don’t deceive others”; instead, they encode proxy emotions like empathy and social reciprocation—which in turn break all the time, for all sorts of reasons. Doesn’t seem like a good model to emulate!
“And we have billions of examples to help us reverse engineer it.” Billions of examples of what? Reverse engineer what? Again, in the vein of my previous requests: I’d like to see some concreteness, here. There’s a lot of work you’re hiding inside of those abstract-sounding phrases.
“And we already have similar in some ways architecture working.” I think I straightforwardly don’t know what this is referring to, sorry. Could you give an example or three?
On the whole, my response to this part of your comment is probably best described as “mildly bemused”, with maybe a side helping of “gently skeptical”.
Why AI caring about diamondoid-shelled bacterium is plausible? You can say pretty much the same things about how AI would learn reflexes to not think about bad consequences, or something. Misgeneralization of capabilities actually happens in practice. But if you assume previous training could teach that, why not assume 10 times more honesty training before the time AI thought about translating technique got it thinking “well, how I’m going to explain this to operators?”. Otherwise you just moving your assumption about combinatorial differences from intuition to the concrete example and then what’s the point?
I think (though I’m not certain) that what you’re trying to say here is that the same arguments I made for “deceiving the operators” being a hard thing to train out of a (sufficiently capable) system, double as arguments against the system acquiring any advanced capabilities (e.g. engineering diamondoid-shelled bacteria) at all. In which case: I… disagree? These two things—not being deceptive vs being good at engineering—seem like two very different targets with vastly different structures, and it doesn’t look to me like there’s any kind of thread connecting the two.
(I should note that this feels quite similar to the poetry analogy you made—which also looks to me like it simply presented another, unrelated task, and then declared by fiat that learning this task would have strong implications for learning the “avoid deception” task. I don’t think that’s a valid argument, at least without some more concrete reason for expecting these tasks to share relevant structure.)
As for “10 times more honesty training”, well: it’s not clear to me how that would work in practice. I’ve already argued that it’s not as simple as just giving the AI more examples of honesty or increasing the weight of honesty-related data points; you can give it all the data in the world, but if that data is all drawn from an impoverished distribution, it’s not going to help much. The main issue here isn’t the quantity of training data, but rather the structure of the training process and the kind of data the system needs in order to learn the concept of deception and the injunction against it in a way that doesn’t break as it grows in capability.
To use a rough analogy: you can’t teach someone to be fluent in a foreign language just by exposing them to ten times more examples of a single sentence. Similarly, simply giving an AI more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.
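To make the “impoverished distribution” point a bit more concrete, here is a toy sketch. It is only an illustration of the general statistical phenomenon, not a model of honesty training: the target function `x * x`, the narrow training range [0, 1], the probe point `x = 3`, and all function names are assumptions I picked for simplicity.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def target(x):
    """Hypothetical 'true concept' the learner is supposed to capture."""
    return x * x


def errors(n_samples, probe=3.0):
    """Fit on n_samples evenly spaced points from the narrow range [0, 1].

    Returns (worst error on the training range, error at the far-away probe).
    """
    xs = [i / (n_samples - 1) for i in range(n_samples)]
    slope, intercept = fit_line(xs, [target(x) for x in xs])

    def predict(x):
        return slope * x + intercept

    in_range = max(abs(predict(x) - target(x)) for x in xs)
    out_of_range = abs(predict(probe) - target(probe))
    return in_range, out_of_range


# Ten times (and a hundred times) more data, all from the same narrow range.
for n in (10, 100, 1000):
    in_range, out_of_range = errors(n)
    print(f"n={n:5d}  in-range error={in_range:.3f}  error at x=3: {out_of_range:.3f}")
```

In this toy setup the fit looks fine on the range it was trained on, while the error at the far-away probe stays above 6 no matter how many more samples are drawn from [0, 1]. Adding data only helps where the data actually lives, which is the sense in which I mean that quantity alone doesn't address the problem.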
(And, just to state the obvious: a superintelligence would, by then, be capable of figuring out for itself what the humans were trying to do with those simplistic deception-predicates they fed it, but by that point it would be far too late; that understanding would not factor into the AI’s decision-making process, as its drives would have already been shaped by its earlier training and generalization experiences. In other words, it’s not enough for the AI to understand human intentions after the fact; it needs to learn and internalize those intentions during its training process, so that they form the basis for its behavior as it becomes more capable.)
Anyway, since this comment has become quite long, here’s a short (ChatGPT-assisted) summary of the main points:
The combinatorial argument for poetry does not translate directly to the problem of avoiding deception. Poetry and deception are different concepts, with different structures and implications, and learning one doesn’t necessarily inform us about the difficulty of learning the other.
Misgeneralizations about poetry are not dangerous in the same way that misgeneralizations about deception might be. Poetry is a more subjective concept, and differences in edge case assessment do not lead to dangerous behavior. On the other hand, differing opinions on what constitutes deception can lead to harmful consequences as the system’s capabilities grow.
The issue with learning to avoid deception is not about the quantity of training data, but rather about the structure of the training process and the kind of data needed for the AI to learn and internalize the concept in a way that remains stable as it increases in capability.
Simply providing more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.