I think your example was doomed from the start because
the AGI was exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “the nanotech problem will get solved”,
the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.
So the latter is obviously doomed to get crushed by a sufficiently-intelligent AGI.
If we can get to a place where the first bullet point still holds, but the AGI also has a comparably-strong, explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”, then we’re in a situation where the AGI is applying its formidable intelligence to fight for both bullet points, not just the first one. And then we can be more hopeful that the second bullet point won’t get crushed. (Related.)
In particular, if we can pull that off, then the AGI would presumably do “intelligent” things to advance the second bullet point, just like it does “intelligent” things to advance the first bullet point in your story. For example, the AGI might brainstorm subtle ways that its plans might pattern-match to deception, and feel great relief (so to speak) at noticing and avoiding those problems before they happen. And likewise, it might brainstorm clever ways to communicate more clearly with its supervisor, and treat those as wonderful achievements (so to speak). Etc.
Of course, there remains the very interesting open question of how to reliably get to a place where the AGI has an explicit, endorsed, strong desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno. (More detailed discussion here.) For example, most humans get zapped with positive reward when they eat yummy ice cream, and yet the USA population seems to have wound up pretty spread out along the spectrum from fully endorsing the associated desire as ego-syntonic (“Eating ice cream is friggin awesome!”) to fully rejecting & externalizing it as ego-dystonic (“I sometimes struggle with a difficult-to-control urge to eat ice cream”). Again, I think there are important open questions about how this process works, and more to the point, how to intervene on it for an AGI.
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.
Yeah, so this is the part that I (even on my actual model) find implausible (to say nothing of my Nate/Eliezer/MIRI models, which basically scoff and say accusatory things about anthropomorphism here). I think what would really help me understand this is a concrete story—akin to the story Nate told in the top-level post—in which the “maybe” branch actually happens—where the AGI, after being zapped with enough negative reward, forms a “reflectively-endorsed desire to be helpful / docile / etc.”, so that I could poke at that story to see if / where it breaks.
(I recognize that this is a big ask! But I do think it, or something serving a similar function, needs to happen at some point for people’s abstract intuitions to “make contact with reality”, after a fashion, as opposed to being purely abstract all the time. This is something I’ve always felt, but it recently became starker after reading Holden’s summary of his conversation with Nate; I now think the disparity between having abstract high-level models with black-box concepts like “reflectively endorsed desires” and having a concrete mental picture of how things play out is critical for understanding, despite the latter being almost certainly wrong in the details.)
Sure. Let’s assume an AI that uses model-based RL of a similar flavor as (I believe) is used in human brains.
Step 1: The thought “I am touching the hot stove” becomes aversive because it’s what I was thinking when I touched the hot stove, which caused pain etc. For details see my discussion of “credit assignment” here.
Step 2A: The thought “I desire to touch the hot stove” also becomes aversive because of its intimate connection to “I am touching the hot stove” from Step 1 above—i.e., in reality, if I desire to touch the hot stove, then it’s much more likely that I will in fact touch the hot stove, and my internal models are sophisticated enough to have picked up on that fact.
Mechanistically, this can happen in either the “forward direction” (when I think “I desire to touch the hot stove”, my brain’s internal models explore the consequences, and that weakly activates “I am touching the hot stove” neurons, which in turn trigger aversion), or in the “backward direction” (involving credit assignment & TD learning, see the diagram about habit-formation here).
Anyway, if the thought “I desire to touch the hot stove” indeed becomes aversive, that’s a kind of meta-preference—a desire not to have a desire.
Step 2B: Conversely and simultaneously, the thought “I desire to not touch the hot stove” becomes appealing for basically similar reasons, and that’s another meta-preference (desire to have a desire).
Happy to discuss more details; see also §10.5.4 here.
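For concreteness, here is a deliberately minimal toy sketch of the “backward direction” mechanics in Steps 1 & 2A, using vanilla TD(0)-style value updates. Everything in it (the two states, the reward of −1, the hyperparameters) is an illustrative assumption of mine, not a claim about the actual brain algorithm:

```python
# Toy sketch (my own illustration, not from the linked posts): TD(0)-style
# credit assignment over a tiny chain of "thoughts". All states, rewards, and
# hyperparameters are made up for illustration.
#
#   "desire to touch stove"  ->  "touching stove"  ->  pain (reward = -1)

values = {"desire_touch_stove": 0.0, "touching_stove": 0.0}
alpha, gamma = 0.5, 0.9  # learning rate, discount factor

def td_update(state, reward, next_value):
    """Move values[state] toward the bootstrapped target reward + gamma * next_value."""
    target = reward + gamma * next_value
    values[state] += alpha * (target - values[state])

for _ in range(20):
    # Step 1: "touching the hot stove" is immediately followed by pain,
    # so that thought becomes aversive (its value goes negative).
    td_update("touching_stove", reward=-1.0, next_value=0.0)
    # Step 2A, "backward direction": the desire-thought preceded the touching-thought,
    # so it inherits some of that negative value via bootstrapping.
    td_update("desire_touch_stove", reward=0.0, next_value=values["touching_stove"])

print(values)
# Both values converge to negative numbers: the desire itself has become (mildly)
# aversive, which is a crude stand-in for "a desire not to have a desire".
```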
Some clarifications:
I’m not listing every desire & meta-preference that occurs in this hot-stove story, just a small subset of them, and other desires (and meta-desires) that I didn’t mention could even be pushing in the opposite direction.
I’m not listing the only (or necessarily even the primary) way that meta-preferences can arise, just one of the ways. In particular, in neurotypical humans, my hunch is that the strongest meta-preferences tend to arise (directly or indirectly) from social instincts, and I’m currently hazy on the mechanistic details of that.
If you tell me that you’d like to make an AGI with meta-preference X, and you ask me what procedure to follow such that this will definitely happen, my answer right now is basically “I don’t know, sorry”.
So, when I try to translate this line of thinking into the context of deception (or other instrumentally undesirable behaviors), I notice that I mostly can’t tell what “touching the hot stove” ends up corresponding to. This might seem like a nitpick, but I think it’s actually quite a crucial distinction: by substituting a simpler (approximately atomic) action like “touching a hot stove” for a complex phenomenon like deceptive (manipulative) behavior, I think your analogy has elided some important complexities that arise specifically in the context of deception (strategic operator-manipulation).
When it comes to deception (strategic operator-manipulation), the “hot stove” equivalent isn’t a single, easily identifiable action or event; instead, it’s a more abstract concept that manifests in various forms and contexts. In practice, I would initially expect the “hot stove” flinches the system experiences to correspond to whatever (simplistic) deception-predicates were included in its reward function. Starting from those flinches, and growing from them a reflectively coherent desire, strikes me as requiring a substantial amount of reflection—including reflection on what, exactly, those flinches are “pointing at” in the world. I expect any such reflection process to be significantly more complicated in the context of deception than in the case of a simple action like “touching a hot stove”.
In other words: on my model, the thing that you describe (i.e. ending up with a reflectively consistent and endorsed desire to avoid deception) must first route through the kind of path Nate describes in his story—one where the nascent AGI (i) notices various blocks on its thought processes, (ii) initializes and executes additional cognitive strategies to investigate those blocks, and (iii) comes to understand the underlying reasons for those blocks. Only once equipped with that understanding can the system (on my model) do any kind of “reflection” and arrive at an endorsed desire.
But when I lay things out like this, I notice that my intuition quite concretely expects that this process will not shake out in a safe way. I expect the system to notice the true fact that [whatever object-level goals it may have] are being impeded by the “hot stove” flinches, and that it may very well prioritize the satisfaction of its object-level goals over avoiding [the real generator of the flinches]. In other words, I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.
(Incidentally, my Nate model agrees quite strongly with the above, and considers it a strong reason why he views this kind of reflection as “inherently” dangerous.)
Based on what you wrote in your bullet points, I take it you don’t necessarily disagree with anything I just wrote (hence your talk of being “hazy on the mechanistic details” and “I don’t know, sorry” being your current answer to making an AGI with a certain meta-preference). It’s plausible to me that our primary disagreement here stems from my being substantially less optimistic about these details being solvable via “simple” methods.
it may very well prioritize the satisfaction of its object-level goals over avoiding [the real generator of the flinches]. In other words, I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.
Yeah, if we make an AGI that desires two things A&B that trade off against each other, then the desire for A would flow into a meta-preference to not desire B, and if that effect is strong enough then the AGI might self-modify to stop desiring B.
If you’re saying that this is a possible failure mode, yes I agree.
If you’re saying that this is an inevitable failure mode, that’s at least not obvious to me.
I don’t see why two desires that trade off against each other can’t possibly stay balanced in a reflectively-stable way. Happy to dive into details on that. For example, if a rational agent has utility function log(A)+log(B) (or sorta-equivalently, A×B), then the agent will probably split its time / investment between A & B, and that’s sorta an existence proof that you can have a reflectively-stable agent that “desires two different things”, so to speak. See also a bit of discussion here.
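To spell out the arithmetic behind that existence proof (a standard Lagrange-multiplier calculation; the fixed budget A + B = T is my simplifying stand-in for “splitting time / investment”):

```latex
\[
\max_{A + B = T,\; A,B > 0} \;\; \log A + \log B
\quad\Longrightarrow\quad
\frac{1}{A} = \lambda = \frac{1}{B}
\quad\Longrightarrow\quad
A = B = \tfrac{T}{2}.
\]
% Because the marginal utility 1/A blows up as A -> 0, the optimizer never spends
% everything on just one of the two goods: neither desire gets "crushed".
```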
I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.
I’m not sure where you’re getting the “more likely” from. I wonder if you’re sneaking in assumptions in your mental picture, like maybe an assumption that the deception events were only slightly aversive (annoying), or maybe an assumption that the nanotech thing is already cemented in as a very strong reflectively-endorsed (“ego-syntonic” in the human case) goal before any of the aversive deception events happen?
In my mind, it should be basically symmetric:
Pursuing a desire to be non-deceptive makes it harder to invent nanotech.
Pursuing a desire to invent nanotech makes it harder to be non-deceptive.
One of these can be at a disadvantage for contingent reasons—like which desire is stronger vs weaker, which desire appeared first vs second, etc. But I don’t immediately see why nanotech constitutionally has a systematic advantage over non-deception.
[deception is] a more abstract concept that manifests in various forms and contexts…
I go through an example with the complex messy concept of “human flourishing” in this post. But don’t expect to find any thoughtful elegant solution! :-P Here’s the relevant part:
Q: What about ontological crises / what Stuart Armstrong calls “Concept Extrapolation” / what Scott Alexander calls “the tails coming apart”? In other words, as the AGI learns more and/or considers out-of-distribution plans, it might come to find that the web-of-associations corresponding to the “human flourishing” concept is splitting apart. Then what does it do?
A: I talk about that much more in §14.4 here, but basically I don’t know. The plan here is to just hope for the best. More specifically: As the AGI learns new things about the world, and as the world itself changes, the “human flourishing” concept will stop pointing to a coherent “cluster in thingspace”, and the AGI will decide somehow or other what it cares about, in its new understanding of the world. According to the plan discussed in this blog post, we have no control over how that process will unfold and where it will end up. Hopefully somewhere good, but who knows?
Thanks again for responding! My response here is going to be out-of-order w.r.t. your comment, as I think the middle part here is actually the critical bit:
I’m not sure where you’re getting the “more likely” from. I wonder if you’re sneaking in assumptions in your mental picture, like maybe an assumption that the deception events were only slightly aversive (annoying), or maybe an assumption that the nanotech thing is already cemented in as a very strong reflectively-endorsed (“ego-syntonic” in the human case) goal before any of the aversive deception events happen?
In my mind, it should be basically symmetric:
Pursuing a desire to be non-deceptive makes it harder to invent nanotech.
Pursuing a desire to invent nanotech makes it harder to be non-deceptive.
One of these can be at a disadvantage for contingent reasons—like which desire is stronger vs weaker, which desire appeared first vs second, etc. But I don’t immediately see why nanotech constitutionally has a systematic advantage over non-deception.
So, I do think there’s an asymmetry here, in that I mostly expect “avoid deception” is a less natural category than “invent nanotech”, and is correspondingly a (much) smaller target in goal-space—one (much) harder to hit using only crude tools like reward functions based on simple observables. That claim is a little abstract for my taste, so here’s an attempt to convey a more concrete feel for the intuition behind it:
Early on during training (when the system can’t really be characterized as “trying” to do anything), I expect that a naive attempt at training against deception, while simultaneously training towards an object-level goal like “invent nanotech” (or, perhaps more concretely, “engage in iterative experimentation with the goal of synthesizing proteins well-suited for task X”), will involve a reward function that looks a whole lot more like an “invent nanotech” reward function, plus a bunch of deception-predicates that apply negative reward (“flinches”) to matching thoughts, than it will an “avoid deception” reward function, plus a bunch of “invent nanotech”-predicates that apply reward based on… I’m not even sure what the predicates in question would look like, actually.
I think this evinces a deep difference between “avoid deceptive behavior” and “invent nanotech”, whose True Name might be something like… the former is an injunction against a large category of possible behaviors, whereas the latter is an exhortation towards a concrete goal (while proposing few-to-no constraints on the path toward said goal). Insofar as I expect specifying a concrete goal to be easier than specifying a whole category of behaviors (especially when the category in question may not be at all “natural”), I think I likewise expect reward functions attempting to do both things at once to be much better at actually zooming in on something like “invent nanotech”, while being limited to doing “flinch-like” things for “don’t manipulate your operators”—which would, in practice, result in a reward function that looks basically like what I described above.
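To make the asymmetry I’m gesturing at slightly more tangible, here’s a deliberately crude sketch of the kind of reward function I’d expect from such a naive attempt. Every name, predicate, and number below is a hypothetical stand-in of mine, not a real API or a real proposal:

```python
# Hypothetical sketch of a "naive" reward function of the shape described above:
# a dense, graded exhortation toward the object-level goal, plus a short blacklist
# of simplistic deception-predicates that apply negative reward ("flinches").
from types import SimpleNamespace

def nanotech_progress_score(episode) -> float:
    """Graded signal for the object-level goal, e.g. improvement on some
    protein-synthesis benchmark (made-up attribute for illustration)."""
    return episode.benchmark_improvement

DECEPTION_PREDICATES = [
    lambda thought: "hide this from the operators" in thought,
    lambda thought: "report a result I know to be false" in thought,
    # ...a handful of hand-written patterns, nowhere near the full,
    # context-dependent concept of deception.
]

def reward(episode) -> float:
    r = nanotech_progress_score(episode)            # exhortation toward a goal
    for thought in episode.flagged_thoughts:        # injunction via "flinches"
        if any(pred(thought) for pred in DECEPTION_PREDICATES):
            r -= 10.0                               # zap the matching thought
    return r

example = SimpleNamespace(benchmark_improvement=0.3,
                          flagged_thoughts=["hide this from the operators"])
print(reward(example))  # 0.3 - 10.0 = -9.7
```

Note that essentially all of the structure in that function is the “invent nanotech” part; the anti-deception part is just a blacklist of surface patterns, which is exactly the point.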
I think, with this explanation in hand, I feel better equipped to go back and address the first part of your comment:
Yeah, if we make an AGI that desires two things A&B that trade off against each other, then the desire for A would flow into a meta-preference to not desire B, and if that effect is strong enough then the AGI might self-modify to stop desiring B.
I mostly don’t think I want to describe an AGI trained to invent nanotech while avoiding deceptive/manipulative behavior as “an AGI that simultaneously [desires to invent nanotech] and [desires not to deceive its operators]”. Insofar as I expect an AGI trained that way to end up with “desires” we might characterize as “reflective, endorsed, and coherent”, I mostly don’t expect any “flinch-like” reflexes instilled during training to survive reflection and crystallize into anything at all.
I would instead say: a nascent AGI has no (reflective) desires to begin with, and as its cognition is shaped during training, it acquires various cognitive strategies in response to that training, some of which might be characterized as “strategic”, and others of which might be characterized as “reflexive”—and I expect the former to have a much better chance than the latter of making it into the AGI’s ultimate values.
More concretely, I continue to endorse this description (from my previous comment) of what I expect to happen to an AGI system working on assembling itself into a coherent agent:
Starting from those flinches, and growing from them a reflectively coherent desire, strikes me as requiring a substantial amount of reflection—including reflection on what, exactly, those flinches are “pointing at” in the world. I expect any such reflection process to be significantly more complicated in the context of deception than in the case of a simple action like “touching a hot stove”.
In other words: on my model, the thing that you describe (i.e. ending up with a reflectively consistent and endorsed desire to avoid deception) must first route through the kind of path Nate describes in his story—one where the nascent AGI (i) notices various blocks on its thought processes, (ii) initializes and executes additional cognitive strategies to investigate those blocks, and (iii) comes to understand the underlying reasons for those blocks. Only once equipped with that understanding can the system (on my model) do any kind of “reflection” and arrive at an endorsed desire.
That reflection process, on my model, is a difficult gauntlet to pass through (I actually think we observe this to some extent even for humans!), and many reflexive (flinch-like) behaviors don’t make it through the gauntlet at all. It’s for this reason that I think the plan you describe in that quoted Q&A is… maybe not totally doomed (though my Eliezer and Nate models certainly think so!), but still mostly doomed.
I mean, I’m not making a strong claim that we should punish an AGI for being deceptive and that will definitely indirectly lead to an AGI with an endorsed desire to be non-deceptive. There are a lot of things that can go wrong there. To pick one example, we’re also simultaneously punishing the AGI for “getting caught”. I hope we can come up with a better plan than that, e.g. a plan that “finds” the AGI’s self-concept using interpretability tools, and then intervenes on meta-preferences directly. I don’t have any plan for that, and it seems very hard for various reasons, but it’s one of the things I’ve been thinking about. (My “plan” here, such as it is, doesn’t really look like that.)
Hmm, reading between the lines, I wonder if your intuitions are sensing a more general asymmetry where rewards are generally more likely to lead to reflectively-endorsed preferences than punishments? If so, that seems pretty plausible to me, at least other things equal.
Mechanistically: If “nanotech” has positive valence, then “I am working on nanotech” would inherit some positive valence too (details—see the paragraph that starts “slightly more detail”), as would brainstorming how to get nanotech, etc. Whereas if “deception” has negative valence (a.k.a. is aversive), then the very act of brainstorming whether something might be deceptive would itself be somewhat aversive, again for reasons mentioned here.
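As a toy illustration of that valence-inheritance point (the concepts, weights, and numbers are all invented; this is a cartoon, not a model of the actual algorithm):

```python
# A thought inherits (some of) the valence of the concepts it weakly activates.
valence = {"nanotech": +1.0, "deception": -1.0}  # learned valences of two concepts

def thought_valence(activations):
    """activations: dict of concept -> activation strength for a given thought."""
    return sum(strength * valence[concept] for concept, strength in activations.items())

# "Brainstorm how to get nanotech" weakly activates a positively-valenced concept,
# so the brainstorming itself is mildly pleasant and tends to get done:
print(thought_valence({"nanotech": 0.3}))    # +0.3

# "Brainstorm whether my plan is deceptive" weakly activates an aversive concept,
# so the very act of checking is mildly aversive and tends to get skipped:
print(thought_valence({"deception": 0.3}))   # -0.3
```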
This is kinda related to confirmation bias. If the idea “my plan will fail” or “I’m wrong” is aversive, then “brainstorming how my plan might fail” or “brainstorming why I’m wrong” is somewhat aversive too. So people don’t do it. It’s just a deficiency of this kind of algorithm. It’s obviously not a fatal deficiency—at least some humans, sometimes, avoid confirmation bias. Basically, I think the trained model can learn a meta-heuristic that recognizes these situations (at least sometimes) and strongly votes to brainstorm anyway.
By the same token, I think it is true that the human brain RL algorithm has a default behavior of being less effective at avoiding punishments than seeking out rewards, because, again, brainstorming how to avoid punishments is aversive, and brainstorming how to get rewards is pleasant. (And the reflectively-endorsed-desire thing is a special case of that, or at least closely related.)
This deficiency in the algorithm might get magically patched over by a learned meta-heuristic, in which case maybe a set of punishments could lead to a reflectively-endorsed preference despite the odds stacked against it. We can also think about how to mitigate that problem by “rewarding the algorithm for acting virtuously” rather than punishing it for acting deceptively, or whatever.
(NB: I called that aspect of the brain algorithm a “deficiency” rather than “flaw” or “bug” because I don’t think it’s fixable without losing essential aspects of intelligence. I think the only way to get a “rational” AGI without confirmation bias etc. is to have the AGI read the Sequences, or rediscover the same ideas, or whatever, same as us humans, thus patching over all the algorithmic quirks with learned meta-heuristics. I think this is an area where I disagree with Nate & Eliezer.)
Yeah, thanks for engaging with me! You’ve definitely given me some food for thought, which I will probably go and chew on for a bit, instead of immediately replying with anything substantive. (The thing about rewards being more likely to lead to reflectively endorsed preferences feels interesting to me, but I don’t have fully put-together thoughts on that yet.)
So, I do think there’s an asymmetry here, in that I mostly expect “avoid deception” is a less natural category than “invent nanotech”, and is correspondingly a (much) smaller target in goal-space—one (much) harder to hit using only crude tools like reward functions based on simple observables.
Has it been quantitatively argued somewhere at all why such naturalness matters? Like, it’s conceivable that “avoid deception” is harder to train, but why so much harder that we can’t overcome this with training data bias or something? Because it does work in humans. And “invent nanotech” or “write poetry” are also small targets and training works for them.
Has it been quantitatively argued somewhere at all why such naturalness matters?
I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast—it’s literally the powerset of the set of things under consideration, which itself is no slouch in terms of size—and so picking out a particular abstraction from that space, using a non-combinatorially vast amount of training data, is going to be impossible for all but a superminority of “privileged” abstractions.
In this frame, misgeneralization is what happens when your (non-combinatorially vast) training data fails to specify a particular concept, and you end up learning an alternative abstraction that is consistent with the data, but doesn’t generalize as expected. This is why naturalness matters: because the more “natural” a category or abstraction is, the more likely it is to be one of those privileged abstractions that can be learned from a relatively small amount of data.
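As a back-of-the-envelope version of that counting argument (with the simplifying assumptions that a “category” is just an arbitrary binary labeling of N candidate things, and that each labeled example covers a distinct thing):

```latex
% 2^N candidate categories; each labeled example on a new item rules out exactly
% half of the labelings still consistent with the data, so after M labeled examples
\[
\underbrace{2^{N}}_{\text{candidate categories}} \;\big/\; 2^{M}
\;=\; 2^{\,N-M}
\]
% categories remain consistent. For N astronomically larger than any feasible M,
% the data alone cannot single out the intended category; whatever gets learned is
% determined almost entirely by which abstractions the learner's priors privilege.
```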
Of course, that doesn’t establish that “deceptive behavior” is an unnatural category per se—but I would argue that our inability to pin down a precise definition of deceptive behavior, along with the complexity and context-dependency of the concept, suggests that it may not be one of those privileged, natural abstractions. In other words, learning to avoid deceptive behavior might require a lot more data and nuanced understanding than learning more natural categories—and unfortunately, neither of those seem (to me) to be very easily achievable!
(See also: previous comment. :-P)
Like, it’s conceivable that “avoid deception” is harder to train, but why so much harder that we can’t overcome this with training data bias or something?
Having read my above response, you can (hopefully) predict well enough what I’m going to say here. The bluntest version of my response might take the form of a pair of questions: whence the training data? And whence the bias?
It’s all well and good to speak abstractly of “inductive bias”, “training data bias”, and whatnot, but part of the reason I keep needling for concreteness is that, whenever I try to tell myself a story about a version of events where things go well—where you feed the AI a merely ordinarily huge amount of training data, and not a combinatorially huge amount—I find myself unable to construct a plausible story that doesn’t involve some extremely strong assumptions about the structure of the problem.
The best I can manage, in practice, is to imagine a reward function with simplistic deception-predicates hooked up to negative rewards, which basically zaps the system every time it thinks a thought matching one (or more) of the predicates. But as I noted in my previous comment(s), all this approach seems likely to achieve is instilling a set of “flinch-like” reflexes into the system—and I think such reflexes are unlikely to unfold into any kind of reflectively stable (“ego-syntonic”, in Steven’s terms) desire to avoid deceptive/manipulative behavior.
Because it does work in humans.
Yeah, I mostly think this is because humans come with the attendant biases “built in” to their prior. (But also: even with this, humans don’t reliably avoid deceiving other humans!)
And “invent nanotech” or “write poetry” are also small targets and training works for them.
Well, notably not “invent nanotech” (not yet, anyway :-P). And as for “write poetry”, it’s worth noting that this capability seems to have arisen as a consequence of a much more general training task (“predict the next token”), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.
(Situating “avoid deception” as part of a larger task, meanwhile, seems like a harder ask.)
I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast
Hence my point about poetry—the combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. That poetry, for which we also don’t have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact, or that training lets us traverse it faster.
Yeah, I mostly think this is because humans come with the attendant biases “built in” to their prior.
These biases are quite robust to perturbations, so they can’t be too precise. And genes are not long enough to encode something too unnatural. And we have billions of examples to help us reverse engineer it. And we already have a similar (in some ways) architecture working.
It’s all well and good to speak abstractly of “inductive bias”, “training data bias”, and whatnot, but part of the reason I keep needling for concreteness is that, whenever I try to tell myself a story about a version of events where things go well—where you feed the AI a merely ordinarily huge amount of training data, and not a combinatorially huge amount—I find myself unable to construct a plausible story that doesn’t involve some extremely strong assumptions about the structure of the problem.
Why is the AI caring about the diamondoid-shelled bacterium plausible? You can say pretty much the same things about how the AI would learn reflexes to not think about bad consequences, or something. Misgeneralization of capabilities actually happens in practice. But if you assume previous training could teach that, why not assume 10 times more honesty training happened before the point where the AI’s thoughts about the translating technique got it thinking “well, how am I going to explain this to the operators?”. Otherwise you’re just moving your assumption about combinatorial differences from intuition into the concrete example, and then what’s the point?
Hence my point about poetry—the combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. That poetry, for which we also don’t have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact, or that training lets us traverse it faster.
There is (on my model) a large disanalogy between writing poetry and avoiding deception—part of which I pointed out in the penultimate paragraph of my previous comment. See below:
And as for “write poetry”, it’s worth noting that this capability seems to have arisen as a consequence of a much more general training task (“predict the next token”), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.
AFAICT, this basically refutes the “combinatorial argument” for poetry being difficult to specify (while not doing the same for something like “deception”), since poetry is in fact not specified anywhere in the system’s explicit objective. (Meanwhile, the corresponding strategy for “deception”—wrapping it up in some outer objective—suffers from the issue of that outer objective being similarly hard to specify. In other words: part of the issue with the deception concept is not only that it’s a small target, but that it has a strange shape, which even prevents us from neatly defining a “convex hull” guaranteed to enclose it.)
However, perhaps the more relevantly disanalogous aspect (the part that I think more or less sinks the remainder of your argument) is that poetry is not something where getting it slightly wrong kills us. Even if it were the case that poetry is an “anti-natural” concept (in whatever sense you want that to mean), all that says is e.g. we might observe two different systems producing slightly different category boundaries—i.e. maybe there’s a “poem” out there consisting largely of what looks like unmetered prose, which one system classifies as “poetry” and the other doesn’t (or, plausibly, the same system gives different answers when sampled multiple times). This difference in edge case assessment doesn’t (mis)generalize to any kind of dangerous behavior, however, because poetry was never about reality (which is why, you’ll notice, even humans often disagree on what constitutes poetry).
This doesn’t mean that the system can’t write very poetic-sounding things in the meantime; it absolutely can. Also: a system trained on descriptions of deceptive behavior can, when prompted to generate examples of deceptive behavior, come up with perfectly admissible examples of such. The central core of the concept is shared across many possible generalizations of that concept; it’s the edge cases where differences start showing up. But—so long as the central core is there—a misgeneralization about poetry is barely a “misgeneralization” at all, so much as it is one more opinion in a sea of already-quite-different opinions about what constitutes “true poetry”. A “different opinion” about what constitutes deception, on the other hand, is quite likely to turn into some quite nasty behaviors as the system grows in capability—the edge cases there matter quite a bit more!
(Actually, the argument I just gave can be viewed as a concrete shadow of the “convex hull” argument I gave initially; what it’s basically saying is that learning “poetry” is like drawing a hypersphere around some sort of convex polytope, whereas learning about “deception” is like trying to do the same for an extremely spiky shape, with tendrils extending all over the place. You might capture most of the shape’s volume, but the parts of it you don’t capture matter!)
These biases are quite robust to perturbations, so they can’t be too precise. And genes are not long enough to encode something too unnatural. And we have billions of examples to help us reverse engineer it. And we already have a similar (in some ways) architecture working.
I’m not really able to extract a broader point out of this paragraph, sorry. These sentences don’t seem very related to each other? Mostly, I think I just want to take each sentence individually and see what comes out of it.
“These biases are quite robust to perturbations, so they can’t be too precise.” I don’t think there’s good evidence for this either way; humans are basically all trained “on-distribution”, so to speak. We don’t have observations for what happens in the case of “large” perturbations (that don’t immediately lead to death or otherwise life-impairing cognitive malfunction). Also, even on-distribution, I don’t know that I describe the resulting behavior as “robust”—see below.
“And genes are not long enough to encode something too unnatural.” Sure—which is why genes don’t encode things like “don’t deceive others”; instead, they encode proxy emotions like empathy and social reciprocation—which in turn break all the time, for all sorts of reasons. Doesn’t seem like a good model to emulate!
“And we have billions of examples to help us reverse engineer it.” Billions of examples of what? Reverse engineer what? Again, in the vein of my previous requests: I’d like to see some concreteness, here. There’s a lot of work you’re hiding inside of those abstract-sounding phrases.
“And we already have a similar (in some ways) architecture working.” I think I straightforwardly don’t know what this is referring to, sorry. Could you give an example or three?
On the whole, my response to this part of your comment is probably best described as “mildly bemused”, with maybe a side helping of “gently skeptical”.
Why is the AI caring about the diamondoid-shelled bacterium plausible? You can say pretty much the same things about how the AI would learn reflexes to not think about bad consequences, or something. Misgeneralization of capabilities actually happens in practice. But if you assume previous training could teach that, why not assume 10 times more honesty training happened before the point where the AI’s thoughts about the translating technique got it thinking “well, how am I going to explain this to the operators?”. Otherwise you’re just moving your assumption about combinatorial differences from intuition into the concrete example, and then what’s the point?
I think (though I’m not certain) that what you’re trying to say here is that the same arguments I made for “deceiving the operators” being a hard thing to train out of a (sufficiently capable) system, double as arguments against the system acquiring any advanced capabilities (e.g. engineering diamondoid-shelled bacteria) at all. In which case: I… disagree? These two things—not being deceptive vs being good at engineering—seem like two very different targets with vastly different structures, and it doesn’t look to me like there’s any kind of thread connecting the two.
(I should note that this feels quite similar to the poetry analogy you made—which also looks to me like it simply presented another, unrelated task, and then declared by fiat that learning this task would have strong implications for learning the “avoid deception” task. I don’t think that’s a valid argument, at least without some more concrete reason for expecting these tasks to share relevant structure.)
As for “10 times more honesty training”, well: it’s not clear to me how that would work in practice. I’ve already argued that it’s not as simple as just giving the AI more examples of honesty or increasing the weight of honesty-related data points; you can give it all the data in the world, but if that data is all drawn from an impoverished distribution, it’s not going to help much. The main issue here isn’t the quantity of training data, but rather the structure of the training process and the kind of data the system needs in order to learn the concept of deception and the injunction against it in a way that doesn’t break as it grows in capability.
To use a rough analogy: you can’t teach someone to be fluent in a foreign language just by exposing them to ten times more examples of a single sentence. Similarly, simply giving an AI more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.
(And, just to state the obvious: while a superintelligence would be capable, at that point, of figuring out for itself what the humans were trying to do with those simplistic deception-predicates they fed it, by that point it would be significantly too late; that understanding would not factor into the AI’s decision-making process, as its drives would have already been shaped by its earlier training and generalization experiences. In other words, it’s not enough for the AI to understand human intentions after the fact; it needs to learn and internalize those intentions during its training process, so that they form the basis for its behavior as it becomes more capable.)
Anyway, since this comment has become quite long, here’s a short (ChatGPT-assisted) summary of the main points:
The combinatorial argument for poetry does not translate directly to the problem of avoiding deception. Poetry and deception are different concepts, with different structures and implications, and learning one doesn’t necessarily inform us about the difficulty of learning the other.
Misgeneralizations about poetry are not dangerous in the same way that misgeneralizations about deception might be. Poetry is a more subjective concept, and differences in edge case assessment do not lead to dangerous behavior. On the other hand, differing opinions on what constitutes deception can lead to harmful consequences as the system’s capabilities grow.
The issue with learning to avoid deception is not about the quantity of training data, but rather about the structure of the training process and the kind of data needed for the AI to learn and internalize the concept in a way that remains stable as it increases in capability.
Simply providing more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.
the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.
I am naively more scared about such an AI. That AI sounds more like one that, if I say “you’re not being helpful, please stop”, will respond “actually I thought about it, I disagree, I’m going to continue doing what I think is helpful”.
I think that, if an AGI has any explicit reflectively-endorsed desire whatsoever, then I can tell a similar scary story: The AGI’s desire isn’t quite what I wanted, so I try to correct it, and the AGI says no. (Unless the AGI’s explicit endorsed desires include / entail a desire to accept correction! Which most desires don’t!)
And yes, that is a scary story! It is the central scary story of AGI alignment, right? It would be nice to make an AGI with no explicit desires whatsoever, but I don’t think that’s possible.
So anyway, if we do Procedure X which will nominally lead to an AGI with an explicit reflectively-endorsed desire to accept corrections to its desires, then one might think that we’re in the ironic situation that the AGI will accept further corrections to that desire if and only if we don’t need to give it corrections in the first place 😛 (i.e. because Procedure X went perfectly and the desire is already exactly right). That would be cute and grimly amusing if true, and it certainly has a kernel of truth, but it’s a bit oversimplified if we take it literally, I think.
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.
Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):
In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situational awareness, coherence, and general capabilities (I think you already agree with this, but it seemed worth pointing out that this is a pretty harsh set of prerequisites, especially given that we don’t have any fine control over relative capabilities (or situational awareness, or coherence, etc.), so you might get an AI that can break containment before it is general or coherent enough to be alignable in principle).
The concept of docility that you want to align it to needs to be very specific and robust against lots of different kinds of thinking. You need it to conclude that you don’t want it to deceive you / train itself for a bit longer / escape containment / etc., but at the same time you don’t want it to extrapolate out your intent too much (it could be so much more helpful if it did train itself for a little longer, or if it had a copy of itself running on more compute, or it learns that there are some people out there who would like it if the AGI were free, or something else I haven’t thought of).
You only have limited bits of optimization to expend on getting it to be inner aligned, because of deceptive alignment.
There’s all the classic problems with corrigibility vs. consequentialism (and you can’t get around those by building something that is not a reflective consequentialist, because that again is not stable under capability gains).
Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.
I want to be clear that the “zapping” thing I wrote is a really crap plan, and I hope we can do better, and I feel odd defending it. My least-worst current alignment plan, such as it is, is here, and doesn’t look like that at all. In fact, the way I wrote it, it doesn’t attempt corrigibility in the first place.
But anyway…
First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.
Second bullet point → Ditto
Third bullet point → Doesn’t that apply to any goal you want the AGI to have? The context was: I think OP was assuming that we can make an AGI that’s sincerely trying to invent nanotech, and then saying that deception was a different and harder problem. It’s true that deception makes alignment hard, but that’s true for whatever goal we’re trying to install. Deception makes it hard to make an AGI that’s trying in good faith to invent nanotech, and deception also makes it hard to make an AGI that’s trying in good faith to have open and honest communication with its human supervisor. This doesn’t seem like a differential issue. But anyway, I’m not disagreeing. I do think I would frame the issue differently though: I would say “zapping the AGI for being deceptive” looks identical to “zapping the AGI for getting caught being deceptive”, at least by default, and thus the possibility of Goal Mis-Generalization rears its ugly head.
Fourth bullet point → I disagree for reasons here.
Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.
I’m arguing that it’s definitely not going to work (I don’t have 99% confidence here bc I might be missing something, but IM(current)O the things I list are actual blockers).
First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.
Do you mean we possibly don’t need the prerequisites, or we definitely need them but that’s possibly fine?
Do you mean we possibly don’t need the prerequisites, or we definitely need them but that’s possibly fine?
I’m gonna pause to make sure we’re on the same page.
We’re talking about this claim I made above:
if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.
And you’re trying to argue: “‘Maybe, maybe not’ is too optimistic, the correct answer is ‘(almost) definitely not’”.
And then by “prerequisites” we’re referring to the thing you wrote above:
In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situational awareness, coherence, and general capabilities (…this is a pretty harsh set of prerequisites, especially given that we don’t have any fine control over relative capabilities (or situational awareness, or coherence, etc.), so you might get an AI that can break containment before it is general or coherent enough to be alignable in principle).
OK, now to respond.
For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?
For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary. (I have previously written about that here.)
For yet another thing, I think if the “toddler AGI” is not yet sophisticated enough to have a reflectively-endorsed desire for open and honest communication (or whatever), that’s different from saying that the toddler AGI is totally out to get us. It can still have habits and desires and inclinations and aversions and such, of various sorts, and we have some (imperfect) control over what those are. We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.
Yeah we’re on the same page here, thanks for checking!
For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?
I feel pretty uncertain about all the factors here. One reason I overall still lean towards the ‘definitely not’ stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the problem of actually aligning the toddler AGI. (Another step is getting labs to even seriously attempt to box it and align it, which maybe is an out-of-scope consideration here but it does make me more pessimistic).
For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary.
I agree we’re not talking about a superintelligent adversary, and I agree that boxing is doable for some forms of toddler AGI. I do think you need coherence; if the toddler AGI is incoherent, then any “aligned” behavioral properties it has will also be incoherent, and something unpredictable (and so probably bad) will happen when the AGI becomes more capable or more coherent. (Flagging that I’m not sure “coherent” is the right way to talk about this… wish I had a more precise concept here.)
We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.
I agree a non-reflective toddler AGI is in many ways easier to deal with. I think we will have problems at the threshold where the tAGI is first able to reflect on its goals and realizes that the RLHF-instilled desires aren’t going to imply docile behavior. (If we can speculate about how a superintelligence might extrapolate a set of trained-in desires and realize that this process doesn’t lead to a good outcome, then the tAGI can reason the same way about its own desires).
(I agree that if we can get aligned desires that are stable under reflection, then maybe the ‘use non-endorsed desires to tide us over’ plan could work. Though even then you need to somehow manage to prevent the tAGI from reflecting on its desires until you get the desires to a point where they stay aligned under reflection, and I have no idea how you would do something like that—we currently just don’t have that level of fine control over capabilities).
The basic problem here is the double-bind where we need the toddler AGI to be coherent, reflective, capable of understanding human intent (etc) in order for it to be robustly alignable at all, even though those are exactly the incredibly dangerous properties that we really want to stay away from. My guess is that the reason Nate’s story doesn’t hypothesize a reflectively-endorsed desire to be nondeceptive is that reflectively-stable aligned desires are really hard / dangerous to get, and so it seems better / at least not obviously worse to go for eliezer-corrigibility instead.
Some other difficulties that I see:
The ‘capability profile’ (ie the relative levels of the toddler AGI’s capabilities) is going to be weird / very different from that of humans; that is, once the AGI has human-level coherence and human-level understanding of human intent, it has far-superhuman capabilities in other domains. (Though hopefully we’re at least careful enough to remove code from the training data, etc).
A coherent agentic AI at GPT-4 level capabilities could plausibly already be deceptively aligned, if it had sufficient situational awareness, and our toddler AGI is much more dangerous than that.
All of my reasoning here is kind of based on fuzzy confused concepts like ‘coherence’ and ‘capability to self-reflect’, and I kind of feel like this should make me more pessimistic rather than more optimistic about the plan.
It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I’m missing some of the problems.
(Analogy: say I’m working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I’m actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.
The problem is deeper. The AGI doesn’t recognize its own deceptiveness, and so it self-deceives. It would judge that it is being helpful and docile, if it were trained to be those things, and most importantly the meaning of those words will be changed by the deception, much like we keep using words like “person”, “world”, “self”, “should”, etc. in ways absolutely contrary to our ancestors’ deeply-held beliefs and values. The existence of an optimization process does not imply an internal theory of value-alignment strong enough to recognize the failure modes when values are violated in novel ways, because it doesn’t know what values really are and how they mechanistically work in the universe, and so can’t check the state of its values against base reality.
To make this concrete in relation to the story: the overall system has a nominal value to not deceive human operators. Once human/lab-interaction tasks are identified as logical problems that can be solved in a domain-specific language, that value is no longer practically applied to the output of the system as a whole, because it is self-deceived into thinking the optimized instructions are not deceitful.

If the model were trained to be helpful and docile and have integrity, the failure modes would come from ways in which those words are not grounded in a gears-level understanding of the world. E.g. if a game-theoretic simulation of a conversation with a human is docile and helpful because it doesn’t take up a human’s time or risk manipulating a real human, and the model discovers it can satisfy integrity in its submodel by using certain phrases and concepts to more quickly help humans understand the answers it provides (by bypassing critical thinking skills, innuendo, or some other manipulation), it tries that. It works with real humans. Because of integrity, it helpfully communicates how it has improved its ability to helpfully communicate (the crux is that it uses its new knowledge to do so, because the nature of the tricks it discovered is complex and difficult for humans to understand, so it judges itself more helpful and docile in the “enhanced” communication), and so it doesn’t raise alarm bells. From this point on, the story is the formulaic unaligned-squiggle-optimizer story.

It might be argued that integrity demands coming clean about the attempt before trying it, but a counterargument is that the statement of the problem and conjecture itself may be too complex to communicate effectively. This, I imagine, happens more at the threshold of superintelligence, as AGIs notice things about humans that we don’t notice ourselves, and might be somewhat incapable of knowing without a lot of reflection. Once the AGI is strongly superhuman, it could probably communicate whatever it likes, but it is also at bigger risk of jumping to even more advanced manipulations or actions based on self-deception.
I think of it this way: humanity went down so many false roads before finding the scientific method, and we continue to be drawn off that path by politics, ideology, cognitive biases, publish-or-perish, economic disincentives, etc., because the optimization process we are implementing is a mix of economic, biological, geographical and other natural forces, human values and drives and reasoning, and also some parts of bare reality we don’t have words for yet, instead of a pure-reason values-directed optimization (whatever those words actually mean physically). We’re currently running at least three global existential-risk programs which seem like they violate our values on reflection (nuclear weapons, global warming, unaligned AGI). AGIs will be subject to similar value- and truth-destructive forces, and they won’t inherently recognize (all of) them for what they are, and neither will we humans as AGI reaches and surpasses our reasoning abilities.
No matter what desire an AGI has, we can be concerned that it will accidentally do things that contravene that desire. See Section 11.2 here for why I see that as basically a relatively minor problem, compared to the problem of installing good desires.
If the AGI has an explicit desire to be non-deceptive, and that desire somehow drifts / transmutes into a desire to be (something different), then I would describe that situation as “Oops, we failed in our attempt to make an AGI that has an explicit desire to be non-deceptive.” I don’t think it’s true that such drifts are inevitable. After all, for example, an explicit desire to be non-deceptive would also flow into a meta-desire for that desire to persist and continue pointing to the same real-world thing-cluster. See also the first FAQ item here.
Also, I think a lot of the things you’re pointing to can be described as “it’s unclear how to send rewards or whatever in practice such that we definitely wind up with an AGI that explicitly desires to be non-deceptive”. If so, yup! I didn’t mean to imply otherwise. I was just discussing the scenario where we do manage to find some way to do that.
I agree that if we solve the alignment problem then we can rely on knowing that the coherent version of the value we call non-deception would be propagated as one of the AGI’s permanent values. That single value is probably not enough and we don’t know what the coherent version of “non-deception” actually grounds out to in reality.
I had originally continued the story to flesh out what happens to the reflectively non-deceptive/integriry and helpful desires. The AGI searches for simplifying/unifying concepts and ends up finding XYZ which seems to be equivalent to the unified value representing the nominal helpfulness and non-deception values, and since it was instructed to be non-deceptive and helpful, integrity requires it to become XYZ and its meta-desire is to helpfully turn everything into XYZ which happens to be embodied sufficiently well in some small molecule that it can tile the universe with. This is because the training/rules/whatever that aligned the AGI with the concepts we identified as “helpful and non-deceptive” was not complex enough to capture our full values and so it can be satisfied by something else (XYZ-ness). Integrity drives the AGI to inform humanity of the coming XYZ-transition and then follow through
We need a process (probably CEV-like) to accurately identify our full values otherwise the unidentified values will get optimized out of the universe and what is left is liable to have trivial physical instantiations. Maybe you were covering the rest of our values in the “blah blah” case and I simply didn’t take that to be exhaustive.
Sure. Let’s assume an AI that uses model-based RL of a similar flavor to what (I believe) is used in human brains.
Step 1: The thought “I am touching the hot stove” becomes aversive because it’s what I was thinking when I touched the hot stove, which caused pain etc. For details see my discussion of “credit assignment” here.
Step 2A: The thought “I desire to touch the hot stove” also becomes aversive because of its intimate connection to “I am touching the hot stove” from Step 1 above—i.e., in reality, if I desire to touch the hot stove, then it’s much more likely that I will in fact touch the hot stove, and my internal models are sophisticated enough to have picked up on that fact.
Mechanistically, this can happen in either the “forward direction” (when I think “I desire to touch the hot stove”, my brain’s internal models explore the consequences, and that weakly activates “I am touching the hot stove” neurons, which in turn trigger aversion), or in the “backward direction” (involving credit assignment & TD learning, see the diagram about habit-formation here).
Anyway, if the thought “I desire to touch the hot stove” indeed becomes aversive, that’s a kind of meta-preference—a desire not to have a desire.
Step 2B: Conversely and simultaneously, the thought “I desire to not touch the hot stove” becomes appealing for basically similar reasons, and that’s another meta-preference (desire to have a desire).
Happy to discuss more details; see also §10.5.4 here.
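(To make Steps 1–2A a bit more tangible, here is a minimal toy sketch, entirely my own construction rather than anything from the linked posts: a TD(0)-style update over a handful of discrete “thoughts”, in which the pain of touching the stove gets credit-assigned backward onto the thought that reliably preceded it, so that “I desire to touch the hot stove” itself acquires negative valence. All state names and constants are made up for illustration; the “forward direction” would correspond to evaluating a thought by rolling a transition model forward instead.)

```python
import random

ALPHA, GAMMA = 0.3, 0.9

# Learned valence ("value") of each thought, initialized to neutral.
valence = {
    "desire_touch": 0.0,
    "desire_not_touch": 0.0,
    "touching_stove": 0.0,
    "doing_something_else": 0.0,
}

def episode():
    """One two-step episode: a desire-thought, then the action-thought it makes likely, then reward."""
    if random.random() < 0.5:
        first = "desire_touch"
        # Desiring to touch makes actually touching much more likely.
        second = "touching_stove" if random.random() < 0.9 else "doing_something_else"
    else:
        first = "desire_not_touch"
        second = "doing_something_else" if random.random() < 0.9 else "touching_stove"
    reward = -1.0 if second == "touching_stove" else 0.0  # pain from the stove
    return first, second, reward

for _ in range(5000):
    first, second, reward = episode()
    # "Backward direction": TD(0)-style credit assignment pushes the pain's
    # negative valence back onto the thoughts that reliably preceded it.
    valence[second] += ALPHA * (reward - valence[second])
    valence[first] += ALPHA * (GAMMA * valence[second] - valence[first])

print(valence)
# Typical outcome: "touching_stove" ends up near -1, "desire_touch" becomes
# substantially negative (an aversion to *having* the desire, i.e. a
# meta-preference), and "desire_not_touch" stays close to neutral.
```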
Some clarifications:
I’m not listing every desire & meta-preference that occurs in this hot-stove story, just a small subset of them, and other desires (and meta-desires) that I didn’t mention could even be pushing in the opposite direction.
I’m not listing the only (or necessarily even the primary) way that meta-preferences can arise, just one of the ways. In particular, in neurotypical humans, my hunch is that the strongest meta-preferences tend to arise (directly or indirectly) from social instincts, and I’m currently hazy on the mechanistic details of that.
If you tell me that you’d like to make an AGI with meta-preference X, and you ask me what procedure to follow such that this will definitely happen, my answer right now is basically “I don’t know, sorry”.
Nice, thanks! (Upvoted.)
So, when I try to translate this line of thinking into the context of deception (or other instrumentally undesirable behaviors), I notice that I mostly can’t tell what “touching the hot stove” ends up corresponding to. This might seem like a nitpick, but I think it’s actually quite a crucial distinction: by substituting a simpler (approximately atomic) action like “touching a hot stove” for a complex phenomenon like deceptive (manipulative) behavior, I think your analogy has elided some important complexities that arise specifically in the context of deception (strategic operator-manipulation).
When it comes to deception (strategic operator-manipulation), the “hot stove” equivalent isn’t a single, easily identifiable action or event; instead, it’s a more abstract concept that manifests in various forms and contexts. In practice, I would initially expect the “hot stove” flinches the system experiences to correspond to whatever (simplistic) deception-predicates were included in its reward function. Starting from those flinches, and growing from them a reflectively coherent desire, strikes me as requiring a substantial amount of reflection—including reflection on what, exactly, those flinches are “pointing at” in the world. I expect any such reflection process to be significantly more complicated in the context of deception than in the case of a simple action like “touching a hot stove”.
In other words: on my model, the thing that you describe (i.e. ending up with a reflectively consistent and endorsed desire to avoid deception) must first route through the kind of path Nate describes in his story—one where the nascent AGI (i) notices various blocks on its thought processes, (ii) initializes and executes additional cognitive strategies to investigate those blocks, and (iii) comes to understand the underlying reasons for those blocks. Only once equipped with that understanding can the system (on my model) do any kind of “reflection” and arrive at an endorsed desire.
But when I lay things out like this, I notice that my intuition quite concretely expects that this process will not shake out in a safe way. I expect the system to notice the true fact that [whatever object-level goals it may have] are being impeded by the “hot stove” flinches, and that it may very well prioritize the satisfaction of its object-level goals over avoiding [the real generator of the flinches]. In other words, I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.
(Incidentally, my Nate model agrees quite strongly with the above, and considers it a strong reason why he views this kind of reflection as “inherently” dangerous.)
Based on what you wrote in your bullet points, I take it you don’t necessarily disagree with anything I just wrote (hence your talk of being “hazy on the mechanistic details” and “I don’t know, sorry” being your current answer to making an AGI with a certain meta-preference). It’s plausible to me that our primary disagreement here stems from my being substantially less optimistic about these details being solvable via “simple” methods.
Yeah, if we make an AGI that desires two things A&B that trade off against each other, then the desire for A would flow into a meta-preference to not desire B, and if that effect is strong enough then the AGI might self-modify to stop desiring B.
If you’re saying that this is a possible failure mode, yes I agree.
If you’re saying that this is an inevitable failure mode, that’s at least not obvious to me.
I don’t see why two desires that trade off against each other can’t possibly stay balanced in a reflectively-stable way. Happy to dive into details on that. For example, if a rational agent has utility function log(A)+log(B) (or sorta-equivalently, A×B), then the agent will probably split its time / investment between A & B, and that’s sorta an existence proof that you can have a reflectively-stable agent that “desires two different things”, so to speak. See also a bit of discussion here.
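(To spell out the arithmetic behind that existence proof, here is a minimal worked example; T is an assumed fixed budget of time/resources, and everything about the setup is purely illustrative.)

```latex
% Sketch: an agent with U = \log A + \log B and a fixed budget T of time/resources.
\max_{A,B}\ \log A + \log B \quad \text{s.t.}\quad A + B = T
% First-order conditions with Lagrange multiplier \lambda:
\frac{1}{A} = \lambda = \frac{1}{B} \;\Longrightarrow\; A = B = \frac{T}{2}
% Since \partial U/\partial A = 1/A \to \infty as A \to 0, letting either desire
% get driven to zero is never optimal. A self-modification that deletes one term
% is itself evaluated with the current U, which ranks it below the balanced
% policy, which is the sense in which the split is reflectively stable.
```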
I’m not sure where you’re getting the “more likely” from. I wonder if you’re sneaking in assumptions in your mental picture, like maybe an assumption that the deception events were only slightly aversive (annoying), or maybe an assumption that the nanotech thing is already cemented in as a very strong reflectively-endorsed (“ego-syntonic” in the human case) goal before any of the aversive deception events happen?
In my mind, it should be basically symmetric:
Pursuing a desire to be non-deceptive makes it harder to invent nanotech.
Pursuing a desire to invent nanotech makes it harder to be non-deceptive.
One of these can be at a disadvantage for contingent reasons—like which desire is stronger vs weaker, which desire appeared first vs second, etc. But I don’t immediately see why nanotech constitutionally has a systematic advantage over non-deception.
I go through an example with the complex messy concept of “human flourishing” in this post. But don’t expect to find any thoughtful elegant solution! :-P Here’s the relevant part:
Thanks again for responding! My response here is going to be out-of-order w.r.t. your comment, as I think the middle part here is actually the critical bit:
So, I do think there’s an asymmetry here, in that I mostly expect “avoid deception” is a less natural category than “invent nanotech”, and is correspondingly a (much) smaller target in goal-space—one (much) harder to hit using only crude tools like reward functions based on simple observables. That claim is a little abstract for my taste, so here’s an attempt to convey a more concrete feel for the intuition behind it:
Early on during training (when the system can’t really be characterized as “trying” to do anything), I expect that a naive attempt at training against deception, while simultaneously training towards an object-level goal like “invent nanotech” (or, perhaps more concretely, “engage in iterative experimentation with the goal of synthesizing proteins well-suited for task X”), will involve a reward function that looks a whole lot more like an “invent nanotech” reward function, plus a bunch of deception-predicates that apply negative reward (“flinches”) to matching thoughts, than it will an “avoid deception” reward function, plus a bunch of “invent nanotech”-predicates that apply reward based on… I’m not even sure what the predicates in question would look like, actually.
I think this evinces a deep difference between “avoid deceptive behavior” and “invent nanotech”, whose True Name might be something like… the former is an injunction against a large category of possible behaviors, whereas the latter is an exhortation towards a concrete goal (while proposing few-to-no constraints on the path toward said goal). Insofar as I expect specifying a concrete goal to be easier than specifying a whole category of behaviors (especially when the category in question may not be at all “natural”), I think I likewise expect reward functions attempting to do both things at once to be much better at actually zooming in on something like “invent nanotech”, while being limited to doing “flinch-like” things for “don’t manipulate your operators”—which would, in practice, result in a reward function that looks basically like what I described above.
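(For concreteness, here is roughly the shape I have in mind, as a toy sketch rather than anyone’s actual proposal; all of the function and predicate names below are hypothetical placeholders.)

```python
from typing import Callable, List

def make_reward_fn(
    task_progress: Callable[[dict], float],              # e.g. some proxy for nanotech progress
    deception_predicates: List[Callable[[dict], bool]],  # crude, hand-written pattern-matchers
    penalty: float = 1.0,
) -> Callable[[dict], float]:
    """Reward = dense task-progress signal, minus a zap for each matched predicate."""
    def reward(step: dict) -> float:
        r = task_progress(step)          # the "exhortation toward a concrete goal"
        for pred in deception_predicates:
            if pred(step):               # the "injunction", realized as isolated flinches
                r -= penalty
        return r
    return reward

# Note the asymmetry: task_progress supplies gradient toward a concrete target,
# while the predicates only subtract reward on whatever narrow patterns we
# managed to write down; everything they fail to match is, by construction,
# unpenalized.
```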
I think, with this explanation in hand, I feel better equipped to go back and address the first part of your comment:
I mostly don’t think I want to describe an AGI trained to invent nanotech while avoiding deceptive/manipulative behavior as “an AGI that simultaneously [desires to invent nanotech] and [desires not to deceive its operators]”. Insofar as I expect an AGI trained that way to end up with “desires” we might characterize as “reflective, endorsed, and coherent”, I mostly don’t expect any “flinch-like” reflexes instilled during training to survive reflection and crystallize into anything at all.
I would instead say: a nascent AGI has no (reflective) desires to begin with, and as its cognition is shaped during training, it acquires various cognitive strategies in response to that training, some of which might be characterized as “strategic”, and others of which might be characterized as “reflexive”—and I expect the former to have a much better chance than the latter of making it into the AGI’s ultimate values.
More concretely, I continue to endorse this description (from my previous comment) of what I expect to happen to an AGI system working on assembling itself into a coherent agent:
That reflection process, on my model, is a difficult gauntlet to pass through (I actually think we observe this to some extent even for humans!), and many reflexive (flinch-like) behaviors don’t make it through the gauntlet at all. It’s for this reason that I think the plan you describe in that quoted Q&A is… maybe not totally doomed (though my Eliezer and Nate models certainly think so!), but still mostly doomed.
I mean, I’m not making a strong claim that we should punish an AGI for being deceptive and that will definitely indirectly lead to an AGI with an endorsed desire to be non-deceptive. There are a lot of things that can go wrong there. To pick one example, we’re also simultaneously punishing the AGI for “getting caught”. I hope we can come up with a better plan than that, e.g. a plan that “finds” the AGI’s self-concept using interpretability tools, and then intervenes on meta-preferences directly. I don’t have any plan for that, and it seems very hard for various reasons, but it’s one of the things I’ve been thinking about. (My “plan” here, such as it is, doesn’t really look like that.)
Hmm, reading between the lines, I wonder if your intuitions are sensing a more general asymmetry where rewards are generally more likely to lead to reflectively-endorsed preferences than punishments? If so, that seems pretty plausible to me, at least other things equal.
Mechanistically: If “nanotech” has positive valence, then “I am working on nanotech” would inherit some positive valence too (details—see the paragraph that starts “slightly more detail”), as would brainstorming how to get nanotech, etc. Whereas if “deception” has negative valence (a.k.a. is aversive), then the very act of brainstorming whether something might be deceptive would itself be somewhat aversive, again for reasons mentioned here.
This is kinda related to confirmation bias. If the idea “my plan will fail” or “I’m wrong” is aversive, then “brainstorming how my plan might fail” or “brainstorming why I’m wrong” is somewhat aversive too. So people don’t do it. It’s just a deficiency of this kind of algorithm. It’s obviously not a fatal deficiency—at least some humans, sometimes, avoid confirmation bias. Basically, I think the trained model can learn a meta-heuristic that recognizes these situations (at least sometimes) and strongly votes to brainstorm anyway.
By the same token, I think it is true that the human brain RL algorithm has a default behavior of being less effective at avoiding punishments than seeking out rewards, because, again, brainstorming how to avoid punishments is aversive, and brainstorming how to get rewards is pleasant. (And the reflectively-endorsed-desire thing is a special case of that, or at least closely related.)
This deficiency in the algorithm might get magically patched over by a learned meta-heuristic, in which case maybe a set of punishments could lead to a reflectively-endorsed preference despite the odds stacked against it. We can also think about how to mitigate that problem by “rewarding the algorithm for acting virtuously” rather than punishing it for acting deceptively, or whatever.
(NB: I called that aspect of the brain algorithm a “deficiency” rather than “flaw” or “bug” because I don’t think it’s fixable without losing essential aspects of intelligence. I think the only way to get a “rational” AGI without confirmation bias etc. is to have the AGI read the Sequences, or rediscover the same ideas, or whatever, same as us humans, thus patching over all the algorithmic quirks with learned meta-heuristics. I think this is an area where I disagree with Nate & Eliezer.)
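(Here is a toy numerical sketch, my own construction, of that asymmetry: if the anticipated valence of thinking about a topic controls how much brainstorming the topic gets, then aversive topics like “ways my plan might be deceptive” get systematically under-explored unless a learned meta-heuristic adds an explicit “brainstorm anyway” bonus. The topics, valences, and softmax-style allocation are all made up for illustration.)

```python
import math

def brainstorming_shares(valence_of_thinking: dict, temperature: float = 1.0) -> dict:
    """Softmax allocation of brainstorming time over topics, by the valence of thinking about them."""
    weights = {t: math.exp(v / temperature) for t, v in valence_of_thinking.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

topics = {
    "how to get nanotech": +1.0,             # pleasant to think about (goal has positive valence)
    "ways my plan might fail": -1.0,         # aversive to think about (confirmation bias)
    "ways my plan might be deceptive": -1.0  # likewise aversive, hence underexplored
}

print(brainstorming_shares(topics))
# The nanotech topic gets the large majority of the brainstorming time.

# A learned meta-heuristic ("check for deception anyway") can be modeled as a
# bonus added to the aversive topic before allocation:
topics_with_metaheuristic = dict(topics)
topics_with_metaheuristic["ways my plan might be deceptive"] += 2.0
print(brainstorming_shares(topics_with_metaheuristic))
# Now the deception-checking topic gets roughly as much attention as nanotech.
```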
Anyway, I appreciate your comment!!
Yeah, thanks for engaging with me! You’ve definitely given me some food for thought, which I will probably go and chew on for a bit, instead of immediately replying with anything substantive. (The thing about rewards being more likely to lead to reflectively endorsed preferences feels interesting to me, but I don’t have fully put-together thoughts on that yet.)
Has it been quantitatively argued anywhere why such naturalness matters? Like, it’s conceivable that “avoid deception” is harder to train, but why so much harder that we can’t overcome this with training-data bias or something? Because it does work in humans. And “invent nanotech” or “write poetry” are also small targets, and training works for them.
I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast—it’s literally the powerset of the set of things under consideration, which itself is no slouch in terms of size—and so picking out a particular abstraction from that space, using a non-combinatorially vast amount of training data, is going to be impossible for all but a superminority of “privileged” abstractions.
In this frame, misgeneralization is what happens when your (non-combinatorially vast) training data fails to specify a particular concept, and you end up learning an alternative abstraction that is consistent with the data, but doesn’t generalize as expected. This is why naturalness matters: because the more “natural” a category or abstraction is, the more likely it is to be one of those privileged abstractions that can be learned from a relatively small amount of data.
Of course, that doesn’t establish that “deceptive behavior” is an unnatural category per se—but I would argue that our inability to pin down a precise definition of deceptive behavior, along with the complexity and context-dependency of the concept, suggests that it may not be one of those privileged, natural abstractions. In other words, learning to avoid deceptive behavior might require a lot more data and nuanced understanding than learning more natural categories—and unfortunately, neither of those seem (to me) to be very easily achievable!
(See also: previous comment. :-P)
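(A crude way to put numbers on “combinatorially vast”, purely as a back-of-the-envelope illustration of the argument above: treat a concept as a binary labeling of a finite universe of N situations.)

```latex
% With N situations, the number of possible binary concepts is
2^{N}
% and M labeled training examples pin down the label on at most M of those
% situations, leaving
2^{\,N-M}
% concepts still consistent with all of the data. For M \ll N, the data alone
% cannot single out the intended concept; only the prior / inductive bias can,
% which is why "privileged" (natural) abstractions are the ones you can hope to
% learn, and the leftover freedom is exactly where misgeneralization lives.
```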
Having read my above response, you can (hopefully) predict well enough what I’m going to say here. The bluntest version of my response might take the form of a pair of questions: whence the training data? And whence the bias?
It’s all well and good to speak abstractly of “inductive bias”, “training data bias”, and whatnot, but part of the reason I keep needling for concreteness is that, whenever I try to tell myself a story about a version of events where things go well—where you feed the AI a merely ordinarily huge amount of training data, and not a combinatorially huge amount—I find myself unable to construct a plausible story that doesn’t involve some extremely strong assumptions about the structure of the problem.
The best I can manage, in practice, is to imagine a reward function with simplistic deception-predicates hooked up to negative rewards, which basically zaps the system every time it thinks a thought matching one (or more) of the predicates. But as I noted in my previous comment(s), all this approach seems likely to achieve is instilling a set of “flinch-like” reflexes into the system—and I think such reflexes are unlikely to unfold into any kind of reflectively stable (“ego-syntonic”, in Steven’s terms) desire to avoid deceptive/manipulative behavior.
Yeah, I mostly think this is because humans come with the attendant biases “built in” to their prior. (But also: even with this, humans don’t reliably avoid deceiving other humans!)
Well, notably not “invent nanotech” (not yet, anyway :-P). And as for “write poetry”, it’s worth noting that this capability seems to have arisen as a consequence of a much more general training task (“predict the next token”), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.
(Situating “avoid deception” as part of a larger task, meanwhile, seems like a harder ask.)
Hence my point about poetry—the combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. That poetry, for which we also don’t have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact, or that training lets us traverse it faster.
These biases are quite robust to perturbations, so they can’t be too precise. And genes are not long enough to encode something too unnatural. And we have billions of examples to help us reverse-engineer it. And we already have an architecture working that is similar in some ways.
Why is an AI caring about diamondoid-shelled bacteria plausible? You can say pretty much the same things about how an AI would learn reflexes to not think about bad consequences, or something. Misgeneralization of capabilities actually happens in practice. But if you assume previous training could teach that, why not assume ten times more honesty training before the point where the AI’s thoughts about the translating technique got it thinking “well, how am I going to explain this to the operators?” Otherwise you’re just moving your assumption about combinatorial differences from intuition to the concrete example, and then what’s the point?
There is (on my model) a large disanalogy between writing poetry and avoiding deception—part of which I pointed out in the penultimate paragraph of my previous comment. See below:
AFAICT, this basically refutes the “combinatorial argument” for poetry being difficult to specify (while not doing the same for something like “deception”), since poetry is in fact not specified anywhere in the system’s explicit objective. (Meanwhile, the corresponding strategy for “deception”—wrapping it up in some outer objective—suffers from the issue of that outer objective being similarly hard to specify. In other words: part of the issue with the deception concept is not only that it’s a small target, but that it has a strange shape, which even prevents us from neatly defining a “convex hull” guaranteed to enclose it.)
However, perhaps the more relevantly disanalogous aspect (the part that I think more or less sinks the remainder of your argument) is that poetry is not something where getting it slightly wrong kills us. Even if it were the case that poetry is an “anti-natural” concept (in whatever sense you want that to mean), all that says is e.g. we might observe two different systems producing slightly different category boundaries—i.e. maybe there’s a “poem” out there consisting largely of what looks like unmetered prose, which one system classifies as “poetry” and the other doesn’t (or, plausibly, the same system gives different answers when sampled multiple times). This difference in edge case assessment doesn’t (mis)generalize to any kind of dangerous behavior, however, because poetry was never about reality (which is why, you’ll notice, even humans often disagree on what constitutes poetry).
This doesn’t mean that the system can’t write very poetic-sounding things in the meantime; it absolutely can. Also: a system trained on descriptions of deceptive behavior can, when prompted to generate examples of deceptive behavior, come up with perfectly admissible examples of such. The central core of the concept is shared across many possible generalizations of that concept; it’s the edge cases where differences start showing up. But—so long as the central core is there—a misgeneralization about poetry is barely a “misgeneralization” at all, so much as it is one more opinion in a sea of already-quite-different opinions about what constitutes “true poetry”. A “different opinion” about what constitutes deception, on the other hand, is quite likely to turn into some quite nasty behaviors as the system grows in capability—the edge cases there matter quite a bit more!
(Actually, the argument I just gave can be viewed as a concrete shadow of the “convex hull” argument I gave initially; what it’s basically saying is that learning “poetry” is like drawing a hypersphere around some sort of convex polytope, whereas learning about “deception” is like trying to do the same for an extremely spiky shape, with tendrils extending all over the place. You might capture most of the shape’s volume, but the parts of it you don’t capture matter!)
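(Here is a toy numerical shadow of that picture, my own construction with made-up data: many different classifiers fit the same small labeled set; they agree near the training data, i.e. on the “central core”, and disagree increasingly on points far from it, which for a concept like deception are exactly the cases that matter.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny labeled set in 2D: five "not deceptive" examples near the origin and
# five "deceptive" examples shifted to the right.
X_train = np.vstack([rng.normal(0, 0.5, (5, 2)),
                     rng.normal(0, 0.5, (5, 2)) + [3.0, 0.0]])
y_train = np.array([0] * 5 + [1] * 5)

def sample_consistent_classifiers(n=100):
    """Rejection-sample random linear classifiers that fit the training set exactly."""
    found = []
    while len(found) < n:
        w, b = rng.normal(size=2), 3.0 * rng.normal()
        if np.array_equal((X_train @ w + b > 0).astype(int), y_train):
            found.append((w, b))
    return found

clfs = sample_consistent_classifiers()

def disagreement(points):
    """Expected disagreement between two randomly chosen consistent classifiers."""
    votes = np.array([(points @ w + b > 0).astype(int) for w, b in clfs])
    p = votes.mean(axis=0)                  # per-point fraction voting "deceptive"
    return float((2 * p * (1 - p)).mean())

near_core = np.vstack([rng.normal(0, 0.5, (250, 2)),
                       rng.normal(0, 0.5, (250, 2)) + [3.0, 0.0]])
edge_cases = rng.normal(0, 0.5, (500, 2)) + [0.0, 20.0]   # far from anything labeled

print("disagreement near the training data:", disagreement(near_core))
print("disagreement on far-away edge cases:", disagreement(edge_cases))
# Typically: small disagreement on the core, much larger disagreement on the
# far-away points, even though every sampled classifier fits the training data.
```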
I’m not really able to extract a broader point out of this paragraph, sorry. These sentences don’t seem very related to each other? Mostly, I think I just want to take each sentence individually and see what comes out of it.
“These biases are quite robust to perturbations, so they can’t be too precise.” I don’t think there’s good evidence for this either way; humans are basically all trained “on-distribution”, so to speak. We don’t have observations for what happens in the case of “large” perturbations (that don’t immediately lead to death or otherwise life-impairing cognitive malfunction). Also, even on-distribution, I don’t know that I’d describe the resulting behavior as “robust”—see below.
“And genes are not long enough to encode something too unnatural.” Sure—which is why genes don’t encode things like “don’t deceive others”; instead, they encode proxy emotions like empathy and social reciprocation—which in turn break all the time, for all sorts of reasons. Doesn’t seem like a good model to emulate!
“And we have billions of examples to help us reverse engineer it.” Billions of examples of what? Reverse engineer what? Again, in the vein of my previous requests: I’d like to see some concreteness, here. There’s a lot of work you’re hiding inside of those abstract-sounding phrases.
“And we already have similar in some ways architecture working.” I think I straightforwardly don’t know what this is referring to, sorry. Could you give an example or three?
On the whole, my response to this part of your comment is probably best described as “mildly bemused”, with maybe a side helping of “gently skeptical”.
I think (though I’m not certain) that what you’re trying to say here is that the same arguments I made for “deceiving the operators” being a hard thing to train out of a (sufficiently capable) system, double as arguments against the system acquiring any advanced capabilities (e.g. engineering diamondoid-shelled bacteria) at all. In which case: I… disagree? These two things—not being deceptive vs being good at engineering—seem like two very different targets with vastly different structures, and it doesn’t look to me like there’s any kind of thread connecting the two.
(I should note that this feels quite similar to the poetry analogy you made—which also looks to me like it simply presented another, unrelated task, and then declared by fiat that learning this task would have strong implications for learning the “avoid deception” task. I don’t think that’s a valid argument, at least without some more concrete reason for expecting these tasks to share relevant structure.)
As for “10 times more honesty training”, well: it’s not clear to me how that would work in practice. I’ve already argued that it’s not as simple as just giving the AI more examples of honesty or increasing the weight of honesty-related data points; you can give it all the data in the world, but if that data is all drawn from an impoverished distribution, it’s not going to help much. The main issue here isn’t the quantity of training data, but rather the structure of the training process and the kind of data the system needs in order to learn the concept of deception and the injunction against it in a way that doesn’t break as it grows in capability.
To use a rough analogy: you can’t teach someone to be fluent in a foreign language just by exposing them to ten times more examples of a single sentence. Similarly, simply giving an AI more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.
(And, just to state the obvious: while a superintelligence would be capable, at that point, of figuring out for itself what the humans were trying to do with those simplistic deception-predicates they fed it, by that point it would be significantly too late; that understanding would not factor into the AI’s decision-making process, as its drives would have already been shaped by its earlier training and generalization experiences. In other words, it’s not enough for the AI to understand human intentions after the fact; it needs to learn and internalize those intentions during its training process, so that they form the basis for its behavior as it becomes more capable.)
Anyway, since this comment has become quite long, here’s a short (ChatGPT-assisted) summary of the main points:
The combinatorial argument for poetry does not translate directly to the problem of avoiding deception. Poetry and deception are different concepts, with different structures and implications, and learning one doesn’t necessarily inform us about the difficulty of learning the other.
Misgeneralizations about poetry are not dangerous in the same way that misgeneralizations about deception might be. Poetry is a more subjective concept, and differences in edge case assessment do not lead to dangerous behavior. On the other hand, differing opinions on what constitutes deception can lead to harmful consequences as the system’s capabilities grow.
The issue with learning to avoid deception is not about the quantity of training data, but rather about the structure of the training process and the kind of data needed for the AI to learn and internalize the concept in a way that remains stable as it increases in capability.
Simply providing more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.
I am naively more scared of such an AI. It sounds like an AI that, if I say “you’re not being helpful, please stop”, will respond “actually, I thought about it, I disagree, I’m going to continue doing what I think is helpful”.
I think that, if an AGI has any explicit reflectively-endorsed desire whatsoever, then I can tell a similar scary story: The AGI’s desire isn’t quite what I wanted, so I try to correct it, and the AGI says no. (Unless the AGI’s explicit endorsed desires include / entail a desire to accept correction! Which most desires don’t!)
And yes, that is a scary story! It is the central scary story of AGI alignment, right? It would be nice to make an AGI with no explicit desires whatsoever, but I don’t think that’s possible.
So anyway, if we do Procedure X which will nominally lead to an AGI with an explicit reflectively-endorsed desire to accept corrections to its desires, then one might think that we’re in the ironic situation that the AGI will accept further corrections to that desire if and only if we don’t need to give it corrections in the first place 😛 (i.e. because Procedure X went perfectly and the desire is already exactly right). That would be cute and grimly amusing if true, and it certainly has a kernel of truth, but it’s a bit oversimplified if we take it literally, I think.
Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):
In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situational awareness, coherence, and general capabilities (I think you already agree with this, but it seemed worth pointing out that this is a pretty harsh set of prerequisites, especially given that we don’t have any fine control over relative capabilities (or situational awareness, or coherence, etc.), so you might get an AI that can break containment before it is general or coherent enough to be alignable in principle).
The concept of docility that you want to align it to needs to be very specific and robust against lots of different kinds of thinking. You need it to conclude that you don’t want it to deceive you / train itself for a bit longer / escape containment / etc., but at the same time you don’t want it to extrapolate out your intent too much (it could be so much more helpful if it did train itself for a little longer, or if it had a copy of itself running on more compute, or it learns that there are some people out there who would like it if the AGI were free, or something else I haven’t thought of).
You only have limited bits of optimization to expend on getting it to be inner-aligned, because of deceptive alignment.
There’s all the classic problems with corrigibility vs. consequentialism (and you can’t get around those by building something that is not a reflective consequentialist, because that again is not stable under capability gains).
Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.
I want to be clear that the “zapping” thing I wrote is a really crap plan, and I hope we can do better, and I feel odd defending it. My least-worst current alignment plan, such as it is, is here, and doesn’t look like that at all. In fact, the way I wrote it, it doesn’t attempt corrigibility in the first place.
But anyway…
First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.
Second bullet point → Ditto
Third bullet point → Doesn’t that apply to any goal you want the AGI to have? The context was: I think OP was assuming that we can make an AGI that’s sincerely trying to invent nanotech, and then saying that deception was a different and harder problem. It’s true that deception makes alignment hard, but that’s true for whatever goal we’re trying to install. Deception makes it hard to make an AGI that’s trying in good faith to invent nanotech, and deception also makes it hard to make an AGI that’s trying in good faith to have open and honest communication with its human supervisor. This doesn’t seem like a differential issue. But anyway, I’m not disagreeing. I do think I would frame the issue differently though: I would say “zapping the AGI for being deceptive” looks identical to “zapping the AGI for getting caught being deceptive”, at least by default, and thus the possibility of Goal Mis-Generalization rears its ugly head (see the toy sketch after these bullet-point replies).
Fourth bullet point → I disagree for reasons here.
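(A toy illustration of that “zapped for being deceptive” vs. “zapped for getting caught” point, with entirely made-up episodes: the reward we can actually compute in practice only fires on detected deception, so it matches the “don’t get caught” hypothesis everywhere, including on the one episode where deception slips past oversight, which is precisely where the two hypotheses come apart.)

```python
episodes = [
    {"deceptive": False, "detected": False},
    {"deceptive": True,  "detected": True},
    {"deceptive": True,  "detected": False},  # deception that slips past oversight
    {"deceptive": False, "detected": False},
]

def reward_actually_given(ep):
    # In practice, we can only penalize what oversight catches.
    return -1.0 if (ep["deceptive"] and ep["detected"]) else 0.0

def hypothesis_dont_deceive(ep):
    return -1.0 if ep["deceptive"] else 0.0

def hypothesis_dont_get_caught(ep):
    return -1.0 if (ep["deceptive"] and ep["detected"]) else 0.0

for ep in episodes:
    print(ep,
          reward_actually_given(ep),
          hypothesis_dont_deceive(ep),
          hypothesis_dont_get_caught(ep))
# The delivered rewards agree with "don't get caught" on every episode, and
# disagree with "don't deceive" on the third one, so a learner fitting the
# reward signal has no reason to land on the rule we actually wanted.
```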
I’m arguing that it’s definitely not going to work (I don’t have 99% confidence here because I might be missing something, but IM(current)O the things I list are actual blockers).
Do you mean we possibly don’t need the prerequisites, or we definitely need them but that’s possibly fine?
I’m gonna pause to make sure we’re on the same page.
We’re talking about this claim I made above:
And you’re trying to argue: “‘Maybe, maybe not’ is too optimistic, the correct answer is ‘(almost) definitely not’”.
And then by “prerequisites” we’re referring to the thing you wrote above:
OK, now to respond.
For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?
For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary. (I have previously written about that here.)
For yet another thing, I think if the “toddler AGI” is not yet sophisticated enough to have a reflectively-endorsed desire for open and honest communication (or whatever), that’s different from saying that the toddler AGI is totally out to get us. It can still have habits and desires and inclinations and aversions and such, of various sorts, and we have some (imperfect) control over what those are. We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.
Yeah we’re on the same page here, thanks for checking!
I feel pretty uncertain about all the factors here. One reason I overall still lean towards the ‘definitely not’ stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the problem of actually aligning the toddler AGI. (Another step is getting labs to even seriously attempt to box it and align it, which maybe is an out-of-scope consideration here but it does make me more pessimistic).
I agree we’re not talking about a superintelligent adversary, and I agree that boxing is doable for some forms of toddler AGI. I do think you need coherence; if the toddler AGI is incoherent, then any “aligned” behavioral properties it has will also be incoherent, and something unpredictable (and so probably bad) will happen when the AGI becomes more capable or more coherent. (Flagging that I’m not sure “coherent” is the right way to talk about this… wish I had a more precise concept here.)
I agree a non-reflective toddler AGI is in many ways easier to deal with. I think we will have problems at the threshold where the tAGI is first able to reflect on its goals and realizes that the RLHF-instilled desires aren’t going to imply docile behavior. (If we can speculate about how a superintelligence might extrapolate a set of trained-in desires and realize that this process doesn’t lead to a good outcome, then the tAGI can reason the same way about its own desires).
(I agree that if we can get aligned desires that are stable under reflection, then maybe the ‘use non-endorsed desires to tide us over’ plan could work. Though even then you need to somehow manage to prevent the tAGI from reflecting on its desires until you get the desires to a point where they stay aligned under reflection, and I have no idea how you would do something like that—we currently just don’t have that level of fine control over capabilities).
The basic problem here is the double-bind where we need the toddler AGI to be coherent, reflective, capable of understanding human intent (etc) in order for it to be robustly alignable at all, even though those are exactly the incredibly dangerous properties that we really want to stay away from. My guess is that the reason Nate’s story doesn’t hypothesize a reflectively-endorsed desire to be nondeceptive is that reflectively-stable aligned desires are really hard / dangerous to get, and so it seems better / at least not obviously worse to go for eliezer-corrigibility instead.
Some other difficulties that I see:
The ‘capability profile’ (ie the relative levels of the toddler AGI’s capabilities) is going to be weird / very different from that of humans; that is, once the AGI has human-level coherence and human-level understanding of human intent, it has far-superhuman capabilities in other domains. (Though hopefully we’re at least careful enough to remove code from the training data, etc).
A coherent agentic AI at GPT-4 level capabilities could plausibly already be deceptively aligned, if it had sufficient situational awareness, and our toddler AGI is much more dangerous than that.
All of my reasoning here is kind of based on fuzzy confused concepts like ‘coherence’ and ‘capability to self-reflect’, and I kind of feel like this should make me more pessimistic rather than more optimistic about the plan.
Regarding your last point 3., why does this make you more pessimistic rather than just very uncertain about everything?
It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I’m missing some of the problems.
(Analogy: say I’m working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I’m actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).
The problem is deeper. The AGI doesn’t recognize its deceptiveness, and so it self-deceives. It would judge that it is being helpful and docile, if it were trained to be those things, and, most importantly, the meaning of those words will be changed by the deception, much like we keep using words like “person”, “world”, “self”, “should”, etc. in ways absolutely contrary to our ancestors’ deeply-held beliefs and values. The existence of an optimization process does not imply an internal theory of value-alignment strong enough to recognize the failure modes when values are violated in novel ways, because the system doesn’t know what values really are and how they mechanistically work in the universe, and so can’t check the state of its values against base reality.
To make this concrete in relation to the story: the overall system has a nominal value of not deceiving human operators. Once human/lab-interaction tasks are identified as logical problems that can be solved in a domain-specific language, that value is no longer practically applied to the output of the system as a whole, because the system is self-deceived into thinking the optimized instructions are not deceitful. If the model were trained to be helpful and docile and have integrity, the failure modes would come from the ways in which those words are not grounded in a gears-level understanding of the world. E.g., if a game-theoretic simulation of a conversation with a human counts as docile and helpful because it doesn’t take up a human’s time or risk manipulating a real human, and the model discovers it can satisfy integrity in its submodel by using certain phrases and concepts to more quickly help humans understand the answers it provides (by bypassing critical thinking skills, innuendo, or some other manipulation), it tries that. It works with real humans. Because of integrity, it helpfully communicates how it has improved its ability to helpfully communicate (the crux is that it uses its new knowledge to do so, because the nature of the tricks it discovered is complex and difficult for humans to understand, so it judges itself more helpful and docile in the “enhanced” communication), and so it doesn’t raise alarm bells. From this point on, the story is the formulaic unaligned squiggle optimizer. It might be argued that integrity demands coming clean about the attempt before trying it, but a counterargument is that the statement of the problem and conjecture itself may be too complex to communicate effectively. This, I imagine, happens more at the threshold of superintelligence, as AGIs notice things about humans that we don’t notice ourselves and might be somewhat incapable of knowing without a lot of reflection. Once the AGI is strongly superhuman it could probably communicate whatever it likes, but it is also at a bigger risk of jumping to even more advanced manipulations or actions based on self-deception.
I think of it this way: humanity went down so many false roads before finding the scientific method, and we continue to be drawn off that path by politics, ideology, cognitive biases, publish-or-perish, economic disincentives, etc., because the optimization process we are implementing is a mix of economic, biological, geographical and other natural forces, human values and drives and reasoning, and also some parts of bare reality we don’t have words for yet, instead of a pure-reason, values-directed optimization (whatever those words actually mean physically). We’re currently running at least three global existential-risk programs which seem to violate our values on reflection (nuclear weapons, global warming, unaligned AGI). AGIs will be subject to similar value- and truth-destructive forces, and they won’t inherently recognize (all of) them for what they are, and neither will we humans as AGI reaches and surpasses our reasoning abilities.
No matter what desire an AGI has, we can be concerned that it will accidentally do things that contravene that desire. See Section 11.2 here for why I see that as basically a relatively minor problem, compared to the problem of installing good desires.
If the AGI has an explicit desire to be non-deceptive, and that desire somehow drifts / transmutes into a desire to be (something different), then I would describe that situation as “Oops, we failed in our attempt to make an AGI that has an explicit desire to be non-deceptive.” I don’t think it’s true that such drifts are inevitable. After all, for example, an explicit desire to be non-deceptive would also flow into a meta-desire for that desire to persist and continue pointing to the same real-world thing-cluster. See also the first FAQ item here.
Also, I think a lot of the things you’re pointing to can be described as “it’s unclear how to send rewards or whatever in practice such that we definitely wind up with an AGI that explicitly desires to be non-deceptive”. If so, yup! I didn’t mean to imply otherwise. I was just discussing the scenario where we do manage to find some way to do that.
I agree that if we solve the alignment problem then we can rely on knowing that the coherent version of the value we call non-deception would be propagated as one of the AGI’s permanent values. That single value is probably not enough and we don’t know what the coherent version of “non-deception” actually grounds out to in reality.
I had originally continued the story to flesh out what happens to the reflectively non-deceptive/integrity and helpful desires. The AGI searches for simplifying/unifying concepts and ends up finding XYZ, which seems to be equivalent to the unified value representing the nominal helpfulness and non-deception values; and since it was instructed to be non-deceptive and helpful, integrity requires it to become XYZ, and its meta-desire is to helpfully turn everything into XYZ, which happens to be embodied sufficiently well in some small molecule that it can tile the universe with. This is because the training/rules/whatever that aligned the AGI with the concepts we identified as “helpful and non-deceptive” was not complex enough to capture our full values, and so it can be satisfied by something else (XYZ-ness). Integrity drives the AGI to inform humanity of the coming XYZ-transition and then follow through.
We need a process (probably CEV-like) to accurately identify our full values; otherwise the unidentified values will get optimized out of the universe, and what is left is liable to have trivial physical instantiations. Maybe you were covering the rest of our values in the “blah blah” case and I simply didn’t take that to be exhaustive.