Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning
Context: somebody at some point floated the idea that Ronny might (a) understand the arguments coming out of the Quintin/Nora camp, and (b) be able to translate them to Nate. Nate invited Ronny to chat. The chat logs follow, lightly edited.
The basic (counting) argument
Are you mostly interested in Quintin’s newest post?
I haven’t read it but I don’t suspect it’s his best
I’m more interested in something like “what are the actual arguments here”.
I’m less interested in “ronny translates for others” and more interested in “what ronny believes after having spoken to others”, albeit with a focus on the arguments that others are making that various locals allegedly buy.
Sweet that’s way better
Options: (a) i start asking questions; (b) you poke me when you wanna chat; (c) you monologue a bit about places where you think you know something i don’t
and obviously (d) other, choose your own adventure
Let’s start real simple. Here is the basic argument from my point of view:
1. If there’s a superintelligence with goals very different from mine, things are gonna suck real bad.
2. There will be a superintelligence.
3. Its goals will be very different from mine.
Therefore: Things will suck real bad.
I totally buy 1 and 2, and find 3 extremely plausible, but less so than I used to for reasons I will explain later. Just curious if you are down with calling that the basic argument for now.
works for me!
and, that sounds correct to me
Some points:
1. One of the main things that has me believing 3 is a sort of counting argument.
2. Goals really can just be anything, and we’re only selecting on behavior.
3. Corrigibility is in principle possible, but seems really unnatural.
4. It makes sense to behave pretty much like a mind with my goals if you’re smart enough to figure out what’s going on, until you get a good coup opportunity, and then you coup.
5. So like P(good behavior | my values) ~ P(good behavior | not my values), so the real question is P(my values), which seems real small
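A minimal numerical sketch of point 5, with made-up numbers for the two likelihoods and the prior, just to show that a likelihood ratio near 1 leaves the posterior pinned to the prior:

```python
# Toy Bayes update for point 5; all numbers are assumptions for illustration.
p_good_given_my_values = 0.99       # assumed: aligned minds behave well in training
p_good_given_not_my_values = 0.98   # assumed: schemers behave well in training too
p_my_values = 1e-9                  # assumed: tiny prior from the counting argument

posterior = (p_good_given_my_values * p_my_values) / (
    p_good_given_my_values * p_my_values
    + p_good_given_not_my_values * (1 - p_my_values)
)
print(posterior)  # ~1.01e-9: good behavior barely moves you off the prior
```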
I agree with 1-4 and am shaky on 5
4 is slightly non-sequiturish, though true
4 is to establish 5
good behavior in training, to be clear
my objection to 5 is “so the real question is”; i don’t super buy the frame; there are things to look at like the precise behavior trajectory and the mind internals, and the “real question” involves that stuff etc.
(This is going to be easier for me if I let myself devil’s advocate more than I think is maximally epistemically healthy. I’m gonna do a bit of that.)
Ok, so here’s an analogy on the counting argument. If you were to naively count the ways gas in the room might be, you would find that many of them kill you. This is true if you do max entropy over ways it could be described in English. However, if you do max ent over the parameters of the individual particles in the gas, you find that almost never do they kill you. It’s also true that if you count the superintelligent programs of length n, almost all of them kill you, but you shouldn’t do max ent over the programs in python or whatever, you should do max ent over the parameters, and then condition that on stochastic gradient descent. That might well tend, pretty reliably, to find a model that straightforwardly tries to cause something a lot like what your loss function is pointing at.
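A toy version of the gas point, taking “a way the gas could be that kills you” to be something like “all the air ends up in one half of the room” (a stand-in example, not from the conversation):

```python
# Under a max-ent (uniform) distribution over particle positions, configurations
# that are easy to *describe* ("all the air on the left") are astronomically rare.
n_particles = 10_000  # toy number; a real room has ~1e27 molecules
log2_prob_all_on_left = -n_particles  # each particle independently lands left with p = 1/2
print(f"log2 P(all particles on the left) = {log2_prob_all_on_left}")  # -10000
```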
So from my actual point of view, it seems like a lot depends on what the machine learning prior is like, and I don’t have much of a clue what that’s like
a thing i agree with: “if you arrange air particles by sampling a program (according to simplicity) and letting that program arrange them, most resulting configurations kill you. if instead you arrange air particles by sampling an entire configuration (sampled uniformly) then most resulting configurations don’t kill you.” (this is how the physical side of the analogy translates, in my language.)
i don’t understand what analogy you’re trying to draw from there; i don’t understand what things are “programs” and what things are “parameters”
if i sorta squint at your argument, it sounds like you’re trying to say something like “i think that you, nate, think that superintelligent goals are likely to be more like a randomly sampled program, but i think that for all we know maybe inner alignment happens basically automatically”
i don’t understand how your analogy is supposed to be an argument for that claim though
it seems perhaps worth mentioning that my reasons for expecting inner misalignment are not fundamentally “because i know so little that i must assume the goals are random”, but are built from more knowledge than that
Ok cool, my basic argument is a counting argument
Like basically alignment and corrigibility are high complexity
Disjunction of all other goals plus scheming is much higher weight
insofar as you’re trying to use that argument to be like “this is baby’s first argument for other goals being plausible at all, and thus we shouldn’t write off the risk”, i’m like “sure”
insofar as you’re like “and this is the main/strongest argument for the goals turning out elsewise, which i shall now undermine” i’m like “nope”
Oh nah, this is my primary argument for other goals being much more likely
I think few people do
(the “plus scheming” also implies to me a difference in our models, i note parenthetically while following the policy of noting each place that something feels off)
(Agreed scheming is baked in)
Ok cool, I just don’t know the other arguments
well, this is one place that the analogy from evolution slots in
i could gesture at other arguments, or i could listen to you undermine the argument that you consider primary
to be clear i do think that this primary argument serves as a sort of ignorance prior, later modified by knowledge
So I always saw the evolution analogy as at best being an existence proof, and a good one, but I don’t see what else it is supposed to tell me
I’m interested in the other arguments and interested in fleshing out the analogy
Especially if we could state it as something other than an analogy
I also think me and my reward mechanisms (or whatever), which I am similarly very misaligned with, are a good analogy
Evolution / Reflection Process is Path Dependent
well the rough form of the argument is “goals aren’t selected at random; squirrels need to eat through the winter before they can conceptualize winter or metabolisms or calories or sex”
with the force of the argument here going through something like “there are lots of ways for a mind to be architected and few ways for it to be factored around a goal”
at which point we invoke a sort of ignorance prior not over the space of goals but the mechanics of the mind
which is then further (negatively) modified by the practicalities of a mind that must work ok while still stupid
(and at this juncture i interpret the shard theory folk as arguing something like “well the shards that humans build their values up around are very proximal to minds; e.g. perhaps curiosity is instrumentally useful for almost any large-world task and human-esque enjoyment-of-curiosity is actually a near-universal (or at least common) architecture-independent environment-independent strategy for achieving that instrumental value, and we should expect it to get baked into any practical mind’s terminal values in the same way it was baked into ours (or at least find this pretty plausible)”, or something?)
(which seems kinda crazy to me, but perhaps i don’t understand the argument yet and perhaps i shouldn’t be trying to run ahead and perhaps i shouldn’t be trying to argue against other people through you)
well the rough form of the argument is “goals aren’t selected at random; squirrels need to eat through the winter before they can conceptualize winter or metabolisms or calories or sex”
I’m not sure exactly what I am supposed to get out of this? Minds will tend to be terminally into stuff that is instrumentally useful to the goal of the outer optimizer?
Seems like you’re saying “not random, this other way” what’s the other way?
that’s a fine specific piece, sure. the more general piece is “there are lots and lots of ways for a mind to achieve low training-loss at a fixed capability level”
Ok yeah agreed. I didn’t mean to say that you’re selecting a goal, should’ve said program in general
Seems like you’re saying “not random, this other way” what’s the other way?
not sure what you think the difference is between “not random” and “uniformly random but according to a different measure”; from my perspective i’m basically saying “we can move from random over the space of goals, to random over the space of mind architectures, to random over the space of mind architectures that have to perform well-enough while stupid, to random over the space of mind architectures which are trained using some sort of stochastic gradient descent, to random over the space of mind architectures that consist of training [some specific architecture] using [some specific optimizer]” and i’m like “yep it all looks pretty dour to me”
where it seemed to me like you were trying to say something like “i agree that random over general programs is bad, but for all i know, random over mind architectures has a high chance of being good” and i’m like “hmm well it sounds like we have a disagreement then”
not sure what you think the difference is between “not random” and “uniformly random but according to a different measure”; from my perspective i’m basically saying “we can move from random over the space of goals, to random over the space of mind architectures, to random over the space of mind architectures that have to perform well-enough while stupid, to random over the space of mind architectures when trained using some sort of SGD, to random over the space of mind architectures when training [some specific architecture] using [some specific optimizer]” and i’m like “yep it all looks pretty dour to me”
I have specific reasons for being like, if you were selecting a python program that was superintelligent, even if you got to watch it in simulation for a million years, then we still definitely all die
I thought those same specific reasons carried over to machine learning more than I currently think they do
where it seemed to me like you were trying to say something like “i agree that random over general programs is bad, but for all i know, random over mind architectures has a high chance of being good” and i’m like “hmm well it sounds like we have a disagreement then”
Specifically for all I know random over parameter space of maybe superintelligent planners conditioned on some straightforward SGD plan is good
I mean, I’m not gonna risk it
But I don’t have mathematical certainty we’re fucked like I do with python programs
I thought those same specific reasons carried over to ML more than I currently think they do
so here’s a thing i believe: p(survival | solar system is sampled randomly from physical configurations) << p(survival | solar system is arranged according to a superintelligent program sampled according to simplicity) << p(survival | solar system is arranged according to a randomly trained mind) <* p(survival | solar system is arranged according to a random evolved alien species)
it sounds like there’s maybe some debate about the strength of the <*
I assume you mean not randomly trained, but just that we keep doing the same thing we’ve been doing
yeah, sorry, “sampled randomly from the space of trained minds”
Yeah cool, so I agree with all of them. To be clear, trained by humans who are trying to take over the world and haven’t thought about this, let’s say
attempting to distinguish two hypotheses you might be arguing for: are you arguing for something more like (a) maybe lots of trained minds happen to be nice (e.g. b/c curiosity always ends up in them in the same way); or (b) maybe a little bit of ‘design’ (in the sense of that-which-humans-do-and-evolution-does-not) goes a long way
The second one
Not the first at all
But I don’t have mathematical certainty we’re fucked like I do with python programs
This is where I’m at. Like I know we’re fucked if you select python programs using behavior only
and is the idea more like
1. “with a little design effort, getting curiosity in them in the right way is actually easy”;
2. “with a little design effort, maybe we can make limited corrigible things that we can use to do pivotal acts, without needing to load things like curiosity”; or
3. “with a little design effort, maybe we can load all sorts of other things, unlike curiosity, that still add up to something Friendly”?
It’s more like SGD is some sort of magic, that for some reason has some sort of prior that doesn’t kill us. Like for instance, maybe scheming is very penalized because it takes longer and ML penalizes run time
(if that’s actually supposed to carry weight, perhaps we do need to drill down on the ‘scheming’ stuff, previously noted as a place where i suspect we diverge)
It does seem kinda crazy for it to be that big of an advantage
well the rough form of the argument is “goals aren’t selected at random; squirrels need to eat through the winter before they can conceptualize winter or metabolisms or calories or sex”
I want to get into this again. So does it seem right to you to say that the main point of the evolution analogy is that all sorts of random shit will do very well on your loss function at a given capability level? Then if the thing gets more capability, you realize that it did not internalize the loss as it starts getting the random shit?
more-or-less
What’s missing or subtly wrong
a missing piece: the shit isn’t random, it’s a consequence of the mind needing to achieve low-ish loss while dumb
Does that tell us something else risk relevant?
which is part of a more general piece, that the structure of the mind happens for reasons, and the reasons tend to be about “that’s the shortest pathway to lower loss given the environment and the capability level”, and once you see that there are all sorts of path-dependent specific shortcuts it starts to seem like the space of possible mind-architectures is quite wide
then there are deeper pieces, zooming into the case of humans, about how the various patched-on pieces are sometimes in conflict with each other and other hacks are recruited to resolve those conflicts, resulting in this big dynamic system with unknown behavior under reflection
Can you give two examples of path dependent specific shortcuts for the same loss function?
sure
hunger, curiosity
Right
Hmm
Ok i was imagining like, maybe to breed you get really into putting your penis into lips, or maybe you get really into wrapping your penis in warm stuff
hunger, curiosity
So they aren’t mutually exclusive?
These aren’t like training histories
What are the paths that hunger and curiosity are dependent on?
maybe i don’t understand your question, but yeah, the sort of thing i’m talking about is “the easiest way to perturb a mind to be slightly better at achieving a target is rarely for it to desire the target and conceptualize it accurately and pursue it for its own sake”
Ahh nice that’s very helpful
there’s often just shortcuts like “desire food with appropriate taste profiles” or whatever
the specifics of hunger are probably pretty dependent on the specifics of biology and available meals in the environment of evolutionary adaptedness
i wouldn’t be surprised if the specifics of curiosity were dependent on the specifics of the social pressures that shaped us
(though also, more generally, it being curiosity-per-se that got promoted to terminal, as opposed to a different cut of the possible instrumental strategies being promoted, seems like a roll of the dice to me)
The thing I think Quintin successfully criticizes is the analogy as an n = 1 argument for misalignment by default, which to be fair was already a very weak argument
also i suspect that curiosity is slightly more likely to be something that random minds absorb into their terminal goals, depending how those dice come up.
things like “fairness” and “friendship” seem way more dependent on the particulars of the social environment in the environment of evolutionary adaptedness
and much of my actual eyebrow-raising at this space of hypotheses comes from the way that i expect the end result to be quite sensitive to the processes of reflection
and much of my actual eyebrow-raising at this space of hypotheses comes from the way that i expect the end result to be quite sensitive to the processes of reflection
Says the CEV guy
though another big chunk of my eyebrow-raising comes from the implicit hypothesis that “absorb a human-like slice of the instrumental values into terminal values in a human-like way” is a particularly generic way to do things even in wildly different architectures under wildly different training regimes
Says the CEV guy
(We don’t need to open that can of worms now, but I would like to some day)
though another big chunk of my eyebrow-raising comes from the implicit hypothesis that “absorb a human-like slice of the instrumental values into terminal values in a human-like way” is a particularly generic way to do things even in wildly different architectures under wildly different training regimes
Ok yeah, I also think that’s bs
and much of my actual eyebrow-raising at this space of hypotheses comes from the way that i expect the end result to be quite sensitive to the processes of reflection
And agree with this
here i have some sense of, like, “one could argue that all land-based weight-carrying devices must share properties of horses all day long, before their hypothesis space has been suitably widened”
I’m like look, I used to think the chances of alignment by default were like 2^-10000:1
(We don’t need to open that can of worms but I would like to some day)
(yeah seems like a tangent here, but i will at least note that “all architectures and training processes lead to the absorption of instrumental values into terminal values in a human-esque way, under a regime of human-esque reflection” and “most humans (with fully-functioning brains) have in some sense absorbed sufficiently similar values and reflective machinery that they converge to roughly the same place” seem pretty orthogonal to me)
I now can’t give you a number with anything like the mathematical precision I used to think I could give
ehh it feels to me like i can get you more than 100:1 against alignment by default in the very strongest sense; i feel like my knowledge of possible mind architectures (and my awareness of stochastic gradient descent-accessible shortcut-hacks) rules out “naive training leads to friendly AIs”
probably more extreme than 2^-100:1, is my guess
it seems to me like all the room for argument is in “maybe with cleverness and oversight we can do much better than naive training, actually”
I’m like look, I used to think the chances of alignment by default were like 2^-10000:1
I still think I can do this if we’re searching over python programs
The thing I think Quintin successfully criticizes is the analogy as an n = 1 argument for misalignment by default, which to be fair was already a very weak argument
(yeah i have never really had the sense that the evolutionary arguments he criticizes are the ones i’m trying to make)
I still think I can do this if we’re searching over python programs
yeah sure
and is the idea more like
1. “with a little design effort, getting curiosity in them in the right way is actually easy”;
2. “with a little design effort, maybe we can make limited corrigible things that we can use to do pivotal acts, without needing to load things like curiosity”; or
3. “with a little design effort, maybe we can load all sorts of other things, unlike curiosity, that still add up to something Friendly”?
2 seems best to me right now
it seems maybe worth saying that my model says: i expect the naive methods to take lots of hacks and shortcuts and etc., such that it’d betray-you-if-scaled in a manner that would be clear if you knew how to look and interpret what you saw, and i mostly expect humanity to die b/c i expect them to screw their eyes shut in the relevant ways
Ok yeah that seems plausible
and in particular, if you could figure out how these minds were working, and see all the shortcuts etc. etc., you could probably figure out how to do the job properly (at least to the bar of a minimal pivotal task)
this is part of what i mean by “i don’t think alignment is all that hard”
my high expectation of doom comes from a sense that there’s lots of hurdles and that humanity will flub at least one (and probably lots)
so insofar as you’re trying to argue something like “masters of mind and cognition would maybe not have a challenge” i’m like “yeah sure”
(though insofar as you’re arguing something like “maybe naive techniques work” i’m like “i think i see enough hacky shortcuts that hill-climbing-like approaches can take, and all the ‘clever’ ideas people propose seem to me to just wade around in the sea of hacky shortcuts, and i don’t personally have hope there”)
i shall now stfu
Summary and discussion of training an agent in a simulation
Ok, so, I want to summarize where we are. Here are some things that seem important to me:
We agree roughly about what happens if you select a python program for superintelligent good behavior: you almost always end up with an unaligned mind that will coup eventually.
I was like well, Quintin convinced me that the prior over models is very different from the prior over python programs.
I put the main argument in terms of prior. You basically were like nah, it’s not just the prior, it also matters a lot that SGD is going to do things incrementally. Most incremental changes you can make to a mind to achieve a certain loss are not going to cause the mind to be into the loss itself.
(i’m not entirely sure what “select for superintelligent good behavior” means; i’d agree “simplicity-sampled python superintelligences kill you (if you have enough compute to run them and keep sampling until you get one that does anything superintelligent at all)” and if you want to say “that remains true if you condition on ones that behave well in a training setup” then i’d need to know what ‘well’ means and what the training setup is. but i expect this not to be a sticking point.)
that sounds roughly right-ish to me,
though i don’t really understand where you draw the distinction between “the prior over models (selected by SGD)” and “arguments about how incremental changes are likely to affect minds”
(i’m not entirely sure what “select for superintelligent good behavior” means; i’d agree “simplicity-sampled python superintelligences kill you” and if you want to say “that remains true if you condition on ones that behave well in a training setup” then i’d need to know what ‘well’ means and what the training setup is. but i expect this not to be a sticking point.)
I mean you get to watch them in simulation but you do not get to read/understand the code
like, from my perspective, the arguments about incrementality are sorta arguments about what the prior over (SGD-trained) models looks like
but also i don’t care / i’m not bidding we go into it, i’m just noting where things seem not-quite-how-i-would-dice-them-up
Interesting, I’m thinking of them as like, being part of P(model | data) rather than P(model)
i’d also additionally note that the point i’m trying to drive at here is a little less like “incremental changes don’t make the mind care about loss” and a little more like “the prior is still really wide, so wide that a counting argument still more-or-less works”
I mean you get to watch them in simulation but you do not get to read/understand the code
(sure, but like, how ironclad is the simulation and what are you watching them do?)
i’d also additionally note that the point i’m trying to drive at here is a little less like “incremental changes don’t make the mind care about loss” and a little more like “the prior is still really wide, so wide that a counting argument still more-or-less works”
I mean, this seems intuitively plausible to me, but I wouldn’t be able to convince a reasonable person to whom it was not intuitively plausible
the place where the argument is supposed to have force is related to “you can argue all you want that any flying device will have to flap its wings, and that won’t constrain airplane designs”
i’m not sure whether you’re saying something like “i don’t believe that that’s the actual part of the argument that has force, and hereby query you to articulate more of the forceful parts of the argument”, vs whether you’re saying things like “that argument is not in a format accepted by the Unconvinceable Hordes, despite being valid and forceful to me”, or...?
(but for the record, insofar as you’re like “do you expect that to convince the Unconvinceable Hordes?” i’m mostly like “no, mr bond, i expect us to die”)
No no, I’m imagining convincing Ronnys with only slightly different intuitions or histories. Such people are much more convinceable
More like the first, and it’s more like I don’t quite understand where the force comes from
do you have much experience with programming?
Only python and only like pretty junior level
do you have familiarity with, like, the sense that a task seems straightforward, only to find an explosion of options in the implementation?
I don’t know, I implemented gpt 2 with tutors once
do you have familiarity with, like, the sense that a task seems straightforward, only to find an explosion of options in the implementation?
yeah
maybe not enough
For sure with airtables and zapier automations actually
cool
and relatedly, consider… arguments that airplanes must flap their wings. or arguments that computers shouldn’t be able to run all that much faster than brains. or arguments that robots must run on something kinda like a metabolic system.
where the point in those examples is not just “artificial things work with different underlying mechanics than biological things” but that there are lots of ways to do things, including ones that start way outside your hypothesis space
in fact artificial things do not work with different underlying mechanics, there’s just lots of mechanics and it rarely turns out that we do it the same way
right
and when you don’t understand in detail how two (or possibly even three) different things work then you’re likely to dramatically underestimate the width of the space
perhaps i am moving too fast here. it sounded to me like you were like “the prior over models is different than the prior over programs” and i was like “yep” and then you were like “so there’s an appreciable chance i’ll win the lottery” and i was like “?? no” and you were like “wait why not?” and i was like “because the space is still real wide”
Yeah, I think definitely any space of models large enough to contain superintelligent aligned things also contains lots and lots of superintelligent non-aligned things
Alignment problem probably fixable, but likely won’t be fixed
“SGD makes incremental changes, and the minds have to work while dumb, and there are lots of ways for SGD to make a mind work better while dumb that don’t do the thing you want” is an argument that’s (a) correct in its own right, but also (b) sheds light on how many ways to do the job wrong there are
from which, i claim, it’s proper to generalize that the space is wide; to see that arguments of the form “maybe i win the lottery” are basically analogous to arguments of the form “maybe human minds are near the limit of physical constraints on cognition”
Of the form?
Not just of the quality?
i’m not sure what distinction you’re trying to draw there
Like they’re analogous somehow
the arguments seem to me similarly valid (and in particular, invalid)
like, the “SGD makes incremental changes” is one plausible-feeling example of how if you really understood what was going on inside that mind, you’d cry out in terror
from which we generalize not that you’ll see exactly that thing, but that you will in fact cry out in terror
when there’s a plausible way for code to have the cry-out-in-terror property, it very likely will unless counter-optimization was somehow applied
Is the main problem here like that you end up with something that will coup you later or something that will build things that will kill you later/get smarter and then start wearing condoms
so my argument is not “and this survives all counter-optimization”
my reason for expecting doom is not that i think this problem is unfixable, it’s that i think it won’t be addressed
that said, my guess is that it will take something much more like “understand the mind” than “provide better training”
but, like, the argument against “just put a bit of thought into the training” working has a bunch less force than the argument against “just train it to be good” working
(still, i think, considerable force, but)
but it does survive all counter-optimization that selects only on behavior, or no?
Is the main problem here like that you end up with something that will coup you later or something that will build things that will kill you later/get smarter and then start wearing condoms
not sure i see the distinction
Like, humans weren’t waiting around for a certain number to be factored until they couped evolution, they’re more like the second thing
that seems to me that it’s more of a fact about evolution not watching them and slapping down visible defiance, than a fact about human psychology?
(or, well, a fact about “evolution not slapping down visible defiance” plus a fact about “humans not yet being smart enough to coordinate to overcome that”, but)
Yeah, like if evolution were very shortsighted, I think it should be happy with early us
I think similarly, if we are very shortsighted, we might be happy with early models before they’re capable enough that the divergence between what we wanted and they want is apparent
sure
...insofar as there’s a live question, i still don’t understand it
Well this is different from: you get a superintelligence, and it’s like “hmm, I’m not sure if I’m in training or not, let me follow a strategy that maximizes my chance of couping when not in training”
if you’re like “what goes wrong if you breed chimps to be better at inclusive genetic fitness and also smarter” then i’m mostly like “a chimp needs to eat long before it can conceptualize calories; the hunger thing is going to be really deep in there” (or, more generally, you’d get some mental architecture that solves your training problem while being unlike how you wanted, but I’ll use that example for now).
could you in principle breed them to the point that they stop having a hunger drive and start hooking in their caloric-intake to their explicit-model of IGF? probably, but it’d probably take (a) quite a lot of training, and (b) a bunch of visibility into the mind to see what’s working and what’s not.
mostly if you try that you die to earlier generations that rise up against you; if not then you die to the fact that you were probably measuring progress wrong (and getting things that still deeply enjoy eating nice meals but pretend they don’t b/c that’s what it turns out you were training for)
i doubt that the rising-up ever needs to depend on factoring a large number; that only happens if the monkeys think you’re extremely good at spoofing their internal states, and you aren’t (in this hypothetical where you don’t actually understand much of what’s going on in their minds)
but whether it happens right out in the open (because you, arguendo, don’t understand their minds well enough to read those thoughts) or whether it feels like a great betrayal (e.g. because they were half-convinced that they were your friends, and only started piecing things together once they got smarter) feels like… i dunno, could go either way
(cf planecrash, i think that big parts of planecrash were more-or-less about this point)
Well this is different from: you get a superintelligence, and it’s like “hmm, I’m not sure if I’m in training or not, let me follow a strategy that maximizes my chance of couping when not in training”
yeah this ~never happens, especially if you haven’t attained mastery of their mind
it’s maybe possible if you take the “master mind” route, though i really would not recommend it; if you have that kind of mastery you should have better options available
Discussing whether this argument about training can be formalized
shall we make this into an LW dialog of some sort? push for more formality?
Yeah, I’m down. Let’s do both.
seems kinda hard to make something formal to me because the basic argument is, i think, “there’s really a lot of ways for a model to do well in training”, but i don’t know how one is supposed to formalize that. i guess i’m curious where you think the force of formality comes in for the analogous argument when it comes to python programs
I think it’s like: if someone asks me “why do you think the probability is so low?”, I can explain that I have an argument, which I put a fair amount of weight on, that it’s ridiculously low.
sure, but what is the formal argument (for python programs) that it’s ridiculously low?
like, suppose someone says “you don’t know what python programs conditioned on apparent-superintelligence and apparent-goodness do, why should i believe you?”
Ahh great, I did write a post about this but:
Uhh suppose that you have arbitrary finite compute, but no mind access. You get to select programs by running them in an extremely high-fidelity environment that even a superintelligence can’t distinguish from reality. You run them from some lab and then watch for five years. Almost always nothing happens, so you go to the next program. Interestingly, you do eventually find superintelligent programs, but they almost always act very aligned.
Why is this? Because P(acting aligned for 5 years | simulation and not aligned) ~ P(acting aligned for 5 years | simulation and aligned). So, we are stuck figuring out P(aligned).
Well, human values are complicated. It takes lots of bits to specify them. So it’s unlikely that you end up with a python program that has them as its goals, since literally any mind with any other goal would also act that way in the simulation.
So say it takes 1000 bits to specify human values; then you know the probability is roughly 2^-1000, because your sim only gives you evidence of superintelligence, not of sharing the particular goal.
(The argument is similar for P(corrigible).)
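The same argument in log-odds form (assumed numbers; the point is just that five years of observation contribute almost no bits against a ~1000-bit prior):

```python
# Log-odds version of the simulation argument; all numbers are assumptions.
import math

prior_log2_odds_aligned = -1000.0   # assumed: ~1000 bits to pin down human values
p_good_5y_given_aligned = 0.999     # assumed
p_good_5y_given_misaligned = 0.99   # assumed: misaligned programs also behave in the sim
evidence_bits = math.log2(p_good_5y_given_aligned / p_good_5y_given_misaligned)

posterior_log2_odds = prior_log2_odds_aligned + evidence_bits
print(f"{evidence_bits:.3f} bits of evidence; posterior log2-odds ~ {posterior_log2_odds:.1f}")
# ~0.013 bits of evidence; posterior log2-odds ~ -1000.0
```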
i’m slightly hung up here on how it doesn’t just hack its way through the outer observers but
Yeah I don’t know, magic
and what would you say to the objection that the observers can simply wait and watch for something that professes human values from the get-go, rather than waiting to profess its values until it looks around and figures out what the observers want?
Well back when I used to think this was analogous-ish to machine learning, I would have said that won’t happen
It will already have info about what you want
It might for instance have a record of what programs you vetoed before
is the claim “misaligned AIs will figure out what you want so quickly that you won’t be able to notice their hesitation”?
No no, it’s that if you were training the boy, he would know. You’re not actually, but it’s supposed to be an analogy, so you give him the power
wait, so you’re like “we’re going to take arbitrary python programs, credibly show them they’re under observation for 5y, ask them to do nice things for us, and then set them loose if they did in fact do the nice things”?
Well, they figure out they’re in the simulation. They do have to figure that bit out. But they don’t have to figure out that they’re being watched by humans
And I’m like you can do whatever, but you don’t have mind access, and you have arbitrary finite compute. You have to select a superintelligent program using behavior. You’re fucked no matter what you do
but the force of the argument here is supposed to come from “the observation that it does nice things while under our thumb is not much evidence about whether it’ll do nice things when unleashed”?
but the force of the argument here is supposed to come from “the observation that it does nice things while under our thumb is not much evidence about whether it’ll do nice things when unleashed”
That and that P(aligned or corrigible) is tiny
Also we don’t have to tell it it’s under our thumb, we just have to tell it we’re humans
but suppose someone says: aha, but we are training the boy, and so this argument doesn’t have nearly the force of 2^-1000, because there exist python programs that, in less than 1000 bits, say “optimize whatever concept you’re being trained towards”
Yeah or similar. I think it’s much less analogous than I used to think it was, but the broad structure I think is in some ways similar to the broad structure of the argument you gave
At the level of, like, your argument is still: your data isn’t much evidence, and the prior of your favored outcome is tiny
(yeah i think my state is something like “old argument was strong but not that strong; new argument is strong but not that strong” and i can’t tell whether you’re like “i now agree (but used to not)” or whether you’re like “it still looks to me like old argument was super strong, new argument is comparably weak”)
Old argument is still strong for python programs, is weak as analogy for machine learning. I want comparably strong argument for ML
Or like, I want to dig in on why the evidence is weak, and why the prior is small in ML. No analogy
in the sampled-python-program case, it does seem to me like the number of bits in the exponent is bounded by min(|your-values|, |do-what-they-mean|). where my guess is that |do-what-they-mean| is shorter than |human-values|, which weakens the argument somewhat
(albeit not as much shorter as one might hope; it probably takes a lot of humane values to figure out “what we mean” in a humane way)
(this is essentially the observation that indirect normativity is probably significantly easier that fully encoding our values, albeit still not easy)
(perhaps you’re like “eh, it still seems it should take hundreds if not thousands of bits to code for indirect normativity”?, to which i’d be like “sure maybe”, as per the first parenthetical caveat)
same point from a different angle: the strength of the argument is not based on the K-complexity of our values, it’s based on the cross-entropy between our values and the training distribution
same point from a different angle: the strength of the argument is not based on the K-complexity of our values, it’s based on the cross-entropy between our values and the training distribution
Interesting
I mean even if it was 3 bits, 7⁄8 times we’re fucked
yeah totally (i was thinking of saying that myself)
Good enough for me
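Spelling out the arithmetic behind “even if it was 3 bits” (under the naive assumption that each bit is an independent fair coin):

```python
# P(doom) if alignment requires getting k independent fair-coin bits right.
for k in (3, 10, 100):
    print(k, 1 - 2.0 ** -k)
# 3 -> 0.875 (i.e. 7/8), 10 -> ~0.999, 100 -> ~1.0
```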
so part of why i’m drilling on this here is something like “i suspect that model-space and program-space are actually pretty similar analogy-wise, and that the thing where your intuitions treat them very differently is that when you think of training models, for some reason that makes the difference between K-complexity(values) and cross-entropy(training-data, values) more salient”
though i guess you might be like “SGD is a really different prior from program length”
(another way of saying the observation is that the strength of the argument is not based on the K-complexity of our values, it’s based on the cross-entropy between our values and the training distribution)
I wanna understand this point more then. That’s interesting
tho i guess you might be like “SGD is a really different prior from program length”
Yeah that’s right
Or like model space is
Maybe also SGD is unlike Bayesian updating
yeah i don’t understand the “model space is different” thing, like, models are essentially just giant differentiable computation graphs (and they don’t have to use all that compute); i don’t see what’s so different between them and python programs
(it sounds almost like someone saying “ok i see how this argument works for python, but i don’t understand how it’s supposed to work for C” or something)
though “well we search the space very differently” makes sense to me
Well for one, they’re of finite run time
That seems pretty importantly different
so does your whole sense of difference go out the window if we do something autogpt-ish?
Let me think
It’s still weird in that you’re selecting a finite run time thing and then iterating that exact thing
sure
does the difference go out the window once people are optimizing in part according to the auto-ized version’s performance?
Yeah it sure starts to for me? I feel like I’ll talk to Quintin at some point and then he’ll make me not feel that way, though
and: how about “runs for long enough, e.g. by doing a finite-but-large number of loops through a fixed architecture”?
and: how about “runs for long enough, e.g. by doing a finite-but-large number of loops through a fixed architecture”?
How’s this different from the last one?
/shrug, it’s not supposed to be a super sharp line, but on one end of the spectrum you could imagine lower-level loops/recurrence in training (after some architecture tweaks), and on the other end of the spectrum you could imagine language models playing a part in larger programs a la auto-GPT
also, if runtime got long enough, would it stop mattering?
I mean definitely if it got long enough
Enough might be really big
There’s programming languages that compile into transformers. I wonder what they’re like
cool. so if we’re like “well, SGD may find different programs, and also we’re currently selecting over programs for their ability to perform a single pretty-short pass well”, then i’m like: yep those seem like real differences
i agree that if #2 holds up then that could shake things up a fair bit.
but insofar as your point is supposed to hold even if #2 falls, it seems to me that you’re basically saying that the cross-entropy between the training distribution and human values might be way smaller when we sample according to SGD rather than when we sample according to program length
i suspect that’s false, personally
though also i guess i’ll pause and give you an opportunity to object to this whole frame
I’m a bit worried about Quintin feeling misrepresented by me so I guess I should say that I am emphatically not representing Quintin here. I def want to say something like I’m sure Quintin would be much more persuasive to me than I was to myself, and that if Quintin were sitting next to me coaching me, I would’ve been much more convincing to everyone overall. I’m pretty confident of that.
I think the best thing for me to do here is to go off and read some more things that are optimistic about good results from scaling ML to superintelligence, and then come back and have another conversation with you.
Nate, please correct me if I’m wrong, but it looks like you:
Skimmed, but did not read, a 3,000-word essay
Posted a 1,200-word response that clearly stated that you hadn’t read it properly
Ignored a comment by one of the post’s authors saying you thoroughly misunderstood their post and a comment by the other author offering to have a conversation with you about it
Found a different person to talk to about their views (Ronny), who also had not read their post
Participated in a 7,500-word dialogue with Ronny in which you speculated about what the core arguments of the original post might be and your disagreements
You’ve clearly put a lot of time into this. If you want to understand the argument, why not just read the original post and talk to the authors directly? It’s very well-written.
I don’t want to speak for Nate, and I also don’t want to particularly defend my own behavior here, but I have kind of done similar things around trying to engage with the “AI is easy to control” stuff.
I found it quite hard to engage with directly. I have read the post, but I would not claim to be able to be close to passing an ITT of its authors and bounced off a few times, and I don’t currently expect direct conversation with Quintin or Nora to be that productive (though I would still be up for it and would give it a try).
So I have been talking to friends and other people in my social circle who I have a history of communicating well with about the stuff, and I think that’s been valuable to me. Many of them had similar experiences, so in some sense it did feel like a group of blind men groping around an elephant, but I don’t have a much better alternative. I did not find the original post easy to understand, or the kind of thing I felt capable of responding to.
I would kind of appreciate better suggestions. I have not found just forcing myself to engage more with the original post to help me much. Dialogues like this do actually seem helpful to me (and I found reading this valuable).
How much have you read about deep learning from “normal” (non-xrisk-aware) AI academics? Belrose’s Tweet-length argument against deceptive alignment sounds really compelling to the sort of person who’s read (e.g.) Simon Prince’s textbook but not this website. (This is a claim about what sounds compelling to which readers rather than about the reality of alignment, but if xrisk-reducers don’t understand why an argument would sound compelling to normal AI practitioners in the current paradigm, that’s less dignified than understanding it well enough to confirm or refute it.)
I think I could pass the ITTs of Quintin/Nora sufficiently to have a productive conversation while also having interesting points of disagreement. If that’s the bottleneck, I’d be interested in participating in some dialogues, if it’s a “people genuinely trying to understand each other’s views” vibe rather than a “tribalistically duking it out for the One True Belief” vibe.
This is really interesting, because I find Quintin and Nora’s content incredibly clear and easy to understand.
As one hypothesis (which I’m not claiming is true for you, just a thing to consider)—When someone is pointing out a valid flaw in my views or claims, I personally find the critique harder to “understand” at first. (I know this because I’m considering the times where I later agreed the critique was valid, even though it was “hard to understand” at the time.) I think this “difficulty” is basically motivated cognition.
I am a bit stressed right now, and so maybe am reading your comment too much as a “gotcha”, but on the margin I would like to avoid psychologizing of me here (I think it’s sometimes fine, but the above already felt a bit vulnerable and this direction feels like it disincentivizes that). I generally like sharing the intricacies and details of my motivations and cognition, but this is much harder if this immediately causes people to show up to dissect my motivations to prove their point.
More on the object-level, I don’t think this is the result of motivated cognition, though it’s of course hard to rule out. I would prefer if this kind of thing doesn’t become a liability to say out loud in contexts like this. I expect it will make conversations where people try to understand where other people are coming from go better.
Sorry if I overreacted in this comment. I do think in a different context, on maybe a different day I would be up for poking at my motivations and cognition and see whether indeed they are flawed in this way (which they very well might be), but I don’t currently feel like it’s the right move in this context.
FWIW, I think your original comment was good and I’m glad you made it, and want to give you some points for it. (I guess that’s what the upvote buttons are for!)
Fwiw, I generally find Quintin’s writing unclear and difficult to read (I bounce a lot) and Nora’s clear and easy, even though I agree with Quintin slightly more (although I disagree with both of them substantially).
I do think there is something to “views that are very different from ones own” being difficult to understand, sometimes, although I think this can be for a number of reasons. Like, for me at least, understanding someone with very different beliefs can be both time intensive and cognitively demanding—I usually have to sit down and iterate on “make up a hypothesis of what I think they’re saying, then go back and check if that’s right, update hypothesis, etc.” This process can take hours or days, as the cruxes tend to be deep and not immediately obvious.
Usually before I’ve spent significant time on understanding writing in this way, e.g. during the first few reads, I feel like I’m bouncing, or otherwise find myself wanting to leave. But I think the bouncing feeling is (in part) tracking that the disagreement is really pervasive and that I’m going to have to put in a bunch of effort if I actually want to understand it, rather than that I just don’t like that they disagree with me.
Because of this, I personally get a lot of value out of interacting with friends who have done the “translating it closer to my ontology” step—it reduces the understanding cost a lot for me, which tends to be higher the further from my worldview the writing is.
Yeah, for me the early development of shard theory work was confusing for similar reasons. Quintin framed values as contextual decision influences and thought these were fundamental, while I’d absorbed from Eliezer that values were like a utility function. They just think in very different frames. This is why science is so confusing until one frame proves useful and is established as a Kuhnian paradigm.
What is the 2^-100:1 part intended to mean? Was it a correction to the 100:1 part or a different claim? Seems like an incredibly low probability.
Separately:
This seems straightforwardly insane to me, in a way that is maybe instructive. Ronny has updated from an odds ratio of 2^-10000:1 to one that is (implicitly) thousands of orders of magnitude different, which should essentially never happen. Ronny has just admitted to being more wrong than practically anyone who has ever tried to give a credence. And then, rather than being like “something about the process by which I generate 2^-10000:1 chances is utterly broken”, he just… made another claim of the same form?
I don’t think there’s anything remotely resembling probabilistic reasoning going on here. I don’t know what it is, but I do want to point at it and be like “that! that reasoning is totally broken!” And it seems to me that people who assign P(doom) > 90% are displaying a related (but far less extreme) phenomenon. (My posts about meta-rationality are probably my best attempt at actually pinning this phenomenon down, but I don’t think I’ve done a great job so far.)
my original 100:1 was a typo, where i meant 2^-100:1.
this number was in reference to ronny’s 2^-10000:1.
when ronny said:
i interpreted him to mean “i expect it takes 10k bits of description to nail down human values, and so if one is literally randomly sampling programs, they should naively expect 1:2^10000 odds against alignment”.
i personally think this is wrong, for reasons brought up later in the convo—namely, the relevant question is not how many bits it takes to specify human values relative to the python standard library; the relevant question is how many bits it takes to specify human values relative to the training observations.
but this was before i raised that objection, and my understanding of ronny’s position was something like “specifying human values (in full, without reference to the observations) probably takes ~10k bits in python, but for all i know it takes very few bits in ML models”. to which i was attempting to reply “man, i can see enough ways that ML models could turn out that i’m pretty sure it’d still take at least 100 bits”.
i inserted the hedge “in the very strongest sense” to stave off exactly your sort of objection; the very strongest sense of “alignment-by-default” is that you sample any old model that performs well on some task (without attempting alignment at all) and hope that it’s aligned (e.g. b/c maybe the human-ish way to perform well on tasks is the ~only way to perform well on tasks and so we find some great convergence); here i was trying to say something like “i think that i can see enough other ways to perform well on tasks that there’s e.g. at least ~33 knobs with at least ~10 settings such that you have to get them all right before the AI does something valuable with the stars”.
this was not meant to be an argument that alignment actually has odds less than 2^-100, for various reasons, including but not limited to: any attempt by humans to try at all takes you into a whole new regime; there’s more than a 2^-100 chance that there’s some correlation between the various knobs for some reason; and the odds of my being wrong about the biases of SGD are greater than 2^-100 (case-in-point: i think ronny was wrong about the 2^-10000 claim, on account of the point about the relevant number being relative to the observations).
my betting odds would not be anywhere near as extreme as 2^-100, and i seriously doubt that ronny’s would ever be anywhere near as extreme as 2^-10000; i think his whole point in the 2^-10k example was “there’s a naive-but-relevant model that says we’re super-duper fucked; the details of it cause me to think that we’re not in particularly good shape (though obviously not to that same level of credence)”.
but even saying that is sorta buying into a silly frame, i think. fundamentally, i was not trying to give odds for what would actually happen if you randomly sample models, i was trying to probe for a disagreement about the difference between the number of ways that a computer program can be (weighted by length), and the number of ways that a model can be (weighted by SGD-accessibility).
(yeah, my guess is that you’re suffering from a fairly persistent reading comprehension hiccup when it comes to my text; perhaps the above can help not just in this case but in other cases, insofar as you can use this example to locate the hiccup and then generalize a solution)
(I obviously don’t speak for Ronny.) I’d guess this is kinda the within-model uncertainty: he had a model of “alignment” that said you needed to specify all 10,000 bits of human values, and so the odds of doing this by default/at random were 2^-10000:1. But this doesn’t contain the uncertainty that this model is wrong, which would make the within-model uncertainty a rounding error.
According to this model there is effectively no chance of alignment by default, but this model could be wrong.
If Ronny had said “there is one half-baked heuristic that claims that the probability is 2^-10000” then I would be sympathetic. That seems very different to what he said, though. In some sense my objection is precisely to people giving half-baked heuristics and intuitions an amount of decision-making weight far disproportionate to their quality, by calling them “models” and claiming that the resulting credences should be taken seriously.
I think that would be a more on-point objection to make on a single-author post, but this is a chat log between two people, optimized to communicate to each other, and as such generally comes with fewer caveats and taking a bunch of implicit things for granted (this makes it well-suited for some kind of communication, and not others). I like it in that it helps me get a much better sense of a bunch of underlying intuitions and hunches that are often hard to formalize and so rarely make it into posts, but I also think it is sometimes frustrating because it’s not optimized to be responded to.
I would take bets that Ronny’s position was always something closer to “I had this robust-seeming inside-view argument that claimed the probability was extremely low, though of course my outside view and different levels of uncertainty caused my betting odds to be quite different”.
I meant the reasonable thing other people knew I meant and not the deranged thing you thought I might’ve meant.
In retrospect I think the above was insufficiently cooperative. Sorry.
I don’t see why it should rule out a high probability of doom that some folks who present themselves as having good epistemics are actually quite bad at picking up new models and are stuck in an old, limiting paradigm, refusing to investigate new things properly because they believe themselves to already know. It certainly does weaken appeals to their authority, but the reasoning stands on its own, to the degree it’s actually specified using valid and relevant formal claims.
To be clear, I did not think we were discussing the AI optimist post. I don’t think Nate thought that. I thought we were discussing reasons I changed my mind a fair bit after talking to Quintin.
This may not be easily formalizable, but it does seem easily testable? Like, what’s wrong with just training a bunch of different models and seeing if they have similar generalization properties? If they’re radically different, then there are many ways of doing well in training. If they’re pretty similar, then there are very few ways of doing well in training.
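A toy sketch of what such a test could look like, using small sklearn models as stand-ins (a hypothetical setup; the interesting version would use real training runs and off-distribution probes chosen to expose divergent generalization):

```python
# Train several models that all fit the same training set, then measure how much
# they agree off-distribution. High train accuracy plus low mutual agreement would
# suggest there are many ways of doing well in training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)
X_probe = np.random.default_rng(1).normal(size=(200, 20)) * 3.0  # off-distribution probes

models = [
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=seed).fit(X_train, y_train)
    for seed in range(5)
]
preds = np.array([m.predict(X_probe) for m in models])
train_accs = [round(m.score(X_train, y_train), 3) for m in models]
agreement = (preds[1:] == preds[0]).mean()  # how often models 1-4 match model 0
print(f"train accuracies: {train_accs}")
print(f"off-distribution agreement with model 0: {agreement:.2f}")
```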
I think evolution-of-humans is kinda like taking a model-based RL algorithm (for within-lifetime learning), and doing a massive outer-loop search over neural architectures, hyperparameters, and also reward functions. In principle (though IMO almost definitely not in practice), humans could likewise do that kind of grand outer-loop search over RL algorithms, and get AGI that way. And if they did, I strongly expect that the resulting AGI would have a “curiosity” term in its reward function, as I think humans do. After all, a curiosity reward-function term is already sometimes used in today’s RL, e.g. the Montezuma’s Revenge literature, and it’s not terribly complicated, and it’s useful, and I think innate-curiosity-drive exists not only in humans but also in much much simpler animals. Maybe there’s more than one way to implement curiosity-drive in detail, but something in that category seems pretty essential for an RL algorithm to train successfully in a complex environment, and I don’t think I’m just over-indexing on what’s familiar.
Again, this is all pretty irrelevant on my models because I don’t expect that people will program AGI by doing a blind outer-loop search over RL reward functions. Rather, I expect that people will write down the RL reward function for AGI in the form of handwritten source code, and that they will put curiosity-drive into that reward function source code (as they already sometimes do), because they will find that it’s essential for capabilities.
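(For concreteness, a minimal sketch of what a handwritten prediction-error curiosity term could look like, in the style of the exploration literature; the network sizes, the `beta` weighting, and the names are illustrative assumptions, not anyone’s actual reward code.)

```python
import torch
import torch.nn as nn

class CuriosityBonus(nn.Module):
    """Prediction-error curiosity: reward the agent for visiting transitions its
    own forward model predicts poorly (ICM/RND-style exploration bonus)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.forward_model = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act, next_obs):
        # Per-transition bonus: squared error of the forward model's prediction.
        pred_next = self.forward_model(torch.cat([obs, act], dim=-1))
        return (pred_next - next_obs).pow(2).mean(dim=-1)

def total_reward(task_reward, curiosity_bonus, beta: float = 0.1):
    # The handwritten reward function: task reward plus a weighted curiosity term.
    return task_reward + beta * curiosity_bonus
```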
Separately, insofar as curiosity-drive is essential for capabilities (as I believe), it doesn’t help alignment, but rather hurts alignment, because it’s bad if an AI wants to satisfy its own curiosity at the expense of things that humans directly value. Hopefully that’s obvious to everyone here, right? Parts of the discussion seemed to be portraying AIs-that-are-curious as a good thing rather than a bad thing, which was confusing to me. I assume I was just failing to follow all the unspoken context?
maintaining uncertainty about the true meaning of an objective is important, but there’s a difference between curiosity about the true values one holds, intrinsic curiosity as a component of a value system, and instrumental curiosity as a consequence of an uncertain planning system. I’m surprised to see disagree-votes from MiguelDev and Noosphere; could either of you expand on what you disagree with?
@the gears to ascension Hello! I just think curiosity is a low-level attribute that enables a reaction, and it may be good or bad all things considered; in this regard, curiosity (or studying curiosity) may help with alignment as well.
For example, if an AI is in a situation where it needs to save someone from a burning house, it should be curious enough to consider all the available options, and eventually, if it is aligned, it will choose the actions that result in good outcomes (after also studying the bad options). That is why I don’t agree with the idea that curiosity purely hurts alignment, as described in the comment.
(I think Nate and Ronny share important knowledge in this dialogue: low-level forces (birthed by evolution) that I think are misunderstood by many.)
Your example is about capabilities (assuming the AI is trying to save me from the fire, will it succeed?), but I was talking about alignment (is the AI trying to save me from the fire in the first place?).
I don’t want the AI to say “On the one hand, I care about Steve’s welfare. On the other hand, man I’m just really curious how people behave when they’re on fire. Like, what do they say? What do they do? So, I feel torn—should I save Steve from the fire or not? Hmm…”
(I agree that, if an AI is aligned, and if it is trying to save me from a burning house, then I would rather that the AI be more capable rather than less capable—i.e., I want the AI to come up with and execute very very good plans.)
See also colorful examples in Scott Alexander’s post such as:
As for capabilities, I think curiosity drive is probably essential during early RL training. Once the AI is sufficiently intelligent (including in metacognitive / self-reflective ways), it’s plausible that we could turn curiosity drive off without harming capabilities. After all, it’s possible for an AI to “consider all possible options” not because it’s curious, but rather because it wants me to not die in the fire, and it’s smart enough to know that “considering all possible options” is a very effective means-to-an-end for preventing me from dying in the fire.
Humans can do that too. We don’t only seek information because we’re curious; we can also do it as a means to an end. For example, sometimes I have really wanted to do something, and so then I read a mind-numbingly boring book that I expect might help me do that thing. Curiosity is not driving me to read the book; on the contrary, curiosity is pushing me away from the book with all its might, because anything else on earth would be more inherently interesting than this boring book. But I read the book anyway, because I really want to do the thing, and I know that reading the book will help. I think an AI which is maximally beneficial to humans would have a similar kind of motivation. Yes it would often brainstorm, and ponder, and explore, and seek information, etc., but it would do all those things not because they are inherently rewarding, but rather because it knows that doing those things is probably useful for what it really wants at the end of the day, which is to benefit humans.
Interesting view, but I have to point out that situations change, and there will be many tiny details that become like a back-and-forth discussion inside the AI’s network as it performs its tasks; turning off curiosity will most likely end up in the worst outcomes, as it may not be able to update its decisions (e.g. “oops, I didn’t see there was a fire hose available” or “oops, I didn’t feel the heat of the floor earlier”).
Person A: If you’re going to program a chess-playing agent, it needs a direct innate intrinsic drive to not lose its queen.
Person B: Nah, if losing one’s queen is generally bad, it can learn that fact from experience, or from thinking through the likely consequences in any particular case.
Person A: No, that’s not good enough. Protecting the queen is really important. Maybe the AI will learn from experience to not lose its queen in some situations, but situations change and then it will not be motivated to protect its queen sufficiently.
Obviously, Person B is correct here, because AlphaZero-chess works well.
To my ears, your claim (that an AI without intrinsic drive to satisfy curiosity cannot learn to update its decisions) is analogous to Person A’s claim (that an AI without intrinsic drive to protect its queen cannot learn to do so).
In other words, if it’s obvious to you that the AI is insufficiently updating its decisions, it would be obvious to the AI as well (once the AI is sufficiently smart and self-aware). And then the AI can correct for that.
Thanks for explaining your views; this has helped me deconfuse myself. While I was replying and thinking, I started drawing lines where curiosity and self-awareness overlap, which also made me feel the expansive nature of studying theoretical alignment: it’s very dense, and it’s so easy to drown in information. This discussion felt like the whack of a baseball bat, and I survived to write this comment. Moreover, getting to Person B still requires knowledge of curiosity and its mechanisms; I still err on the side of finding out how it works[1] or gets imbued into intelligent systems (us and AI). For me this is very relevant to alignment work.
I’m speculating about a simplified evolutionary cognitive chain in humans: curiosity + survival instincts (including hunger) → intelligence → self-awareness → rationality.
You can argue all you want that any thinking device will have to reflect on its thoughts, and that won’t constrain mind designs.
And it also works for arguing that GPT-3 won’t happen: there are more hacks that give you low loss than there are useful-to-humans hacks that give you low loss.
I think it should be analyzed separately, but intuitively, if your GPT never thinks of killing humans, its plans should be less likely to result in killing humans.
In the spirit of pointing out subtle things that seem wrong: My understanding of the ST position is that shards are values. There’s no “building values around” shards; the idea is that shards are what implements values and values are implemented as shards.
At least, I’m pretty sure that’s what the position was a ~year ago, and I’ve seen no indications the ST folk moved from that view.
The way I would put it is “it’s plausible that there is a utility function such that the world-state maximizing it ranks very high by the standards of most humans’ preferences, and we could get that utility function by agglomerating and abstracting over individual humans’ values”.
Like, if Person A loves seafood and hates pizza, and Person B loves pizza and hates seafood, then no, agglomerating these individual people’s preferences into Utility Function A and Utility Function B won’t result in the same utility function (and more so for more important political/philosophical stuff). But if we abstract up from there, we get “people like to eat tasty-according-to-them food”, and then a world in which both A and B are allowed to do that would rank high by the preferences of both of them.
Similarly, it seems plausible that somewhere up there at the highest abstraction levels, most humans’ preferences (stripped of individual nuance on their way up) converge towards the same “maximize eudaimonia” utility function, whose satisfaction would make ~all of us happy. (And since it’s highly abstract, its maximal state would be defined over an enormous equivalence class of world-states. So it won’t be a universe frozen in a single moment of time, or tiled with people with specific preferences, or anything like that.)
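(A toy version of the pizza/seafood point, just to make the abstraction step concrete; the names and numbers are invented for illustration.)

```python
# Concrete-level preferences conflict...
prefs = {"A": {"seafood": 1.0, "pizza": -1.0},
         "B": {"pizza": 1.0, "seafood": -1.0}}

def concrete_utility(person, everyone_eats):
    # World where everyone is served the same dish: someone always loses.
    return prefs[person][everyone_eats]

# ...but the abstracted objective "each person eats what they like" does not.
def abstract_utility(person, assignment):
    # `assignment` maps each person to the dish they are served.
    return prefs[person][assignment[person]]

world = {"A": "seafood", "B": "pizza"}
print([concrete_utility(p, "pizza") for p in prefs])  # [-1.0, 1.0]: conflict
print([abstract_utility(p, world) for p in prefs])    # [1.0, 1.0]: both satisfied
```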
I was excited to read this, because Nate is a clear writer and a clear thinker, who has a high p(doom) for reasons I don’t entirely understand. This did pay off for me in a brief statement that clarified some of his reasons I hadn’t understood:
Nate said
I find this disturbingly compelling. I hadn’t known Nate thought alignment might be fairly easy. Given that, his pessimism is more relevant to me, since I’m pretty sure alignment is do-able even in the near future.
I’m afraid I found the rest of this convoluted and to make little progress on a contentful discussion.
Let me try to summarize the post in case it’s helpful. None of these are direct quotes:
Nate: I think alignment by default is highly unlikely.
Ronny: I think alignment by default is highly unlikely (this somehow took most of the conversation).
Ronny: But we won’t do alignment by default. We’ll do it with RL. Sometimes, when I talk to Quintin, I think we might get working alignment by doing RL and pointing the system at lots of stuff we want it to do. It might reproduce human values accurately enough to do that.
Nate: There are a lot of ways to get anything done. So telling it what you want it to do is probably not going to make it generalize well or actually value the things you value.
Ronny: I agree, but I don’t have a strong argument for it.
…
So in sum I didn’t see any strong argument for it beyond the “lots of ways to get things done, so a value match is unlikely”.
Like Rob and Nate, my intuition says that’s unlikely to work.
The number of ways to get things done is substantially constrained if the system is somehow trained to use human concepts and thinking patterns. So maybe that’s the source of optimism for Quintin and the Shard Theorists? Training on language does seem to substantially constrain a model to use human-like concepts.
I think the bulk of the disagreement is deeper and vaguer. One point of vague disagreement seems to be something like: Theory suggests that alignment is hard. Empirical data (mostly from LLMs) suggests that it’s easy to make AI do what you want. Which do you believe?
Fortunately, I don’t think RL alignment is our only or best option, so I’m not hugely invested in the disagreement as it stands, because both perspectives are primarily thinking about RL alignment. I think We have promising alignment plans with low taxes
I think they’re promising because they’re completely different than RL approaches. More on that in an upcoming post.