Some object-level stuff first: I think my main disagreement comes down to:
Being [well-behaved as far as we can tell] in training is always very weak evidence that behaviour will generalize as we’d wish it to.
I don’t say “aligned in the training data”, since alignment is about robust generalization of good behaviour. Evidence of alignment is evidence of desirable generalization. Eliezer isn’t claiming we won’t get approximately perfect behaviour (as far as we can tell) on the training data; he’s claiming that this gets us almost nowhere in terms of alignment.
Caveat—this is contingent on what counts as ‘behaviour’ and on our tools; if behaviour includes activations, and our tools have hugely improved, this may be progress.
Arguments against particular failure modes often come down to [from what we can tell, inductive bias will tend to push against this particular type of failure].
Of course here I’d point at “from what we can tell” and “tend to”.
However, the more fundamental point is that we have no reason to think that inductive bias pushes towards success either. Does the simplest solution compatible with [good behaviour as far as we can tell] on the training data generalize exactly as we’d wish it to? Why would this be the case? Does the fastest? Again, why would we expect this? Does the [insert our chosen metric]est? Why?
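As a minimal toy illustration of this point (polynomials standing in for trained models; entirely my own construction, not anyone’s actual training setup): many functions fit the same training data exactly, and nothing in “fits the training data” privileges the one whose off-distribution behaviour is the one we wanted.

```python
import numpy as np

# Training data: five points from the behaviour we'd *like* to see generalized.
x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = x_train ** 2  # the "intended" rule

# Hypothesis A: the degree-4 polynomial fit through the training points.
base = np.polynomial.Polynomial.fit(x_train, y_train, deg=4)

def hypothesis_a(x):
    return base(x)

def hypothesis_b(x):
    # Add a term that is zero at every training point, so training behaviour
    # is untouched while off-distribution behaviour changes arbitrarily.
    vanishing = np.prod([x - xi for xi in x_train], axis=0)
    return base(x) + 3.0 * vanishing

print(np.allclose(hypothesis_a(x_train), y_train))  # True: A fits the training data
print(np.allclose(hypothesis_b(x_train), y_train))  # True: so does B
x_test = np.array([5.0, 10.0])                      # "deployment" inputs
print(hypothesis_a(x_test))                         # approximately [25, 100]
print(hypothesis_b(x_test))                         # wildly different
```

Which of these (or of the infinitely many others) a training process actually lands on is a fact about its inductive bias, and the questions above are asking why that bias should favour the generalization we want.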
I do expect that there exists some metric with a rich set of inputs (including weights, activations etc.) that would give robustly desirable generalization.
I expect that finding such a metric will require deep understanding.
Expecting a simple metric found based on little understanding to be sufficient is equivalent to assuming that there’s something special about the kind of generalization we would like (other than that we like it).
This is baseless—it’s why I don’t like the term “misgeneralization”, since it can suggest that there’s some natural ‘correct’ generalization, which would be the default outcome if nothing goes wrong. There is no such natural correct generalization (or at least, I’ve seen/made no argument for one—I think natural abstractions may get you [AI will understand our ontology/values], but not that it’s likely to generalize according to those values).
One reply to this is that we don’t have to be that precise—just look at humans. However, humans aren’t an example of successful alignment. (see above—and below)
“OTOH, if I put doom in the reference class of “things I used to believe, kinda” then perhaps I should feel comfortable putting e.g. 10^-5 credence in doom, since I put << 10^-5 credence in Christianity being true, and < 10^-5 credence in Marxism (although the truth conditions for Marxism are murkier).”
A few points here:
Given some claim x you can always find some category it belongs to that contains either [things much more likely to be true than x] or [things much less likely to be true than x] - particularly if you cherry-pick even within that category.
A general principle is that you need to use all your bearing-on-x evidence if you want to form an accurate estimate for x (and since you won’t have time, you want some unbiased approximation). If you pick a small subset of available evidence without care to avoid bias, then your estimate will tend to be badly wrong.
If the only evidence you had were [I had an argument for a very weird conclusion that I now realize is invalid], you’d be reasonable in thinking the conclusion highly unlikely—but this is not your only evidence.
It’s a pretty standard mistake to overcompensate when moving from [I believe [thing with significant influence on how I live my life]] to [I don’t believe [thing with significant influence on how I live my life]]. It’s hard to break away from a strongly held, motivating belief, but it’s even harder to do so without overcorrecting. In fact, I’d guess that initial overcorrection is often the rational thing to do if we’re aiming at having an accurate assessment later.
It might be bad form to focus on psychology in debates, and I’d like to be clear that my claim is not [Nora/you are clearly making such errors]. The claim I will make is that reflecting on our own psychological reasons to want to believe x should be a standard tool. Ideally we’d do it all the time, but it’s most important when some aspect of your model/argument/belief-state is just as you’d wish it to be—that’s a red flag. A complex, important-to-you thing being almost exactly as you’d wish it to be should be highly surprising, and therefore somewhat suspicious.
For example, I might:
Want to be certain about x.
Want x to be true.
Want my conclusion to appear measured/reasonable/balanced. (I’m so wise with my integration of twelve different perspectives and nuanced 60% credence in x!)
Only you have much hope to get at what’s going on in your head—but it’s important to look (and to be highly suspicious of reflex justifications that just happen to point at exactly the conclusions you’d wish them to).
Obviously I also need to do this, I also frequently fail etc. (many failures being of the form [not even noticing a question])
Going from [believe x] to [disbelieve x] tends to happen when I falsify my arguments for x. However, this shouldn’t take me to [disbelieve x], but initially only to [I believed x for invalid reasons]. Once I make the update to [my reasons were invalid] it’s important for me to reassess my takes on e.g. [the best-informed people believe x for reasons like mine] or [the reasons I believe(d) x are among the strongest arguments for x].
Psychological red flag here: it’s nicer to believe [all the people who believe x had invalid reasons] than to believe [I had invalid reasons, but perhaps others had good reasons I didn’t find/understand].
I sort of agree with this, but with a huge caveat, in that if an anthropologist 100,000 years ago somehow managed to understand the innate reward system, they would likely predict that the values of humans would be essentially fairly universal things like empathy for the ingroup, parental instinct, and revenge, and they would have an impressive track record of such predictions.
[Note: in the following, I’m saying [if such reasoning is used, it doesn’t lead where we’d like it to], and not [I fully endorse such reasoning] - though it’s at least plausible]
I expect that they may have made some good predictions on future behaviour (after taking a break to invent writing, elementary logic and suchlike...). However this works primarily on a [predict that <instrumentally useful for maintenance/increase of influence> things become values] basis.
That kind of approach allows us to make plausible predictions only so long as it’s difficult to preserve/increase influence—the constraint [you ‘must’ act so as to maintain/increase your influence on the future] tells us a lot in such cases.
Once the constraints are removed (simple example being a singleton ASI), such reasoning tells us nothing: maintenance/increase of influence is easy, so the agent has huge freedom.
What will an agent tend to want in such circumstances? Likely what it wanted before, only generalized by processes that would have been instrumentally useful. Note in particular that there’s never any pressure towards (behaviour x should generalize desirably to situations where there aren’t constraints). Precisely the reverse: behaviour in unconstrained situations is a degree of freedom we should expect to be used to increase influence in constrained situations.
The same reasoning that gets you to [empathy for the ingroup] gets you to [gain influence over the future] - I note again here that humans are in a game-theoretic situation where a lot of cooperation and nice/kind behaviour tends to coincide with maintaining/gaining influence (and/or tended to do so in the ancestral environment).
Decisions where various values have influence would tend to get resolved by [would have been instrumentally useful] processes. Importantly, such processes may contain pointers—e.g. to [figure out this value] or [calculate who gains here] or [find the best plan for …] (likely not explicitly in this form—but with some level of indirection).
If we e.g. dial up the available resources, should we usually expect [process that had desirable outcomes with fewer resources] to continue to have desirable outcomes? Only to the extent that there was strong pressure for the process to be robust in this sense. Will this be reliably true? No.
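A toy numerical sketch of this question (my construction; the quadratic “true value” and the linear proxy are arbitrary stand-ins): a process that optimizes a proxy which tracked the desirable outcome over a small option space keeps looking fine at low resource levels, and stops being fine once the searchable space grows.

```python
import numpy as np

def true_value(x):
    # What we actually care about: peaks at x = 5, declines afterwards.
    return x - 0.1 * x ** 2

def proxy(x):
    # A proxy that agreed directionally with true_value over small x.
    return x

# "Resources" = the size of the option space the process can search.
for budget in [1, 3, 5, 10, 30, 100]:
    candidates = np.linspace(0.0, budget, 1001)
    chosen = candidates[np.argmax(proxy(candidates))]  # the process optimizes the proxy
    print(f"budget={budget:>3}  chosen={chosen:6.1f}  true value={true_value(chosen):8.1f}")

# Small budgets: proxy optimization looks desirable (true value rises).
# Large budgets: the same process drives true value strongly negative.
```

Nothing about the process having been fine at low budgets constrained its behaviour at high budgets; that robustness would have needed its own source of pressure.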
By default, we get no guarantees here. We might hope to get guarantees to the extent that we have good understanding of how internal processes will generalize, and great understanding of self-correction mechanisms. If I imagine a scenario where something with humanlike values (or indeed a group) becomes more and more powerful, yet things go well, this relies on great caution together with extremely good self-understanding and self-correction. (I don’t expect these things to be used by default in a trained system, since any simpler [or preferred-by-inductive-bias] shortcut will be preferred to the general version)
One issue here is that ideally it’d be nice to test what a system would do without constraints (or with reduced constraints). However, we can’t do this so long as we maintain the ability to disempower it: that’s an extreme constraint.
But to summarize, I’d say:
I don’t expect we’ll get [generalizes like a human] without much better understanding, since I don’t expect that this is the outcome of inductive bias and [behaves like a human in training as far as we can tell].
If we did get [generalizes like a human], it wouldn’t be a win condition without a bunch of understanding. (since I expect we’d need great understanding, I do think it’d be progress—but almost entirely due to the understanding)
This will be a long comment, so get a drink and a snack.
Being [well-behaved as far as we can tell] in training is always very weak evidence that behaviour will generalize as we’d wish it to.
I agree with this, assuming a zero prior, but I expect to disagree on how strong a prior is necessary in order to generalize correctly.
Expecting a simple metric found based on little understanding to be sufficient is equivalent to assuming that there’s something special about the kind of generalization we would like (other than that we like it).
My claim is essentially the opposite of this: the reason humans generalized correctly from limited examples of stuff like empathy for the ingroup (where empathy for the ingroup could be replaced by almost any value, and is thus a placeholder) and didn’t just trick their reward system isn’t that special, and it’s basically a consequence of weak prior information from the genome, plus the innate reward system using backpropagation or a weaker variant of it to update the neural circuitry, reinforcing certain behaviors and penalizing others.
The same reasoning that gets you to [empathy for the ingroup] gets you to [gain influence over the future] - I note again here that humans are in a game-theoretic situation where a lot of cooperation and nice/kind behaviour tends to coincide with maintaining/gaining influence (and/or tended to do so in the ancestral environment).
This was meant to be an example of the values that the innate reward system could align us to, not of what results from holding this set of values. When I use an example, it’s essentially a wildcard: it can stand for almost arbitrary values.
I don’t expect we’ll get [generalizes like a human] without much better understanding, since I don’t expect that this is the outcome of inductive bias and [behaves like a human in training as far as we can tell].
This turns out to be a crux, in that, compared to the majority of LWers like you, I think the understanding required is probably minimal.
that the reason humans generalized correctly to having human values and didn’t just trick their reward system isn’t that special
This is a tautology, not an example of successful alignment: Humans trick their reward systems as much as humans trick their reward systems.
Imagine a case where we did “trick our reward system”. In such a case the human values we’d infer would be those that we’d infer from all the actions we were taking—including the actions that were “tricking our reward system”.
We would then observe that we’d generalized entirely correctly with respect to the values we inferred. From this we learn that things tend to agree with themselves. This tells us precisely nothing about alignment.
I note for clarity that it occurs to me to say: Indeed we do observe some humans doing what most of us would think of as tricking their reward systems (e.g. self-destructive drug addictions). You may respond “Ah, but that’s a small proportion of people—most people don’t do that!”—at which point we’re back to tautology: what most people do will determine what is meant by “human values”. Most people are normal, since that’s how ‘normal’ is defined.
The only possible evidence I could provide that we do “trick our reward system” is to point to things that aren’t normal, which must necessarily be unusual.
If you’re only going to think that alignment is hard if I can point to a case where most people are doing something unusual, then I’m out of options: that’s not a possible world.
I’ll rewrite that to “generalized correctly from limited examples of stuff like empathy for the ingroup, where empathy for the ingroup here could be replaced by almost any value and is thus a placeholder”, because I accidentally made a tautology here.
I don’t think it’s accidental—it seems to me that the tautology accurately indicates where you’re confused.
“generalised correctly” makes an equivalent mistake: correctly compared to what? Most people generalise according to the values we infer from the actions of most people? Sure. Still a tautology.
“generalised correctly” makes an equivalent mistake: correctly compared to what?
Treacherous-turn failure modes, examples of which are below:
Humans seeming to have empathy for, say, 25 years in order to play nice with their parents, and then making a treacherous turn to, say, kill other people in their ingroup.
More generally, humans mostly avoid the treacherous-turn failure mode, where an agent appears to have values consistent with human morals, but then reveals that it didn’t have those values all along and hurts other people.
Relatedly, the extreme stability of human values is evidence that it’s very difficult for a human to execute a treacherous turn.
That’s the type of thing I call generalizing correctly, since it basically rules out deceptive alignment out of the gate, contra Evan Hubinger’s fear of deceptively aligned AIs.
In general, one of the miracles is that the innate reward system plus very weak genetic priors can rule out so many dangerous types of generalizations, which is a big source of my optimism here.
For this kind of thing to be evidence, you’d need the human treacherous turn to be a convergent instrumental strategy to achieve many goals.
The AI case for treacherous turns is:
AI ends up with weird-by-our-lights goal. (e.g. a rough proxy for the goal we intended)
The AI cooperates with us until it can seize power.
The AI does a load of treacherous-by-our-lights stuff in order to seize power.
The AI uses the power to effectively pursue its goal.
We don’t observe this in almost any human, since almost no human has the option to gain enormous power through treachery.
When humans do have the option to gain enormous power through treachery, they do sometimes do this. Of course, even for the potentially-powerful it’s generally more effective not to screw people over (all else being equal), or at least not to be noticed screwing people over. Preserving options for cooperation is useful for psychopaths too.
The treacherous turn argument is centrally about instrumentally useful treachery. Randomly killing other people is very rarely useful. No-one is claiming that AI treachery will be based on deciding to be randomly nasty.
If we gave everyone a take-over-the-world button that only works if they first pretend that they’re lovely for 25 years, certainly some people would do this—though by no means all.
And here we’re back to the tautology issue: Why is it considered treacherous for someone to pretend to be lovely for 25 years and then take over the world, such that many people wouldn’t want to do it? Because for a long time we’ve lived in a world where actions similar to this did not lead to cultures that win (noting here that this level of morality is cultural more than genetic—so we’re selecting for cultures-that-win).
If actions similar to this did lead to winning cultures, after a long time we’d expect to see [press button after pretending for 25 years] to be both something that most people would do, and something that most people would consider right to do.
We were never likely to observe common, universally-horrifying behaviour: If it were detrimental to a (sub)culture, it’d be selected against and wouldn’t exist. If it benefitted a culture, it’d be selected for, and no longer considered horrific. (if it were approximately neutral, it’d similarly no longer be considered horrific—though I expect it’d take a fair bit longer: [considering things horrific] imposes costs; if it’s not beneficial, we’d expect it to be selected out)
If it were just too hard to get correct generalization, where “correct” here means [sufficient for humans to persist over many generations], then we wouldn’t observe incorrect generalization: we wouldn’t be here. If anything, we’d find that everything else had adapted so that an achievable degree of correct generalization were sufficient. We’d see things like socially enforced norms, implicit threats of violence, judicial systems etc. This [achievable degree of correct generalization] would then be called “correct generalization”.
Again, I don’t see a plausible counterfactual world such that “correct” generalization would seem hard from within the world itself. Sufficiently correct generalization must be commonplace. “Sufficiently correct” is what the people will call “correct”.
My view on this is unfortunately unlikely to be resolved in a comment thread, but I’ll say 2 things about human values and evidence bases that can be clarified here:
This: “If it were just too hard to get correct generalization, where ‘correct’ here means [sufficient for humans to persist over many generations], then we wouldn’t observe incorrect generalization: we wouldn’t be here. If anything, we’d find that everything else had adapted so that an achievable degree of correct generalization were sufficient. We’d see things like socially enforced norms, implicit threats of violence, judicial systems etc. This [achievable degree of correct generalization] would then be called ‘correct generalization’.”
is probably not correct; we can in fact update normally from the fact that human behavior is surprisingly good. This is probably a case of anthropic-shadow reasoning, and there are reasonable arguments against the anthropic shadow existing.
For more on this, I’d read SSA Rejects Anthropic Shadow by Jessica Taylor (https://www.lesswrong.com/posts/EScmxJAHeJY5cjzAj/ssa-rejects-anthropic-shadow-too) and Anthropically Blind: The Anthropic Shadow is Reflectively Inconsistent by Christopher King (https://www.lesswrong.com/posts/LGHuaLiq3F5NHQXXF/anthropically-blind-the-anthropic-shadow-is-reflectively).
I have a different causal story from yours about why this happens:
“Why is it considered treacherous for someone to pretend to be lovely for 25 years and then take over the world, such that many people wouldn’t want to do it?”
At least for my own causal story on why people don’t usually want to take over the world and kill people, it goes something like this:
There is a weak prior in the genome for stuff like not taking power to kill people in your ingroup, and the prior is weak enough that we can treat it as a wildcard symbol, such that aligning it to some other value more or less works.
The brain’s innate reward system uses DPO, RLHF, or whatever else is used to create a preference model, to guide the intelligence into being aligned to whatever values the innate reward system wants, like, say, empathy for the ingroup (albeit this is only a motivating example).
It uses backprop or a weaker variant of it, and at a high level probably uses an optimizer that is at best comparable to gradient descent. Since it has white-box access and can update the brain in a sort of targeted way, it can efficiently compute the optimal direction to improve its performance on, say, having empathy for the ingroup (but again, this is a wildcard symbol, in that it could stand in for almost any values).
The loop of weak prior + innate reward system + an algorithm to implement it (like backprop or its weaker variants) means that eventually, by age 25, the human is very aligned with the values that the innate reward system put in place, like empathy for the ingroup (again, this is only an example of an alignment target; you could put almost arbitrary alignment targets in there).
That’s my story of how humans are mostly able to avoid misgeneralization, and learn values correctly in the vast majority of cases. (A toy sketch of this loop follows below.)
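The toy sketch mentioned above (deliberately minimal Python; the scalar “behaviour parameter” and the quadratic innate-reward signal are stand-ins chosen for illustration, not claims about actual brain mechanisms): the same weak-prior-plus-reward-driven-update loop converges to whatever target the reward signal happens to encode, which is the “wildcard” point.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_agent(reward_target, steps=200, lr=0.1):
    """Caricature of the loop above: weak prior (near-zero init) + an
    innate-reward-style signal + gradient-like targeted updates."""
    theta = rng.normal(scale=0.01)  # weak prior: an almost uninformative initialization
    for _ in range(steps):
        # Innate reward is highest when behaviour matches the target value;
        # this is the gradient of -(theta - reward_target)**2 w.r.t. theta.
        grad = -2.0 * (theta - reward_target)
        theta += lr * grad  # backprop-ish, white-box, targeted update
    return theta

# The same loop "aligns" the agent to whichever target the reward system encodes.
for target in [1.0, -3.0, 0.5]:
    print(target, round(train_agent(target), 3))
```

The only point of the toy is that the loop lands on whichever target the reward signal encodes; the target really is a wildcard.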
I’m not reasoning anthropically in any non-trivial sense—only claiming that we don’t expect to observe situations that can’t occur with more than infinitesimal probability.
This isn’t a [we wouldn’t be there] thing, but a [that situation just doesn’t happen] thing.
My point then is that human behaviour isn’t surprisingly good. It’s not surprisingly good for human behaviour to usually follow the values we infer from human behaviour. This part is inevitable—it’s tautological.
Some things we could reasonably observe occurring differently are e.g.:
More or less variation in behaviour among humans.
More or less variation in behaviour in atypical situations.
More or less external requirements to keep behaviour generally ‘good’.
More or less deviation between stated preferences and revealed preferences.
However, I don’t think this bears on alignment, and I don’t think you’re interpreting the evidence reasonably.
As a simple model, consider four possibilities for traits:
x is common and good.
y is uncommon and bad.
z is uncommon and good.
w is common and bad.
x is common and good (e.g. empathy): evidence for correct generalisation!
y is uncommon and bad (e.g. psychopathy): evidence for mostly correct generalization!
z is uncommon and good (e.g. having boundless compassion): not evidence for misgeneralization, since we’re only really aiming for what’s commonly part of human values, not outlier ideals.
w is common and bad (e.g. selfishness, laziness, rudeness...) - choose between:
[w isn’t actually bad, all things considered… correct generalization!]
[w is common and only mildly bad, so it’s best to consider it part of standard human values—correct generalization!]
It seems to me that the only evidence you’d accept of misgeneralization would be [terrible and common] - but societies where terrible-for-that-society behaviours were common would not continue to exist (in the highly unlikely case that they existed in the first place).
Common behaviour that isn’t terrible for society tends to be considered normal/ok/fine/no-big-deal over time, if not initially (that or it becomes uncommon) - since there’d be a high cost both individually and societally to consider it a big deal if it’s common.
If you consider any plausible combination of properties to be evidence for correct generalization, then of course you’ll think there’s been correct generalization—but it’s an almost empty claim, since it rules out almost nothing.
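To make “it rules out almost nothing” concrete, here is a minimal sketch (my framing of the assessments above as a function): every (common/uncommon, good/bad) combination ends up classified as some flavour of correct generalization, so no possible observation gets labelled misgeneralization.

```python
def assess(common: bool, good: bool) -> str:
    """Apply the assessment pattern above to a trait."""
    if common and good:            # e.g. empathy
        return "correct generalization"
    if not common and not good:    # e.g. psychopathy
        return "mostly correct generalization"
    if not common and good:        # e.g. boundless compassion
        return "not counted (outlier ideal, not the target)"
    # common and bad, e.g. selfishness: folded into 'standard human values'
    return "reinterpreted as correct generalization"

for common in (True, False):
    for good in (True, False):
        print(f"common={common}, good={good} -> {assess(common, good)}")
# No combination is ever mapped to "misgeneralization".
```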
Most people tend to act in ways that preserve/increase their influence, power, autonomy and relationships, since this is useful almost regardless of their values. This is not evidence of correct generalization—it’s evidence that these behaviours are instrumentally useful within the environment ([not killing people] being one example).
To get evidence of something like ‘correct’ generalization, you’d want to look at circumstances where people get to act however they want without the prospect of any significant negative consequence being imposed on them from outside.
Such circumstances are rarely documented (documentation being a potential source of negative consequences). However, I’m going to go out on a limb and claim that people are not reliably lovely in such situations. (though there’s some risk of sampling bias here: it usually takes conscious effort to arrange for there to be no consequences for significant actions, meaning there’s a selection effect for people/systems that wish to be in situations without consequences)
I do think it’d be interesting to get data on [what do humans do when there are truly no lasting consequences imposed externally], but that’s very rare.
I did try to provide a causal story for why humans could be aligned to some value without relying on societal incentives that much, so you can check out the second part of my comment.
It’s not surprisingly good for human behaviour to usually follow the values we infer from human behaviour.
My non-tautological claim is that the reason isn’t behavioral, but instead internal, and in particular the innate reward system plays a big role here.
In essence, my story on how humans are aligned with the values of the innate reward system wasn’t relying on a behavioral property.
I’ll reproduce it, so that you can focus on the fact that it didn’t rely on behavioral analysis:
There is a weak prior in the genome for stuff like not taking power to kill people in your ingroup, and the prior is weak enough that we can treat it as a wildcard symbol, such that aligning it to some other value more or less works.
The brain’s innate reward system uses DPO, RLHF, or whatever else is used to create a preference model, to guide the intelligence into being aligned to whatever values the innate reward system wants, like, say, empathy for the ingroup (albeit this is only a motivating example).
It uses backprop or a weaker variant of it, and at a high level probably uses an optimizer that is at best comparable to gradient descent. Since it has white-box access and can update the brain in a sort of targeted way, it can efficiently compute the optimal direction to improve its performance on, say, having empathy for the ingroup (but again, this is a wildcard symbol, in that it could stand in for almost any values).
The loop of weak prior + innate reward system + an algorithm to implement it (like backprop or its weaker variants) means that eventually, by age 25, the human is very aligned with the values that the innate reward system put in place, like empathy for the ingroup (again, this is only an example of an alignment target; you could put almost arbitrary alignment targets in there).
Critically, it makes very little reference to society or behavioral analysis, so I wasn’t making the mistake you said I made.
It is also no longer a tautology, as it depends on the innate reward system actually rewarding desired behavior by changing the brain’s weights, and removing the innate reward system or showing that the weak prior + value learning strategy was ineffective would break my thesis.
This still seems like the same error: what evidence do we have that tells us the “values the innate reward system put in place”? We have behaviour.
We don’t know that [system aimed for x and got x]. We know only [there’s a system that tends to produce x].
We don’t know the “values of the innate reward system”.
The reason I’m (thus far) uninterested in a story about the mechanism, is that there’s nothing interesting to explain. You only get something interesting if you assume your conclusion: if you assume without justification that the reward system was aiming for x and got x, you might find it interesting to consider how that’s achieved—but this doesn’t give you evidence for the assumption you used to motivate your story in the first place.
In particular, I find it implausible that there’s a system that does aim for x and get x (unless the ‘system’ is the entire environment): If there are environmental regularities that tend to give you elements of x without your needing to encode them explicitly, those regularities will tend to be ‘used’ - since you get them for free. There’s no selection pressure to encode or preserve those elements of x.
If you want to sail quickly, you take advantage of the currents.
So I don’t think there’s any reasonable sense in which there’s a target being hit. If a magician has me select a card, looks at it, then tells me that’s exactly the card they were aiming for me to pick, I’m not going to spend energy working out how the ‘trick’ worked.
It sounds like we’ve gotten to the crux of my optimism, in that you think that for a system to aim for x, it essentially needs to be the entire environment, and that the environment largely dictates human values, whereas I think human values are less dependent on the environment, and far more dependent on the genome + learning process. Equivalently, I place a lot more emphasis on the internals of humans as the main contributor to values, while you emphasize the external environment a lot more than internals like the genome or learning process.
This could be disentangled into 2 cruxes:
Where are human values generated?
How cheap is it to specify values, or alternatively, how weak do our priors need to be to encode values (if you are encoding values internally)?
My answers would be mostly internal (the genome + learning process, with a little help from the environment) on the first question, and relatively cheap to specify values on the second; you’d probably answer that the environment basically sets the values, with little or no help from the internals of humans, on the first question, and that specifying values is very expensive on the second.
For some of my reasoning on this, I’d probably read some posts like these:
The central crux really isn’t where values are generated. That’s a more or less trivial aside. (though my claim was simply that it’s implausible the values aimed for would be entirely determined by genome + learning process; that’s a very weak claim; 98% determined is [not entirely determined])
The crux is the tautology issue: I’m saying there’s nothing to explain, since the source of information we have on [what values are being “aimed for”] is human behaviour, and the source of information we have on what values are achieved, is human behaviour.
These things must agree with one-another: the learning process that produced human values produces human values. From an alignment difficulty perspective, that’s enough to conclude that there’s nothing to learn here.
An argument of the form [f(x) == f(x), therefore y] is invalid. f(x) might be interesting for other reasons, but that does nothing to rescue the argument.
The crux is the tautology issue: I’m saying there’s nothing to explain, since the source of information we have on [what values are being “aimed for”] is human behaviour, and the source of information we have on what values are achieved, is human behaviour.
That’s our disagreement, in that we have more information than that. I agree human behavior plays a role in my evidence base, but there’s more evidence I have than that.
In particular I am using results from both ML/AI and human brain studies to inform my conclusion.
Basically, my claim is that [f(x) == f(y), therefore z].
But humans are capable of thinking about what their values “actually should be” including whether or not they should be the values evolution selected for (either alone or in addition to other things). We’re also capable of thinking about whether things like wireheading are actually good to do, even after trying it for a bit.
We don’t simply commit to tricking our reward systems forever and only doing that, for example.
So that overall suggests a level of coherency and consistency in the “coherent extrapolated volition” sense. Evolution enabled CEV without us becoming completely orthogonal to evolution, for example.
We don’t have the option to “trick our reward systems forever”—e.g. because becoming a heroin addict tends to be self-destructive. If [guaranteed 80-year continuous heroin high followed by painless death] were an option, many people would take it (though not all).
The divergence between stated preferences and revealed preferences is exactly what we’d expect to see in worlds where we’re constantly “tricking our reward system” in small ways: our revealed preferences are not what we think they “actually should be”.
We tend to define large ways of tricking our reward systems as those that are highly self-destructive. It’s not surprising that we tend to observe few of these, since evolution tends to frown upon highly self-destructive behaviour.
Again, I’d ask for an example of a world plausibly reachable through an evolutionary process where we don’t have the kind of coherence and consistency you’re talking about.
Being completely orthogonal to evolution clearly isn’t plausible, since we wouldn’t be here (I note that when I don’t care about x, I sacrifice x to get what I do care about—I don’t take actions that are neutral with respect to x). Being not-entirely-in-line with evolution, and not-entirely-in-line with our stated preferences is exactly what we observe.
Some object-level stuff first:
I think my main disagreement comes down to:
Being [well-behaved as far as we can tell] in training is always very weak evidence that behaviour will generalize as we’d wish it to.
I don’t say “aligned in the training data”, since alignment is about robust generalization of good behaviour. Evidence of alignment is evidence of desirable generalization. Eliezer isn’t claiming we won’t get approximately perfect behaviour (as far as we can tell) on the training data; he’s claiming that this gets us almost nowhere in terms of alignment.
Caveat—this is contingent on what counts as ‘behaviour’ and on our tools; if behaviour includes activations, and our tools have hugely improved, this may be progress.
Arguments against particular failure modes often come down to [from what we can tell, inductive bias will tend to push against this particular type of failure].
Of course here I’d point at “from what we can tell” and “tend to”.
However, the more fundamental point is that we have no reason to think that inductive bias pushes towards success either.
Does the simplest solution compatible with [good behaviour as far as we can tell] on the training data generalize exactly as we’d wish it to? Why would this be the case?
Does the fastest? Again, why would we expect this?
Does the [insert our chosen metric]est? Why?
I do expect that there exists some metric with a rich set of inputs (including weights, activations etc.) that would give robustly desirable generalization.
I expect that finding such a metric will require deep understanding.
Expecting a simple metric found based on little understanding to be sufficient is equivalent to assuming that there’s something special about the kind of generalization we would like (other than that we like it).
This is baseless—it’s why I don’t like the term “misgeneralization”, since it can suggest that there’s some natural ‘correct’ generalization, which would be the default outcome if nothing goes wrong. There is no such natural correct generalization (or at least, I’ve seen/made no argument for one—I think natural abstractions may get you [AI will understand our ontology/values], but not that it’s likely to generalize according to those values).
One reply to this is that we don’t have to be that precise—just look at humans. However, humans aren’t an example of successful alignment. (see above—and below)
A few points here:
Given some claim x you can always find some category it belongs to that contains either [things much more likely to be true than x] or [things much less likely to be true than x] - particularly if you cherry-pick even within that category.
A general principle is that you need to use all your bearing-on-x evidence if you want to form an accurate estimate for x (and since you won’t have time, you want some unbiased approximation). If you pick a small subset of available evidence without care to avoid bias, then your estimate will tend to be badly wrong.
If the only evidence you had were [I had an argument for a very weird conclusion that I now realize is invalid], you’d be reasonable in thinking the conclusion were highly unlikely—but this is not your only evidence.
It’s a pretty standard mistake to overcompensate when moving from [I believe [thing with significant influence on how I live my life]] to [I don’t believe [thing with significant influence on how I live my life]].
It’s hard to break away from a strongly held, motivating belief, but it’s even harder to do so without overcorrecting. In fact, I’d guess that initial overcorrection is often the rational thing to do if we’re aiming at having an accurate assessment later.
It might be bad form to focus on psychology in debates, and I’d like to be clear that my claim is not [Nora/you are clearly making such errors].
The claim I will make is that reflecting on our own psychological reasons to want to believe x should be a standard tool. Ideally we’d do it all the time, but it’s most important when some aspect of your model/argument/belief-state is just as you’d wish it to be—that’s a red flag.
A complex, important-to-you thing being almost exactly as you’d wish it to be should be highly surprising, and therefore somewhat suspicious.
For example, I might:
Want to be certain about x.
Want x to be true.
Want my conclusion to appear measured/reasonable/balanced. (I’m so wise with my integration of twelve different perspectives and nuanced 60% credence in x!)
Only you have much hope to get at what’s going on in your head—but it’s important to look (and to be highly suspicious of reflex justifications that just happen to point at exactly the conclusions you’d wish them to).
Obviously I also need to do this, I also frequently fail etc. (many failures being of the form [not even noticing a question])
Going from [believe x] to [disbelieve x] tends to happen when I falsify my arguments for x. However, this shouldn’t take me to [disbelieve x], but initially only to [I believed x for invalid reasons]. Once I make the update to [my reasons were invalid] it’s important for me to reassess my takes on e.g. [the best-informed people believe x for reasons like mine] or [the reasons I believe(d) x are among the strongest arguments for x].
Psychological red flag here: it’s nicer to believe [all the people who believe x had invalid reasons] than to believe [I had invalid reasons, but perhaps others had good reasons I didn’t find/understand].
[Note: in the following, I’m saying [if such reasoning is used, it doesn’t lead where we’d like it to], and not [I fully endorse such reasoning] - though it’s at least plausible]
I expect that they may have made some good predictions on future behaviour (after taking a break to invent writing, elementary logic and suchlike...). However this works primarily on a [predict that <instrumentally useful for maintenance/increase of influence> things become values] basis.
That kind of approach allows us to make plausible predictions only so long as it’s difficult to preserve/increase influence—the constraint [you ‘must’ act so as to maintain/increase your influence on the future] tells us a lot in such cases.
Once the constraints are removed (simple example being a singleton ASI), such reasoning tells us nothing: maintenance/increase of influence is easy, so the agent has huge freedom.
What will an agent tend to want in such circumstances? Likely what it wanted before, only generalized by processes that would have been instrumentally useful. Note in particular that there’s never any pressure towards (behaviour x should generalize desirably to situations where there aren’t constraints). Precisely the reverse: behaviour in unconstrained situations is a degree of freedom we should expect to be used to increase influence in constrained situations.
The same reasoning that gets you to [empathy for the ingroup] gets you to [gain influence over the future] - I note again here that humans are in a game-theoretic situation where a lot of cooperation and nice/kind behaviour tends to coincide with maintaining/gaining influence (and/or tended to do so in the ancestral environment).
Decisions where various values have influence would tend to get resolved by [would have been instrumentally useful] processes. Importantly, such processes may contain pointers—e.g. to [figure out this value] or [calculate who gains here] or [find the best plan for …] (likely not explicitly in this form—but with some level of indirection).
If we e.g. dial up the available resources, should we usually expect [process that had desirable outcomes with fewer resources] to continue to have desirable outcomes? Only to the extent that there was strong pressure for the process to be robust in this sense. Will this be reliably true? No.
By default, we get no guarantees here. We might hope to get guarantees to the extent that we have good understanding of how internal processes will generalize, and great understanding of self-correction mechanisms.
If I imagine a scenario where something with humanlike values (or indeed a group) becomes more and more powerful, yet things go well, this relies on great caution together with extremely good self-understanding and self-correction. (I don’t expect these things to be used by default in a trained system, since any simpler [or preferred-by-inductive-bias] shortcut will be preferred to the general version)
One issue here is that ideally it’d be nice to test what a system would do without constraints (or with reduced constraints). However, we can’t do this so long as we maintain the ability to disempower it: that’s an extreme constraint.
But to summarize, I’d say:
I don’t expect we’ll get [generalizes like a human] without much better understanding, since I don’t expect that this is the outcome of inductive bias and [behaves like a human in training as far as we can tell].
If we did get [generalizes like a human], it wouldn’t be a win condition without a bunch of understanding. (since I expect we’d need great understanding, I do think it’d be progress—but almost entirely due to the understanding)
This will be a long comment, so get a drink and a snack.
I agree with this, assuming 0 prior, but I expect to disagree on the strength of the prior necessary in order to generalize correctly.
My claim is essentially the opposite of this, that the reason humans generalized correctly from limited examples of stuff like empathy for the ingroup, where empathy for the ingroup here could be replaced by almost any value and is thus a placeholder and didn’t just trick their reward system isn’t that special, and that it’s basically a consequence of weak prior information from the genome plus the innate reward system using backpropagation or a weaker variant of it to update the neural circuitry to reinforce certain behaviors and penalizing others.
This was meant to be an example of the values that the innate reward system could align us to, not what things resulted from holding this set of values. When I use an example, it’s essentially a wildcard, such that it can stand for almost arbitrary values.
This turns out to be a crux, in that I think that the understanding required is probably minimal, compared to the majority of LWers like you.
This is a tautology, not an example of successful alignment:
Humans trick their reward systems as much as humans trick their reward systems.
Imagine a case where we did “trick our reward system”. In such a case the human values we’d infer would be those that we’d infer from all the actions we were taking—including the actions that were “tricking our reward system”.
We would then observe that we’d generalized entirely correctly with respect to the values we inferred. From this we learn that things tend to agree with themselves. This tells us precisely nothing about alignment.
I note for clarity that it occurs to me to say:
Indeed we do observe some humans doing what most of us would think of as tricking their reward systems (e.g. self-destructive drug addictions).
You may respond “Ah, but that’s a small proportion of people—most people don’t do that!”—at which point we’re back to tautology: what most people do will determine what is meant by “human values”. Most people are normal, since that’s how ‘normal’ is defined.
The only possible evidence I could provide that we do “trick our reward system” is to point to things that aren’t normal, which must necessarily be unusual.
If you’re only going to think that alignment is hard if I can point to a case where most people are doing something unusual, then I’m out of options: that’s not a possible world.
I’ll rewrite that to “generalized correctly from limited examples of stuff like empathy for the ingroup, where empathy for the ingroup here could be replaced by almost any value and is thus a placeholder”, because I accidentally made a tautology here.
I don’t think it’s accidental—it seems to me that the tautology accurately indicates where you’re confused.
“generalised correctly” makes an equivalent mistake: correctly compared to what? Most people generalise according to the values we infer from the actions of most people? Sure. Still a tautology.
Treacherous turn failure modes, which examples will be posted below:
Humans seeming to have empathy only for say 25 years in order to play nice with their parents, and then making a treacherous turn to say kill other people that are part of their ingroup.
More generally, humans mostly avoid what’s called the treacherous turn type failure mode, where it appears to have values consistent with human morals, but then reveals that it didn’t have those values all along, and hurt other people.
More generally, the extreme stability of values gives evidence that it’s very difficult to have a human that executes a treacherous turn.
That’s the type of thing which I call generalizing correctly, since it basically excludes deceptive alignment out of the gate, contra Evan Hubinger’s fear of AIs having deceptive alignment.
In general, one of the miracles is that the innate reward system plus very weak genetic priors can rule out so many dangerous types of generalizations, which is a big source of my optimism here.
For this kind of thing to be evidence, you’d need the human treacherous turn to be a convergent instrumental strategy to achieve many goals.
The AI case for treacherous turns is:
AI ends up with weird-by-our-lights goal. (e.g. a rough proxy for the goal we intended)
The AI cooperates with us until it can seize power.
The AI does a load of treacherous-by-our-lights stuff in order to seize power.
The AI uses the power to effectively pursue its goal.
We don’t observe this in almost any human, since almost no human has the option to gain enormous power through treachery.
When humans do have the option to gain enormous power through treachery, they do sometimes do this.
Of course, even for the potentially-powerful it’s generally more effective not to screw people over (all else being equal), or at least not to be noticed screwing people over. Preserving options for cooperation is useful for psychopaths too.
The treacherous turn argument is centrally about instrumentally useful treachery.
Randomly killing other people is very rarely useful.
No-one is claiming that AI treachery will be based on deciding to be randomly nasty.
If we gave everyone a take-over-the-world button that only works if they first pretend that they’re lovely for 25 years, certainly some people would do this—though by no means all.
And here we’re back to the tautology issue:
Why is it considered treacherous for someone to pretend to be lovely for 25 years, then take over the world, so that many people wouldn’t want to do it? Because for a long time we’ve lived in a world where actions similar to this did not lead to cultures that win (noting here that this level of morality is cultural more than genetic—so we’re selecting for cultures-that-win).
If actions similar to this did lead to winning cultures, after a long time we’d expect to see [press button after pretending for 25 years] to be both something that most people would do, and something that most people would consider right to do.
We were never likely to observe common, universally-horrifying behaviour:
If it were detrimental to a (sub)culture, it’d be selected against and wouldn’t exist.
If it benefitted a culture, it’d be selected for, and no longer considered horrific.
(if it were approximately neutral, it’d similarly no longer be considered horrific—though I expect it’d take a fair bit longer: [considering things horrific] imposes costs; if it’s not beneficial, we’d expect it to be selected out)
If it were just too hard to get correct generalization, where “correct” here means [sufficient for humans to persist over many generations], then we wouldn’t observe incorrect generalization: we wouldn’t be here.
If anything, we’d find that everything else had adapted so that an achievable degree of correct generalization were sufficient. We’d see things like socially enforced norms, implicit threats of violence, judicial systems etc. This [achievable degree of correct generalization] would then be called “correct generalization”.
Again, I don’t see a plausible couterfactual world such that “correct” generalization would seem hard from within the world itself. Sufficiently correct generalization must be commonplace. “Sufficiently correct” is what the people will call “correct”.
My view on this is unfortunately unlikely to be resolved in a comment thread, but 2 things I’ll say about human values and evidence bases can be clarified here:
This: “If it were just too hard to get correct generalization, where “correct” here means [sufficient for humans to persist over many generations], then we wouldn’t observe incorrect generalization: we wouldn’t be here.
“If anything, we’d find that everything else had adapted so that an achievable degree of correct generalization were sufficient. We’d see things like socially enforced norms, implicit threats of violence, judicial systems etc. This [achievable degree of correct generalization] would then be called “correct generalization”.
Is probably not correct, and we can in fact update normally from the fact that human behavior is surprisingly good, in that this is probably a case of anthropic shadow, which has reasonable arguments against it existing.
For more on this, I’d read SSA Rejects Anthropic Shadow by Jessica Taylor and Anthropically Blind: The Anthropic Shadow is Reflectively Inconsistent by Christopher King.
Links are below:
https://www.lesswrong.com/posts/LGHuaLiq3F5NHQXXF/anthropically-blind-the-anthropic-shadow-is-reflectively
https://www.lesswrong.com/posts/EScmxJAHeJY5cjzAj/ssa-rejects-anthropic-shadow-too
I have a different causal story from yours about why this happens: “Why is it considered treacherous for someone to pretend to be lovely for 25 years, then take over the world, so that many people wouldn’t want to do it?”
At least for my own causal story on why people don’t usually want to take over the world and kill people, it goes something like this:
There is a weak prior in the genome for stuff like not taking power to kill people in your ingroup, and the prior is weak enough such that we can make it as a wildcard symbol such that aligning it to some other value more or less works.
The brain’s innate reward system uses DPO, RLHF or whatever else is used to create a preference model to guide the intelligence into being aligned to whatever values the innate reward system wants like say empathy for the ingroup, albeit this is only a motivating example.
It uses backprop or a weaker variant of it, and at a high level probably uses an optimizer that is probably at best comparable to Gradient descent, and since it has white-box access and can update the brain in a sort of targeted way, it can efficiently compute the optimal direction to improve it’s performance on say having empathy for the ingroup, but again this is a wildcard symbol in that it could stand in for almost any values.
The loop of weak prior + innate reward system + algorithm to implement it like backprop or it’s weaker variants means that eventually, the human by 25 years old is very aligned with the values that the innate reward system put in place like empathy for the ingroup, albeit again this is only an example of an alignment target, you could put almost arbitrary alignment targets in there.
That’s my story of how humans are mostly able to avoid misgeneralization, and learn values correctly in the vast majority of cases.
I’m not reasoning anthropically in any non-trivial sense—only claiming that we don’t expect to observe situations that can’t occur with more than infinitesimal probability.
This isn’t a [we wouldn’t be there] thing, but a [that situation just doesn’t happen] thing.
My point then is that human behaviour isn’t surprisingly good.
It’s not surprisingly good for human behaviour to usually follow the values we infer from human behaviour. This part is inevitable—it’s tautological.
Some things we could reasonably observe occurring differently are e.g.:
More or less variation in behaviour among humans.
More or less variation in behaviour in atypical situations.
More or less external requirements to keep behaviour generally ‘good’.
More or less deviation between stated preferences and revealed preferences.
However, I don’t think this bears on alignment, and I don’t think you’re interpreting the evidence reasonably.
As a simple model, consider four possibilities for traits:
x is common and good.
y is uncommon and bad.
z is uncommon and good.
w is common and bad.
x is common and good (e.g. empathy): evidence for correct generalisation!
y is uncommon and bad (e.g. psychopathy): evidence for mostly correct generalization!
z is uncommon and good (e.g. having boundless compassion): not evidence for misgeneralization, since we’re only really aiming for what’s commonly part of human values, not outlier ideals.
w is common and bad (e.g. selfishness, laziness, rudeness...) - choose between:
[w isn’t actually bad, all things considered… correct generalization!]
[w is common and only mildly bad, so it’s best to consider it part of standard human values—correct generalization!]
It seems to me that the only evidence you’d accept of misgeneralization would be [terrible and common] - but societies where terrible-for-that-society behaviours were common would not continue to exist (in the highly unlikely case that they existed in the first place).
Common behaviour that isn’t terrible for society tends to be considered normal/ok/fine/no-big-deal over time, if not initially (that or it becomes uncommon) - since there’d be a high cost both individually and societally to consider it a big deal if it’s common.
If you consider any plausible combination of properties to be evidence for correct generalization, then of course you’ll think there’s been correct generalization—but it’s an almost empty claim, since it rules out almost nothing.
Most people tend to act in ways that preserve/increase their influence, power, autonomy and relationships, since this is useful almost regardless of their values. This is not evidence of correct generalization—it’s evidence that these behaviours are instrumentally useful within the environment ([not killing people] being one example).
To get evidence of something like ‘correct’ generalization, you’d want to look at circumstances where people get to act however they want without the prospect of any significant negative consequence being imposed on them from outside.
Such circumstances are rarely documented (documentation being a potential source of negative consequences). However, I’m going to go out on a limb and claim that people are not reliably lovely in such situations. (though there’s some risk of sampling bias here: it usually takes conscious effort to arrange for there to be no consequences for significant actions, meaning there’s a selection effect for people/systems that wish to be in situations without consequences)
I do think it’d be interesting to get data on [what do humans do when there are truly no lasting consequences imposed externally], but that’s very rare.
I did try to provide a casual story for why humans could be aligned to some value without relying on societal incentives that much, so you can check out the second part of my comment.
My non-tautological claim is that the reason isn’t behavioral, but instead internal, and in particular the innate reward system plays a big role here.
In essence, my story about how humans come to be aligned with the values of the innate reward system didn’t rely on a behavioral property.
I’ll reproduce it, so that you can focus on the fact that it didn’t rely on behavioral analysis:
There is a weak prior in the genome for things like not taking power to kill people in your ingroup, and the prior is weak enough that it works like a wildcard: aligning it to some other value would more or less work just as well.
The brain’s innate reward system uses something like DPO, RLHF, or whatever else is used to create a preference model, to guide the intelligence into being aligned with whatever values the innate reward system wants, say empathy for the ingroup (though this is only a motivating example).
It uses backprop or a weaker variant of it, and at a high level probably runs an optimizer that is at best comparable to gradient descent. Since it has white-box access and can update the brain in a targeted way, it can efficiently compute the direction that improves its performance on, say, empathy for the ingroup; but again, this is a wildcard in that it could stand in for almost any value.
The loop of weak prior + innate reward system + an algorithm to implement it (backprop or its weaker variants) means that eventually, by age 25 or so, the human is closely aligned with the values the innate reward system put in place, like empathy for the ingroup; though again, this is only an example of an alignment target, and you could put almost arbitrary alignment targets in there.
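To make the shape of this loop concrete, here’s a minimal toy sketch. Everything in it (the quadratic reward model, the analytic gradient, the “empathy” target vector) is a stand-in of my own choosing rather than a claim about the brain’s actual algorithm; the point is just that the same loop would align the agent to whatever target you substitute.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Weak prior": the agent starts with small, nearly unstructured weights.
weights = rng.normal(scale=0.01, size=4)

# Stand-in "innate reward system": a fixed preference model scoring the
# agent by how close its weights are to some target value vector.
# 'empathy_target' is a wildcard: swap in almost any vector and the same
# loop aligns the agent to that instead.
empathy_target = np.array([1.0, 0.5, -0.5, 0.0])

def reward(w: np.ndarray) -> float:
    # Higher reward the closer the weights are to the target.
    return -float(np.sum((w - empathy_target) ** 2))

def reward_grad(w: np.ndarray) -> np.ndarray:
    # White-box, targeted updates: the gradient of the reward with respect
    # to the weights is available directly (here analytically), loosely
    # analogous to backprop / gradient descent.
    return -2.0 * (w - empathy_target)

learning_rate = 0.05
for _ in range(500):
    weights += learning_rate * reward_grad(weights)

print("final weights:", np.round(weights, 3))
print("target:       ", empathy_target)
print("final reward: ", round(reward(weights), 6))
# After enough updates the weights track whatever target the reward system
# encodes, which is the point: the target itself is arbitrary.
```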
Critically, this story makes very little reference to society or behavioral analysis, so I wasn’t making the mistake you said I made.
It is also no longer a tautology, as it depends on the innate reward system actually rewarding desired behavior by changing the brain’s weights, and removing the innate reward system or showing that the weak prior + value learning strategy was ineffective would break my thesis.
This still seems like the same error: what evidence do we have that tells us the “values the innate reward system put in place”? We have behaviour.
We don’t know that [system aimed for x and got x].
We know only [there’s a system that tends to produce x].
We don’t know the “values of the innate reward system”.
The reason I’m (thus far) uninterested in a story about the mechanism is that there’s nothing interesting to explain. You only get something interesting if you assume your conclusion: if you assume without justification that the reward system was aiming for x and got x, you might find it interesting to consider how that’s achieved, but this doesn’t give you evidence for the assumption you used to motivate the story in the first place.
In particular, I find it implausible that there’s a system that does aim for x and get x (unless the ‘system’ is the entire environment):
If there are environmental regularities that tend to give you elements of x without your needing to encode them explicitly, those regularities will tend to be ‘used’ - since you get them for free. There’s no selection pressure to encode or preserve those elements of x.
If you want to sail quickly, you take advantage of the currents.
So I don’t think there’s any reasonable sense in which there’s a target being hit.
If a magician has me select a card, looks at it, then tells me that’s exactly the card they were aiming for me to pick, I’m not going to spend energy working out how the ‘trick’ worked.
It sounds like we’ve reached the crux of my optimism: you think that for a system to aim for x it essentially needs to be an entire environment, and that the environment largely dictates human values, whereas I think human values are less dependent on the environment and far more dependent on the genome + learning process. Equivalently, I place much more emphasis on humans’ internals as the main contributor to values, while you emphasize the external environment far more than internals like the genome or learning process.
This could be disentangled into two cruxes:
Where are human values generated?
How cheap is it to specify values, or alternatively, how weak do our priors need to be to encode values (if values are encoded internally)?
My answers would be: mostly internal (the genome + learning process, with a little help from the environment) on the first question, and relatively cheap to specify values on the second. You’d probably answer that the environment basically sets the values, with little or no help from the internals of humans, on the first question, and that it’s very expensive to specify values on the second.
For some of my reasoning on this, I’d point to posts like these:
https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for
(Basically argues that the critic in the brain generates the values)
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome
(The genomic prior can’t be strong, because it has massive limitations in what it can encode).
The central crux really isn’t where values are generated. That’s a more or less trivial aside. (though my claim was simply that it’s implausible the values aimed for would be entirely determined by genome + learning process; that’s a very weak claim; 98% determined is [not entirely determined])
The crux is the tautology issue: I’m saying there’s nothing to explain, since the source of information we have on [what values are being “aimed for”] is human behaviour, and the source of information we have on [what values are achieved] is human behaviour.
These things must agree with one another: the learning process that produced human values produces human values. From an alignment-difficulty perspective, that’s enough to conclude that there’s nothing to learn here.
An argument of the form [f(x) == f(x), therefore y] is invalid.
f(x) might be interesting for other reasons, but that does nothing to rescue the argument.
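As a toy illustration of why that form is invalid (the candidate ‘mechanisms’ below are invented purely to show the logical point, not hypotheses either of us holds): every candidate passes the check, because the check compares a quantity with itself.

```python
# Toy illustration of why [f(x) == f(x), therefore y] is invalid.
# Whatever mechanism we hypothesise, both "values aimed for" and
# "values achieved" are inferred from the same behaviour, so the
# comparison is f(x) == f(x) and is true for every hypothesis.

candidate_mechanisms = {
    "reward system aiming for empathy": "behaviour produced under mechanism A",
    "reward system aiming for status":  "behaviour produced under mechanism B",
    "no coherent target at all":        "behaviour produced under mechanism C",
}

def inferred_values(behaviour: str) -> str:
    # Our only evidence about values is behaviour.
    return f"values read off from <{behaviour}>"

for name, behaviour in candidate_mechanisms.items():
    values_aimed_for = inferred_values(behaviour)  # inferred from behaviour
    values_achieved = inferred_values(behaviour)   # also inferred from behaviour
    print(f"{name}: check passes = {values_aimed_for == values_achieved}")

# All three hypotheses print True, so passing the check provides zero
# evidence for any particular story about the mechanism.
```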
That’s our disagreement: I think we have more information than that. I agree human behavior plays a role in my evidence base, but it isn’t the only evidence I have.
In particular I am using results from both ML/AI and human brain studies to inform my conclusion.
Basically, my claim is that [f(x) == f(y), therefore z].
But humans are capable of thinking about what their values “actually should be”, including whether or not they should be the values evolution selected for (either alone or in addition to other things). We’re also capable of thinking about whether things like wireheading are actually good to do, even after trying them for a bit.
We don’t simply commit to tricking our reward systems forever and only doing that, for example.
So that overall suggests a level of coherency and consistency in the “coherent extrapolated volition” sense. Evolution enabled CEV without us becoming completely orthogonal to evolution, for example.
A few points here:
We don’t have the option to “trick our reward systems forever”—e.g. because becoming a heroin addict tends to be self-destructive. If [guaranteed 80-year continuous heroin high followed by painless death] were an option, many people would take it (though not all).
The divergence between stated preferences and revealed preferences is exactly what we’d expect to see in worlds where we’re constantly “tricking our reward system” in small ways: our revealed preferences are not what we think they “actually should be”.
We tend to define large ways of tricking our reward systems as those that are highly self-destructive. It’s not surprising that we tend to observe few of these, since evolution tends to frown upon highly self-destructive behaviour.
Again, I’d ask for an example of a world plausibly reachable through an evolutionary process where we don’t have the kind of coherence and consistency you’re talking about.
Being completely orthogonal to evolution clearly isn’t plausible, since we wouldn’t be here (I note that when I don’t care about x, I sacrifice x to get what I do care about—I don’t take actions that are neutral with respect to x).
Being not-entirely-in-line with evolution, and not-entirely-in-line with our stated preferences is exactly what we observe.