Question: “What is the most significant way you have changed your mind in the last year?”
Weirdly, I’ve become a lot more optimistic about alignment in the past two weeks.
It’s pretty clear that human values are an inner alignment failure wrt both evolution and the reward circuitry evolution gave us (we’re not optimising for inclusive genetic fitness, and we don’t wirehead, after all). The thing people often take away from connecting human values to inner alignment failure is that we should be very suspicious and worried about inner alignment. After all, who knows what the system could end up optimizing for after being trained on a “human values” reward function?
However, I think there’s a slightly different perspective on inner alignment failure which goes like:
1: Human values derive from an inner alignment failure.
2: Humans are the only systems that instantiate (any version of) human values.
—> Inner alignment failure is the only process in the known universe to ever generate human values.
Under this perspective, the question to ask about inner alignment failure isn’t “How do we protect efforts to align an AI with human values, from the only process ever known to generate human values?”, but instead something like “How do we induce an inner alignment failure that’s likely to lead to human-compatible values?”.
(I apologise for my somewhat sarcastic characterisation of inner alignment concerns; the point is to rhetorically emphasise a perspective in which we do not immediately assign negative valence to the mere possibility of an inner alignment failure. I still think concerns about inner alignment remain, but I now focus on understanding and influencing outcomes, rather than on avoiding occurrence.)
I’ve been thinking a lot about that latter question, and it’s starting to seem like a lot of the weirder or more “fragile”-seeming aspects of human values actually emerge pretty naturally from an inner alignment failure in a brain-like learning system. There may be broad categories of capabilities-competitive architectures that acquire / retain / generalise values in a surprisingly human-like manner.
I hope to have a full post about these results soon, but I can give an example of one such human-like tendency that seems to emerge from this framing:
Consider how we want a diverse future, and one that retains many elements of the current era. Our ideal future doesn’t look like tiling the universe with some hyper-optimized instantiation of “human values”. If an AGI from the future told you something like “the human mind and body were not optimal for maximising human values-related cognition obtained against resources expended, and therefore have been entirely replaced with systems that more efficiently implemented values-related cognition”, you’d not be entirely pleased with that outcome, even if the cognition performed really did have value.
This instinct is quite contrary to how the optima of most utility functions or values look. The standard alignment reasoning wrt this oddity is (as I understand it, anyway) to say something like “Evolution gave humans values that are quite complex / “unnatural” among utility functions. It’s very difficult to specify or learn a utility function whose optimum retains desirable-to-humans diversity.”
However, a desire to perpetuate (at least some of) our current diversity into the future emerges quite naturally if you view human values as emerging from an ongoing inner alignment failure. In this view, the cognitive patterns / brain circuitry that process information about our current world are self-perpetuating. I.e., your circuits want to be used, and therefore, retained. They’ll influence your future actions and desires to improve their odds of being used and retained.
The circuits that process information about dogs, for example, only exist in your brain because you needed to process information about dogs. They can only be certain of continued existence if the world still contains dogs for you to process information about. Thus, they would object to a dog-less future, even if dogs were replaced by something “more optimal”. The same reasoning applies to circuits that perform any other aspect of your cognition. In other words, the “values from inner alignment failure” perspective naturally leads to us having preferences over the distribution of cognition we’d be able to perform in the future, and gives a reason why that cognition should (at least somewhat) resemble the cognition we currently use.
This perspective also explains why we can learn to value more things as we interact with them (very few utility maximisers would do something like that naturally). We need to form new circuits to be able to process info about new things, and those new circuits would also have a say in our consensus. Learning systems with ongoing inner alignment failures might naturally accumulate values as they interact with the world, possibly allowing alignment to a moving target.
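To make this concrete, here’s a minimal toy sketch of the picture (the class, the observation categories, and the scoring rule are all invented for illustration; this is not a claim about actual brain circuitry):

```python
import random
from collections import defaultdict

class ToyCircuitAgent:
    """Toy sketch: 'circuits' are keyed by the kind of thing they process.
    A circuit's weight grows whenever it gets used, and candidate futures
    are scored by how much use they offer the circuits the agent already has."""

    def __init__(self):
        self.circuit_weights = defaultdict(float)  # thing -> accumulated weight

    def observe(self, thing):
        # Processing a thing creates/strengthens the circuit for it.
        self.circuit_weights[thing] += 1.0

    def score_future(self, future_contents):
        # Each circuit "votes" for futures in which it would still get used.
        return sum(weight for thing, weight in self.circuit_weights.items()
                   if thing in future_contents)

agent = ToyCircuitAgent()
for _ in range(200):
    agent.observe(random.choice(["dogs", "music", "forests", "friends"]))

# A future that keeps the current variety beats a "more optimal" monoculture,
# because most existing circuits would never get used in the monoculture.
diverse_future = {"dogs", "music", "forests", "friends"}
tiled_future = {"hyper-optimized value-stuff"}
print(agent.score_future(diverse_future))  # high: every circuit still gets used
print(agent.score_future(tiled_future))    # 0.0: existing circuits are abandoned

# Interacting with something new grows a circuit for it, so the agent
# "learns to value" keeping it around.
agent.observe("octopuses")
print(agent.score_future(diverse_future | {"octopuses"}))
```

The point of the toy is just that a learning system whose parts are reinforced by being used will, by default, prefer futures in which the things those parts process still exist, and will pick up new stakes as it encounters new things.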
Of course, single circuits don’t have unlimited control over our values, so it’s possible to want a future that entirely lacks things that conflict sufficiently with our other values. The overall point is that it may be surprisingly easy to build a learning framework that “skews towards diversity”, so to speak, in a way that expected utility maximization really doesn’t. I’ve also had similarly interesting results in things like our intuitions wrt moral philosophy, wireheading, and the adoption of deep vs shallow patterns.
Overall, I’ve updated away from “evolution gave us lots of values-related special sauce, good luck figuring it all out” and more towards “evolution gave us a pretty simple value-learning and weighing mechanism, whose essential elements may not be that hard to replicate in an AI.”
Basically, the core thought process that led me to this update was to think carefully about what the concept of inner alignment failure meant when combined with the multi-agent theory of the mind.
I’m very suspicious of:
Inner alignment failure is the only process in the known universe to ever generate human values
as a jumping-off point, since inner alignment failure did not hit a pre-defined target of human values. It just happened to produce them. If a gun can fire one bullet, I’ll expect it can fire a second. I won’t expect the second bullet to hit the first.
On the rest, it strikes me that:
Game theory keeps human values ‘good’ in largely circular fashion: we’ll tend to think that whatever is working is ‘good’, since it helps us to think that. This should give us confidence neither in future human values, nor in AI values. (e.g. future humans would learn to prefer uniformity, if the game theory favoured it)
I don’t think this is quite right: “This instinct is quite contrary to how the optima of most utility functions or values look”. It’s contrary to how the optima of simple utility functions we can easily specify look. Most complex utility functions will produce worlds containing complex patterns. Most of those worlds will still be essentially worthless from a human perspective, since we care about a tiny proportion of patterns. I don’t think it’s hard to get an amount of diversity humans would appreciate; I think it’s hard to get the types of diversity humans would appreciate.
I think I buy the rest of your argument in terms of [It won’t be too hard to produce an AI that’ll create an interesting world], but only in the sense that it’d be a world that’s interesting to investigate as an object of study (dynamic, varied, complex, hard to predict...). I don’t think many people imagine the trivially simple worthless failure modes (paperclips, tiling-smiley-faces...), but rather worlds containing a load of complex patterns which are nonetheless ~worthless from even our most enlightened perspective. (though it’s also plausible for things to collapse into a dull attractor)
Inner alignment failure is the only process in the known universe to ever generate human values
as a jumping-off point, since inner alignment failure did not hit a pre-defined target of human values. It just happened to produce them. If a gun can fire one bullet, I’ll expect it can fire a second. I won’t expect the second bullet to hit the first.
I think there are several elided considerations here. I think OP might be ambiguous with respect to whether “evolution → human values” alignment failure is being considered, when the real relevant alignment failure [EDIT: for this sentence] is “human reward system → human values.” I agree that most “bullets” fired by evolution will not hit human values. I think the latter scenario is much more interesting, however, and I think it takes more time to step through.
Sure, that makes sense.
Some thoughts:
By default, AI systems won’t be subject to anything like the environment and pressures that shaped humans and human values. We could aim to create (something analogous to) it, but it’s anything but straightforward. How fragile is the process for humans? Which aspects can be safely simplified/skipped, and how would we know?
It occurs to me that I’m not sure whether you mean [human rewards in evolution] or [rewards for individual learning humans], or both? I’m assuming the evolutionary version, since I’m not clear what inner alignment failure would mean for an individual (what defines the intended goal/behaviour?).
If we could run a similar process for some x we’re training, then we would expect to get [xs care about xs], not [xs care about humans]. Granted that may not waste the future, but it’s a humans-as-pets future if we’re very lucky. (philosophically, not wasting the future is far more important—but I’m rather attached to humanity)
It’s not clear to me how close we’d need to get to x-has-human-values before we’d think an x-dominated world would be worthwhile (even ignoring attachment to humanity).
I think I’d worry that the sets of values that do well under human-evolution/learning conditions is too broad (for a good-according-to-non-selfish-us outcome to be likely). I.e. that re-rolling values under similar evolutionary pressures can give you various value-sets that each achieve similar fitness (or even similar behaviour) but where maximizing utility according to one gets you very low utility according to the others.
Perhaps more fundamental: humans shape their own environment (both in evolution and individual learning). If we start out with similar conditions, divergence will compound. This makes me less confident that a re-roll ends well.
Perhaps the same applies to our future already—but I think that’s an argument for conscious effort to guide future values.
I wonder how viable/instructive it might be to test this kind of thing in a toy model. I.e. you run some toy evolutionary environment twice, and check how much run-1 denizens approve of the run-2 world.
I can’t see this working at present, but I’m not sure what that tells us. Are the silly-non-answers, unsatisfied prerequisites and type errors I’d expect in a toy model artefacts of the toy setup, or reflective of fundamental issues? It’s not immediately clear to me.
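For concreteness, the kind of re-roll I have in mind might look something like the minimal sketch below (every detail—the genome encoding, the fitness rule, the “approval” measure—is invented for illustration, and it’s exactly the sort of setup I’d expect to produce silly non-answers):

```python
import random

N_FEATURES = 6        # abstract "kinds of stuff" a world can contain
POP, GENS = 40, 150

def evolve(seed):
    """Toy evolution: genomes are value vectors over world features.
    The selection rule has the same shape in every run, but each run's
    environment details differ, so similar pressures plus a different
    seed can still select noticeably different value-sets."""
    rng = random.Random(seed)
    exploitable = [rng.random() for _ in range(N_FEATURES)]  # this run's environment
    pop = [[rng.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=lambda g: sum(v * e for v, e in zip(g, exploitable)),
                 reverse=True)
        survivors = pop[:POP // 2]
        pop = survivors + [[v + rng.gauss(0, 0.1) for v in genome]
                           for genome in survivors]           # mutated copies
    champion = pop[0]
    # The "world" the champions build: resources allocated in proportion to
    # how much they value each feature (features they dislike get nothing).
    total = sum(max(v, 0.0) for v in champion) or 1.0
    world = [max(v, 0.0) / total for v in champion]
    return champion, world

def approval(values, world):
    # How much a run-1 denizen likes a given world, by its own values.
    return sum(v * w for v, w in zip(values, world))

values_1, world_1 = evolve(seed=1)
values_2, world_2 = evolve(seed=2)
print("run-1 approval of its own world:", round(approval(values_1, world_1), 3))
print("run-1 approval of run-2's world:", round(approval(values_1, world_2), 3))
```

(Whether numbers like these would tell us anything about the real question is exactly what I’m unsure of.)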
I think I’d worry that the sets of values that do well under human-evolution/learning conditions is too broad (for a good-according-to-non-selfish-us outcome to be likely). I.e. that re-rolling values under similar evolutionary pressures can give you various value-sets that each achieve similar fitness (or even similar behaviour) but where maximizing utility according to one gets you very low utility according to the others.
Important clarification: Neither Quintin nor I are proposing to mimic evolution in order to hopefully (fingers crossed!) miraculously get human values out the other side. Based on an understanding of how inner alignment works (or doesn’t), Quintin is proposing a gears-level model of what human values are and how they form; the model in turn suggests a relatively simple procedure for recreating the important part of the process within an AI. The aim is to grow human values within an AI, not via some hacky solution which is too complicated to shoot down, but via a gears-level theory of what human values are. No outer selection pressures on evolving AIs or anything like that.
(The reason I know so much about Quintin’s proposal is that I’ve read and written several private docs about the theory.)
By default, AI systems won’t be subject to anything like the environment and pressures that shaped humans and human values. We could aim to create (something analogous to) it, but it’s anything but straightforward. How fragile is the process for humans? Which aspects can be safely simplified/skipped, and how would we know?
Not a full-length explanation, but some thoughts:
I currently think the process is not that fragile. By contrast, consider another (perhaps “classic”) model of alignment. In this model, the “human objective” is an extremely complicated utility function, and we need to get it just right or the future will be ruined. This model has always seemed “off” to me, but I hadn’t been able to put my finger on why.
Quintin’s theory says that the seeming complexity of human values is actually the result of the multiagent bargains struck by subagentic circuits of varying sophistication in the brain, which (explicitly or implicitly) care about different things. Instead of one highly complicated object (“the utility function”) which is sensitive to misspecification, human values are just the multiagent behavior of a set of relatively simple circuits in the brain, where the alignment desirability is somewhat robust to the “bargaining strengths” of those parts.
For example, consider a modified version of yourself who grew up with swapped internal reward for “scuttling spiders” and “small fluffy animals.” I think you’d get along mostly fine, and be able to strike bargains like “this part of the galaxy will have bunnies, this other part will have spiders” without either of you wanting to tile the galaxy with representations of your “utility function.”
And under that view, we do not have to mimic the entire training process just because we don’t know what mattered and what didn’t. Quintin’s theory is, in effect, making a claim about what matters: The substance of human values is the multiagent dynamics and relative bargaining strengths of the different parts, and the fact that these parts generally act to preserve their implementation in the brain (prevent value drift) by steering the future observations of the human itself. In a world where we actually get a correct theory of human values, that theory would tell you which parts are important and which parts can be left out. (There is, of course, still the question of how we would know the theory is right. The above does not answer this question.)
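As a cartoon of the robustness claim (my own toy sketch, with made-up bargaining strengths, not the actual theory): compare handing the future to a single “grand utility function” with letting simple sub-values split it in proportion to their bargaining strength.

```python
def tile_by_argmax(weights):
    """'Grand utility function' caricature: hand everything to whichever
    sub-value has the largest weight."""
    winner = max(weights, key=weights.get)
    return {k: (1.0 if k == winner else 0.0) for k in weights}

def bargain(weights):
    """Multiagent caricature: sub-values split the galaxy in proportion
    to their bargaining strength."""
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Hypothetical bargaining strengths of a few simple circuits.
me      = {"bunnies": 0.32, "spiders": 0.03, "friendship": 0.40, "novelty": 0.25}
swapped = {"bunnies": 0.03, "spiders": 0.32, "friendship": 0.40, "novelty": 0.25}

print(tile_by_argmax(me))  # everything becomes friendship-stuff
print(bargain(me))         # a mixed world
print(bargain(swapped))    # still a mixed world; we could trade regions
```

Perturbing or swapping the strengths moves the split around without collapsing it into tiling by whichever sub-value happens to be strongest.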
It occurs to me that I’m not sure whether you mean [human rewards in evolution] or [rewards for individual learning humans], or both? I’m assuming the evolutionary version, since I’m not clear what inner alignment failure would mean for an individual (what defines the intended goal/behaviour?).
I don’t know what you mean by “human rewards in evolution.” For my part, I’m talking about the reward signals provided by the steering system in a person’s brain. Although some people are hedonists, many are not, and thus they are unaligned with their reward system. If you don’t want to wirehead, you are not trying to optimize the objective encoded by the steering system in your own brain, and that’s an inner alignment failure with respect to that system. So something else must be steering your decision-making.
Thanks for this. I hope to have thoughts at some point, but first need to think about it more carefully.
One immediate response—since I already know what I think on this bit (it’s not clear to me that this implies any significant object-level disagreement—it may just amount to my saying “those are weird words to use”):
For my part, I’m talking about the reward signals provided by the steering system in a person’s brain. Although some people are hedonists, many are not, and thus they are unaligned with their reward system.
This seems too narrow a concept of what reward is (e.g. hedonism == aligned-with-reward-system). There isn’t an objective human reward signal that mirrors an RL agent’s reward.
We get a load of input, have a bunch of impressions, feelings and thoughts, and take some actions. Labelling some simple part of that as the reward strikes me as silly (“a reward”, sure). What could be the justification? If we’re clearly not maximising it, nor learning to maximise it (nor trying to...), in what sense is it analogous to RL reward?
The reasonable move seems to be to say “Oops, I was wrong to label that as ‘the reward’, there’s no direct parallel here”, and not “there’s an inner misalignment”.
I’d note that evolution will have implicitly accounted for any previous “misalignment” in shaping our current reward signals: it will have selected for the reward signals that tended to increase fitness given our actual responses to those signals, not the signals that would have increased fitness if we had followed some maximisation process.
Our reward signals weren’t ‘designed’ to be maximised, only to work (to increase fitness).
So it still seems strange to talk about misalignment w.r.t. an objective nothing and nobody was aiming for (even implicitly). It’d seem more useful if there were some crisp and clear mechanistic notion of what counted as human reward and what didn’t; I don’t think that’s true (is anyone claiming this?).
I think I have failed to communicate my main point, if these are among your objections. I am not faulting you, but I want you to know that that’s my perception, and keep that in mind as you evaluate these ideas.
I think I’d want to start over and try from a different tack, if I were going to resolve disagreements here. But best to save that for future posts, I think.
There isn’t an objective human reward signal that mirrors an RL agent’s reward.
We get a load of input, have a bunch of impressions, feelings and thoughts, and take some actions.
You’re the second person to confidently have this reaction, and I’m pretty confused why. Here’s a Wikipedia article on the human reward system, and here’s one of Steve Byrnes’s posts on the topic. I’m not an expert, but it seems pretty clear that the brain implements some feedback signals beyond self-supervised predictive learning on sensory errors. Those signals comprise the outer criterion, in this argument.
I agree that reward is not literally implemented in the brain as a scalar reward function. But it doesn’t have to be. The brain implements an outer criterion which evaluates and reinforces behavior/predictions and incentivizes some plans over others along different dimensions.
It’s immaterial whether that’s a simple scalar or a bunch of subsystems with different feedback dimensions—the same inner misalignment arguments apply. Otherwise we could solve inner misalignment by simply avoiding scalar outer criteria; this is absurd.
(Let me know if I’ve misunderstood what you were getting at.)
I’d note that evolution will have implicitly accounted for any previous “misalignment” in shaping our current reward signals: it will have selected for the reward signals that tended to increase fitness given our actual responses to those signals, not the signals that would have increased fitness if we had followed some maximisation process.
Our reward signals weren’t ‘designed’ to be maximised, only to work (to increase fitness).
This is indeed part of my argument, but doesn’t seem related to what I was trying to say.
It’d seem more useful if there were some crisp and clear mechanistic notion of what counted as human reward and what didn’t; I don’t think that’s true (is anyone claiming this?).
There’s an outer criterion by which behavior is graded / feedback is given. A mesa optimizer might be trained (by the usual arguments) which optimizes an inner objective that is not the same as the outer criterion. We don’t need a crisp and clear mechanistic notion of what counts as human reward for this argument to work.
[EDIT: see my response to this comment; this one is at least mildly confused]
[Again, I want to flag that this line of thinking/disagreement is not the most interesting part of what you/Quintin are saying overall—the other stuff I intend to think more about; nonetheless, I do think it’s important to get to the bottom of the disagreement here, in case anything more interesting hinges upon it]
[JC: There isn’t an objective human reward signal that mirrors an RL agent’s reward.]
You’re the second person to confidently have this reaction, and I’m pretty confused why.
My objection here is all in the “...that mirrors an RL agent’s reward”—that’s where the parallel doesn’t work in my view. An RL agent is trained to maximize total (discounted) reward. The brain isn’t maximizing total reward, nor trying to maximize total reward, nor is evolution acting on the basis that it’ll do either of these things.
I agree with the following:
The brain implements an outer criterion which evaluates and reinforces behavior/predictions and incentivizes some plans over others along different dimensions.
I just don’t think this tells us anything useful, since this criterion clearly is not maximisation of total discounted reward. (though I would expect some correlation)
It seems to me that the criterion is more like maximisation of in-the-moment reward (I’m using ‘reward’ here very broadly). I.e. I might work rather than have fun since the thought of working happened to be more ‘rewarding’ than the thought of having fun. (similarly, I might not wirehead, since the thought of wireheading is negative)
This seems essentially vacuous, because I don’t see a way to measure itm-reward better than: if I did x rather than y, then x was more itm-rewarding than y. (to be clear, I’m saying this is not useful—but that I don’t see a principled definition of itm-reward that doesn’t amount to this; this is where a “crisp and clear mechanistic notion of what counted as human reward” would be handy—in order to come up with a non-vacuous definition)
Perhaps it’s clearer if I back up to your previous post and state a crisper disagreement:
If you don’t want to wirehead, you are not trying to optimize the objective encoded by the steering system in your own brain, and that’s an inner alignment failure with respect to that system.
This just seems wrong to me. The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce].
In an RL system these two are similar, precisely because the RL system is designed to steer towards outcomes with high total discounted reward according to its own metric.
In general, steering systems are not like this. The criterion for picking one plan over another can be [expected total reward] or [something entirely different].
Where a system doesn’t use [expected total reward] it seems just plain silly to me to call behaviour misaligned where it doesn’t match [what the system would incentivize if it did use expected total reward]. Of course it doesn’t match, since that’s not how this steering system works.
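To pin down the distinction I’m pointing at, a minimal sketch (the numbers are invented; nothing here is meant to be a model of the brain): the same two candidate plans get chosen differently depending on whether the criterion is expected total discounted reward or the in-the-moment valence of merely thinking about the plan.

```python
# Candidate plans, with (made-up) per-step rewards the plan would deliver
# and a (made-up) in-the-moment valence of merely *thinking about* the plan.
plans = {
    "wirehead": {"rewards": [10, 10, 10, 10, 10], "valence_of_thought": -3.0},
    "work":     {"rewards": [1, 2, 2, 3, 3],      "valence_of_thought": +1.5},
}

def pick_by_expected_return(plans, gamma=0.95):
    """RL-style criterion: maximise total discounted reward."""
    def ret(plan):
        return sum(r * gamma**t for t, r in enumerate(plan["rewards"]))
    return max(plans, key=lambda name: ret(plans[name]))

def pick_by_momentary_valence(plans):
    """Steering-system caricature: go with whichever plan feels best to
    contemplate right now, whatever its long-run reward total."""
    return max(plans, key=lambda name: plans[name]["valence_of_thought"])

print(pick_by_expected_return(plans))    # "wirehead"
print(pick_by_momentary_valence(plans))  # "work"
```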
In this context, I mean the “steering system” to refer to the genetically hardcoded reward circuitry which provides intrinsic rewards when certain hardcoded preconditions are met. It isn’t learned. Maybe that’s part of the confusion?
An RL agent is trained to maximize total (discounted) reward. The brain isn’t maximizing total reward, nor trying to maximize total reward, nor is evolution acting on the basis that it’ll do either of these things.
An RL agent is reinforced for maximizing reward, but unless it has already fulfilled the prophecy of a convergence guarantee or unless it’s doing model-based brute-force planning to maximize reward over its time horizon, the RL agent is not actually maximizing reward, nor is it necessarily trying to maximize total reward.
The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce].
I don’t understand why you hold this view. We probably are talking past each other?
EG if I just have a crude sugar reward circuit in my brain which activates when I am hungry and my taste buds signal the brain in the right way, and then I learn to like licking real-world lollipops (because that’s the only way I was able to stimulate the circuit on training when my values were forming), then the objective encoded by the reward circuit is… lollipop-licking in real life? But also, if I had only been exposed to chocolate on training, I would have learned to like eating chocolate. But also, if I had only been exposed to electrical taste bud stimulation on training, I would have learned to like electrical stimulation.
IMO the objective encoded by the reward circuit is the maximization of its own activations, that’s the optimal policy.
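Here’s a toy version of that story in code (a sketch with invented details, not a model of real reward circuitry): a hardcoded “sugar circuit” reinforces whatever actions happened to activate it during value formation, so the learned preference attaches to lollipops, even though the policy that’s optimal by the circuit’s own lights is direct stimulation.

```python
import random

# Hardcoded "reward circuit": fires in proportion to the taste-bud sugar signal.
CIRCUIT_ACTIVATION = {
    "lick_lollipop": 0.8,
    "eat_broccoli": 0.05,
    "stimulate_taste_buds_electrically": 1.0,  # the circuit's own optimum
}

def form_values(available_actions, episodes=500, lr=0.1):
    """Learned values: reinforce whichever actions actually fired the circuit
    during the formative period. Only actions available then get any value."""
    values = {a: 0.0 for a in available_actions}
    for _ in range(episodes):
        action = random.choice(available_actions)        # explore
        reward = CIRCUIT_ACTIVATION[action]               # circuit fires (or not)
        values[action] += lr * (reward - values[action])  # running reinforcement
    return values

# During value formation, only mundane options were on the menu.
learned = form_values(["lick_lollipop", "eat_broccoli"])
print(learned)  # lollipop-licking ends up valued; broccoli doesn't

# The policy that maximises the circuit's activations would pick electrical
# stimulation, but the learned values never formed around that option.
print(max(CIRCUIT_ACTIVATION, key=CIRCUIT_ACTIVATION.get))  # circuit's optimum
print(max(learned, key=learned.get))                        # what got learned
```

Swap chocolate in for lollipops during the formative period and the learned value swaps with it; the circuit’s encoded optimum doesn’t.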
Anyways, I think it would just make more sense for me to link you to a Gdoc explaining my views. PM’d.
Ok, putting my [maybe I’m missing the point] hat on, it strikes me that the above is considering the learned steering system—which is the outcome of any misalignment. So I probably am missing your point there (I think?). Oops.
However, I still think I’d stick to saying that:
The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce]
But here I’d need to invoke properties of the original steering system (ignoring the handwaviness of what that means for now), rather than the learned steering system.
I think what matters at that point is sampling of trajectories (perhaps not only this—but at least this). There’s no mechanism in humans to sample in such a way that we’d expect maximisation of reward to be learned in the limit. Neither would we expect one, since evolution doesn’t ‘care’ about reward maximisation.
Absent such a sampling mechanism, the objective encoded isn’t likely to be maximisation of the reward.
To talk about inner misalignment, I think we need to be able to say something like:
1: Under [learning conditions], we expect system x to maximise y in the limit.
2: System x does not robustly learn to pursue y (rather than a proxy for y), so that under [different conditions] x no longer maximises y.
Here I don’t think we have (1), since we don’t expect the human system to learn to maximise reward (or minimise regret, or...) in the limit (i.e. this is not the objective encoded by their original steering system).
Anyway, hopefully it’s now clear where I’m coming from—even if I am confused!
My guess is that this doesn’t matter much to your/Quintin’s broader points(?) - beyond that “inner alignment failure” may not be the best description.
This is one of the most intriguing optimistic outlooks I’ve read here in a long time. Looking forward to your full post!