To begin to understand how human values actually work, it is important first to understand what they are not. Specifically, human values are not utility functions, and humans are not utility function maximizers. This is quite obvious if we observe how humans act in reality, which differs strongly from what a utility-maximizing model would predict. In particular:
1.) Humans don’t seem to very strongly optimize for anything, except perhaps the fulfillment of basic drives (food, water, etc.).
2.) Humans often do not know what exactly they want out of life. This kind of existential uncertainty is not something a utility maximizer ever faces.
3.) Human values are often contradictory and situationally dependent in practice.
4.) Humans often act against their professed values in a wide variety of circumstances.
5.) Humans often change their values (sometimes dramatically) upon receiving new data, whether in the form of conversations and dialogue with other people, social pressure, assimilation into a culture, or simply reading and absorbing new worldviews.
6.) Most widely held philosophies of values and ethics do not cash out into consequences at all. Consequentialism and utilitarianism are highly artificial doctrines that took thousands of years for humans to invent, are challenging for almost everyone to viscerally feel, and are almost never implemented in practice by real humans [2].
Most of these claims seem plausibly true of average humans today, but false about smarter (and more reflective) humans now and in the future.
On the first point, most of the mundane things that humans do involve what looks to me like pretty strong optimization; it’s just that the things they optimize for are nice-looking, normal (but often complicated) human things. Examples of people explicitly applying strong optimization in various domains: startup founders, professional athletes, AI capabilities researchers, AI alignment researchers, dating.
I believe you yourself are making huge efforts and using lots of cognitive power for the purposes of steering the future of humanity where you’d like it to go, rather than where it seems to be on track to go by default.
My own view is that the best future of humanity involves pretty drastic re-arrangements of most of the atoms in the lightcone. Maybe you think I’m personally not likely to succeed or work very hard at actually doing this, but if I only knew more, thought faster, had more time and energy… I think it becomes apparent pretty quickly where that ends up.
If that isn’t “strongly optimizing”, I’m not sure what is.
(I still care about satisfying more basic drives on the way; keeping my System 1 happy and well-fed. But this feels more like a constraint on the optimization problem, and a fact about what it is I’m actually optimizing for, rather than something deeper like “I’m actually a satisficer, not strongly optimizing for anything”.)
I think the idea of Coherent Extrapolated Volition captures pretty crisply what it is that I (and many others) are optimizing for. My CEV is complicated, and there might be contradictions and unknown parts of it within me, but it sure doesn’t feel situationally dependent or unknown on a meta-level.
Most of these claims seem plausibly true of average humans today, but false about smarter (and more reflective) humans now and in the future.
On the first point, most of the mundane things that humans do involve what looks to me like pretty strong optimization; it’s just that the things they optimize for are nice-looking, normal (but often complicated) human things. Examples of people explicitly applying strong optimization in various domains: startup founders, professional athletes, AI capabilities researchers, AI alignment researchers, dating.
My claim is not that humans do not optimise for outcomes—they clearly do and this is a crucial part of our intelligence. Instead, my claim is about the computational architecture of this optimisation process—that humans are primarily (but not entirely) amortised optimisers who have learnt approximations of direct optimisation through meta-RL in the PFC. This does not mean we cannot exert optimisation power, just that we are not cognitively built as utility maximisers.
Definitely different people have different levels of optimization power they can exert and can optimise more or less strongly, but on the scale of average human → true utility maximiser, even the most agentic humans are probably closer to the average than to the utility maximiser.
Now, there are good computational reasons for this. Actually performing direct optimisation like this is extremely computationally costly in complex and unbounded environments, so we use computational shortcuts, as, by and large, do our DL systems. This does not necessarily hold in the limit but seems to be the case at the moment.
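To make the amortised vs. direct distinction concrete, here is a toy sketch (my own illustration, not from the post or the comments; the one-dimensional environment, reward, and hand-written stand-in policy are all invented): the direct optimiser searches over whole action sequences with a world model, while the amortised policy maps state to action in a single cheap step.

```python
# Toy illustration of amortised vs. direct optimisation (invented for
# illustration, not taken from the post). The environment is a 1-D line;
# the agent wants to reach GOAL.
import itertools

ACTIONS = [-1, 0, +1]           # move left, stay, move right
GOAL = 7                        # the state we want to reach

def model_step(state, action):
    """A perfect world model: deterministic 1-D dynamics."""
    return state + action

def reward(state):
    return -abs(state - GOAL)   # closer to the goal is better

def direct_optimiser(state, horizon=6):
    """Direct optimisation: search every action sequence with the model.
    Cost grows as |ACTIONS|**horizon -- this is the expensive part."""
    best_plan, best_return = None, float("-inf")
    for plan in itertools.product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in plan:
            s = model_step(s, a)
            total += reward(s)
        if total > best_return:
            best_plan, best_return = plan, total
    return best_plan[0]          # act on the first step of the best plan

def amortised_policy(state):
    """Amortised optimisation: a cheap learned mapping from state to action.
    Hand-written here as a stand-in for a trained policy network -- one
    'forward pass' at decision time, no search."""
    if state < GOAL:
        return +1
    if state > GOAL:
        return -1
    return 0

state = 0
print("direct   :", direct_optimiser(state))   # evaluates 3**6 = 729 plans
print("amortised:", amortised_policy(state))   # one constant-time lookup
```

Both pick the same action here; the difference is purely in how much computation each spends per decision, which is the sense in which amortisation is a shortcut.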
My own view is that the best future of humanity involves pretty drastic re-arrangements of most of the atoms in the lightcone. Maybe you think I’m personally not likely to succeed or work very hard at actually doing this, but if I only knew more, thought faster, had more time and energy… I think it becomes apparent pretty quickly where that ends up.
Yeah, so I am not claiming that this is necessarily reflectively stable or that it is the optimal thing to do with infinite resources. The point is that humans (and also AI systems) do not have these infinite resources in practice and hence take computational shortcuts which move them away from being pure utility maximisers (if that is actually the reflective endpoint for humanity, which I am unsure about). The goal of this post isn’t to describe hypothetical strong AIs but to describe how humans form values as well as how more human-like near-term AGIs are likely to function. Success at aligning these AGIs only gets us to the first step, and we will ultimately have to solve the aligning-superintelligence problem as well, but later.
I think the idea of Coherent Extrapolated Volition captures pretty crisply what it is that I (and many others) are optimizing for. My CEV is complicated, and there might be contradictions and unknown parts of it within me, but it sure doesn’t feel situationally dependent or unknown on a meta-level.
This is the point of the post! CEV is not a base-level concept. You don’t have primary reward sensors hooked up to the CEV. Nor is it a function of sensory observations. CEV is an entity that only exists in a highly abstract and linguistic/verbal latent space of your world model, and yet you claim to be aligned to it—even though it might be contradictory and have unknown parts. You value it even though the base RL in your brain does not have direct ‘hooks’ into it. Somehow, your brain has solved a complex pointers problem to get you to intrinsically care about a concept that is very far from primary rewards.
The goal of this post isn’t to describe hypothetical strong AIs but to describe how humans form values as well as how more human-like near-term AGIs are likely to function.
I think the post does a great job of explaining human value formation, as well as the architecture of human decision-making, at least mechanically.
I’m saying that neuroanatomy seems insufficient to explain how humans function in the most important situations, let alone artificial systems, near or far.
If a hedge fund trader can beat the market, or a chess grandmaster can beat their opponent, what does it matter whether the decision process they use under the hood looks more like tree search, or more like function approximation, or a combination of both?
It might matter quite a lot, if you’re trying to build a human-like AGI! If you just want to know if your AGI is capable of killing you though, both function approximation and tree search at the level humans do them (or even somewhat below that level) seem pretty deadly, if they’re pointed in the wrong direction.
Whether it’s easy or hard to point an artificial system in any particular direction is another question.
Somehow, your brain has solved a complex pointers problem to get you to intrinsically care about a concept that is very far from primary rewards.
I think you’re saying that this is evidence that artificial systems which have similar architectures to human brains will also be able to solve this pointers problem, and if so, I agree.
I’m skeptical that anyone will succeed in building such a system before others build other kinds of systems, but I do think it is a good thing to try, if done carefully.
Though, a world where such systems are easy to build is not one I’d call “benign”, since if it’s easy to “just ask for alignment”, it’s probably also pretty easy to ask for not-alignment. Put another way, in the world where CoEms are the first kind of strong AGIs to get built, I think p(s-risk) goes up dramatically, though p(good outcomes) also goes up, perhaps even more dramatically, and p(molecular squiggles) goes down. I mostly think we’re not in that world, though.
I think you’re saying that this is evidence that artificial systems which have similar architectures to human brains will also be able to solve this pointers problem, and if so, I agree.
I’m skeptical that anyone will succeed in building such a system before others build other kinds of systems, but I do think it is a good thing to try, if done carefully.
I think our biggest crux is this. My idea here is that by default we get systems that look like this—DL systems look like this! And my near-term prediction is that DL systems will scale all the way to AGI. Almost any near-term AGI will look ‘human-like’ in this sense—some combination of model-free and model-based RL wrapped around an unsupervised world model. In the even nearer term you might even scale to AGI with pure AutoGPT-style agents which are just doing iterative planning by conditioning the LLM! Both potential AGI designs look way closer to human-like than a pure EY-style utility maximiser. Now, EY might still be right in the limit of superintelligence and RSI, but that is not what near-term systems seem likely to look like.
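For concreteness, here is a minimal sketch of what ‘iterative planning by conditioning the LLM’ could look like (my own illustration, not from the comment; call_llm and execute_action are hypothetical stand-ins for a real completion API and a tool executor):

```python
# Minimal AutoGPT-style loop (illustrative only): the 'planner' is nothing
# but repeated re-conditioning of the same LLM on its own previous plans
# and observations.

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: plug in any real chat/completion API here.
    raise NotImplementedError

def autogpt_style_agent(goal: str, execute_action, max_steps: int = 10):
    history = []                                  # running scratchpad of steps
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            "Previous steps and results:\n"
            + "\n".join(history)
            + "\nPropose the single next action, or say DONE."
        )
        action = call_llm(prompt)                 # planning = conditioning the LLM
        if action.strip() == "DONE":
            break
        result = execute_action(action)           # tool use / environment step
        history.append(f"- {action} -> {result}")
    return history
```

The point is that the planning here is just repeated conditioning of one model on its own history, which is structurally much closer to the human-like picture above than to an explicit utility maximiser.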
Though, a world where such systems are easy to build is not one I’d call “benign”, since if it’s easy to “just ask for alignment”, it’s probably also pretty easy to ask for not-alignment. Put another way, in the world where CoEms are the first kind of strong AGIs to get built, I think p(s-risk) goes up dramatically, though p(good outcomes) also goes up, perhaps even more dramatically, and p(molecular squiggles) goes down. I mostly think we’re not in that world, though.
Yeah, I completely agree with this point and I think this is going to be almost inevitable for any alignment strategy. As a consequence of the orthogonality thesis, it is likely that if you can align a system at all, then you can choose to align it to something bad—like making people suffer—if you want to. I think this is true across almost all worlds—and so we definitely get increasing p(s-risk) along with increased p(survival). This is not a problem technical alignment can solve but instead needs to involve some level of societal agreement / governance.
I think our biggest crux is this. My idea here is that by default we get systems that look like this—DL systems look like this! And my near-term prediction is that DL systems will scale all the way to AGI. Almost any near-term AGI will look ‘human-like’ in this sense—some combination of model-free and model-based RL wrapped around an unsupervised world model.
Agree this is a crux. A few remarks:
Structural similarity doesn’t necessarily tell us a lot about a system’s macro-level behavior. Examples: Stockfish 1 vs. Stockfish 20, the brain of a supervillain vs. the brain of an average human, a transformer model with random weights vs. one trained to predict the next token in a sequence of text.
Or, if you want to extend the similarity to the training process, a transformer model trained on a corpus of text from the human internet vs. one trained on a corpus of text from an alien internet. An average human vs. a supervillain who have 99%+ identical life experiences from birth. Stockfish implemented by a beginner programmer vs. a professional team.
I’d say, to the extent that current DL systems are structurally similar to human brains, it’s because these structures are instrumentally useful for doing any kind of useful work, regardless of how “values” in those systems are formed, or what those values are. And as you converge towards the most useful structures, there is less room left over for the system to “look similar” to humans, unless humans are pretty close to performing cognition optimally already.
Also, a lot of the structural similarity is in the training process of the foundation models that make up one component of a larger artificial system. The kinds of things people do with LangChain today don’t seem similar in structure to any part of a single human brain, at least to me. For example, I can’t arrange a bunch of copies of myself in a chain or tree, and give them each different prompts running in parallel. I could maybe simulate that by hiring a bunch of people, though it would be OOMs slower and more costly.
I also can’t add a Python shell or a “tree search” method, or perform a bunch of experimental neurosurgery on humans, the way I can with artificial systems. These all seem like capabilities-enhancing tools that don’t preserve structural similarity to humans, and may also not preserve similarity of values to the original, un-enhanced artificial system.
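As a purely illustrative sketch of the ‘copies running in parallel’ point above (not something from the comment; call_llm is again a hypothetical stand-in for a real completion API), fanning one model out over many differently-prompted copies is nearly trivial for an artificial system:

```python
# Fan one model out over many prompts at once -- something a single human
# brain cannot do to itself. Purely illustrative; call_llm is hypothetical.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: plug in any real chat/completion API here.
    raise NotImplementedError

def fan_out(prompts):
    """Run many differently-prompted 'copies' of the same model in parallel
    and collect their answers for some downstream aggregation step."""
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        return list(pool.map(call_llm, prompts))

# e.g. fan_out(["Critique plan A", "Critique plan B", "List failure modes of C"])
```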