(Huh, I never saw this—maybe my weekly batched updates are glitched? I only saw this because I was on your profile for some other reason.)
I really appreciate these thoughts!
But you then propose an RL scheme. It seems to me like it’s still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.
I would say “that isn’t how on-policy RL works; it doesn’t just intelligently find increasingly high-reinforcement policies; which reinforcement events get ‘exploited’ depends on the exploration policy.” (You seem to guess that this is my response in the next sub-bullets.)
While I find the particular examples intuitive, the overall claim seems too good to be true: effectively, that the path-dependencies which differentiate GD learning from ideal Bayesian learning are exactly the tool we need for alignment.
shrug, too good to be true isn’t a causal reason for it to not work, of course, and I don’t see something suspicious in the correlations. Effective learning algorithms may indeed have nice properties we want, especially if some humans have those same nice properties due to their own effective learning algorithms!
For my money, the properties of human and AI systems that matter for alignment are not the ones Shard Theory points to, but rather several other properties:
Alignment generalizes further than capabilities, both because verification is easier than generation and because learning values is easier than learning a lot of other real-world capabilities.
It’s looking like the values of humans are far, far simpler than a lot of evopsych literature and Yudkowsky thought, and related to this, values are less fragile than people thought 15-20 years ago, in the sense that they generalize far better OOD than people used to believe.
The brain and DL AIs, while not the same thing, are doing reasonably similar things, such that we can transfer a lot of AI insights into neuroscience/human-brain insights, and vice versa.
One of those lessons is that Sutton’s bitter lesson applies to human values and morals, which cashes out into the fact that the data matters much more than the algorithm when predicting a system’s values, especially the OOD generalization of those values, and thus controlling the data is basically equivalent to controlling the values.
It’s looking like the values of humans are far, far simpler than a lot of evopsych literature and Yudkowsky thought, and related to this, values are less fragile than people thought 15-20 years ago, in the sense that they generalize far better OOD than people used to believe.
I’m not sure I like this argument very much, as it currently stands. It’s not that I believe anything you wrote in this paragraph is wrong per se, but more like this misses the mark a bit in terms of framing.
Yudkowsky had (and, AFAICT, still has) a specific theory of human values in terms of what they mean in a reductionist framework, where it makes sense (and is rather natural) to think of (approximate) utility functions of humans and of Coherent Extrapolated Volition as things-that-exist-in-the-territory.
I think a lot of writing and analysis, summarized by me here, has cast a tremendous amount of doubt on the viability of this way of thinking and has revealed what seem to me to be impossible-to-patch holes at the core of these theories. I do not believe “human values” in the Yudkowskian sense ultimately make sense as a coherent concept that carves reality at the joints; I instead observe a tremendous number of unanswered questions and apparent contradictions that throw the entire edifice in disarray.
But supplementing this reorientation of thinking around what it means to satisfy human values has been “prosaic” alignment researchers pivoting more towards intent alignment as opposed to doomed-from-the-start paradigms like “learning the true human utility function” or ambitious value learning, a recognition that realism about (AGI) rationality is likely just straight-up false and that the very specific set of conclusions MIRI-clustered alignment researchers have reached about what AGI cognition will be like are entirely overconfident and seem contradicted by our modern observations of LLMs, and ultimately an increased focus on the basic observation that full value alignment simply is not required for a good AI outcome (or at the very least to prevent AI takeover). So it’s not so much that human values (to the extent such a thing makes sense) are simpler, but more so that fulfilling those values is just not needed to nearly as high a degree as people used to think.
Here are my thoughts on these interesting questions that you raise:
IMO, I don’t think Coherent Extrapolated Volition works, basically because I don’t expect convergence in values by default, and I agree with Steven Byrnes plus Joe Carlsmith here:
https://joecarlsmith.com/2021/06/21/on-the-limits-of-idealized-values
https://www.lesswrong.com/posts/SqgRtCwueovvwxpDQ/valence-series-2-valence-and-normativity#2_7_3_Possible_implications_for_AI_alignment_discourse
That said, I think the approximate-utility-function framing is actually correct, in that the GPT series (and maybe o1/o3 too) does have a utility function that’s about prediction, and we can validly turn utility functions over plans/predictions into utility functions over world states, so we can connect the two different types of utility functions together. I have commented on this before:
https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform#aCFCrRDALk3DMNkzh
https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform#gjE9eDiAZvzKxcgSs
More generally, I have more faith that the utility function paradigm can be reformed significantly without wholly abandoning it.
I also think that, to the extent human values do work as a concept, they work at a higher level than reductionism would posit, but they are built out of lower-level parts that are changeable.
I think a lot of writing and analysis, summarized by me here, has cast a tremendous amount of doubt on the viability of this way of thinking and has revealed what seem to me to be impossible-to-patch holes at the core of these theories. I do not believe “human values” in the Yudkowskian sense ultimately make sense as a coherent concept that carves reality at the joints; I instead observe a tremendous number of unanswered questions and apparent contradictions that throw the entire edifice in disarray
To address the unanswered questions, I’ll first handle this one:
What do we mean by morality as fixed computation in the context of human beings who are decidedly not fixed and whose moral development through time is almost certainly so path-dependent (through sensitivity to butterfly effects and order dependence) that a concept like “CEV” probably doesn’t make sense? The feedback loops implicit in the structure of the brain cause reward and punishment signals to “release chemicals that induce the brain to rearrange itself” in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not “update” its brain chemistry the same way that a biological being does be “human” in any decision-relevant way?). We can think about a continuous personal identity through the lens of mutual information about memories, personalities etc, but our current understanding of these topics is vastly incomplete and inadequate, and in any case the naive (yet very widespread, even on LW) interpretation of “the utility function is not up for grabs” as meaning that terminal values cannot be changed (or even make sense as a coherent concept) seems totally wrong.
I agree with the path dependency argument that morality is more path dependent than LWers think, and while controlling your value evolution is easier than predicting it, I basically agree with the claim that CEV probably makes less sense/isn’t unique.
I think the fact that these feedback loops update the brain significantly, causing the computation graph to change and thus seriously complicating updatelessness for uploaded/embodied humans, is a real problem. A big reason the early updateless decision theories didn’t matter that much is that they assumed logical and computational omniscience, so they don’t work for real humans.
I’m not sure if this is unsolvable, and I wouldn’t say that there won’t be a satisfying implementation of UDT in the future, but yeah I would not yet bet that much on updatelessness working out very well.
(Heck, even Solomonoff induction/AIXI isn’t logically/computationally omniscient, if only because it can’t compute certain functions that more powerful computers can.)
Vladimir Nesov describes it more here:
https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform#s4cTgQZNpWRLKp3EG
https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform#fdDfad5s8cS5kumEf
I am not sure whether any version of updatelessness in decision theory can survive realistic constraints, so I don’t know whether it is solvable, and I don’t think it matters for now.
We can think about a continuous personal identity through the lens of mutual information about memories, personalities etc, but our current understanding of these topics is vastly incomplete and inadequate, and in any case the naive (yet very widespread, even on LW) interpretation of “the utility function is not up for grabs” as meaning that terminal values cannot be changed (or even make sense as a coherent concept) seems totally wrong.
I basically agree with this, and I think the likely wrong assumption here is that an AI’s terminal values can be treated as fixed. The big issue I have with a lot of LW content on values is precisely that it treats a human’s utility function (when the utility function framing works, which it sometimes does and sometimes doesn’t) as fixed.
I think this post might help you see how a more realistic version of utility functions would work, one which includes the fact that terminal values change:
https://www.lesswrong.com/posts/RorXWkriXwErvJtvn/agi-will-have-learnt-utility-functions
On Charlie Steiner’s view of Goodhart, quoted below, I have a pretty simple response:
There has already been a great deal of discussion about these topics on LW (1, 2, etc), and Charlie Steiner’s distillation of it in his excellently-written Reducing Goodhart sequence still seems entirely correct:
Humans don’t have our values written in Fortran on the inside of our skulls, we’re collections of atoms that only do agent-like things within a narrow band of temperatures and pressures. It’s not that there’s some pre-theoretic set of True Values hidden inside people and we’re merely having trouble getting to them—no, extracting any values at all from humans is a theory-laden act of inference, relying on choices like “which atoms exactly count as part of the person” and “what do you do if the person says different things at different times?”
The natural framing of Goodhart’s law—in both mathematics and casual language—makes the assumption that there’s some specific True Values in here, some V to compare to U. But this assumption, and the way of thinking built on top of it, is crucially false when you get down to the nitty gritty of how to model humans and infer their values.
The answer to this is basically John Wentworth’s comment, repeated here:
https://www.lesswrong.com/posts/gQY6LrTWJNkTv8YJR/the-pointers-problem-human-values-are-a-function-of-humans#Ar87Jkeg8TzSraLcD
Ok, I think I see what you’re saying now. I am of course on board with the notion that e.g. human values do not make sense when we’re modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction.
However, there is a difference between “the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction”, and “there is not a unique, well-defined referent of ‘human values’”. The former does not imply the latter. Indeed, the difference is essentially the same issue in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human’s model.
An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its “goals” do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different “goals” to the robot—e.g. modelling it at the level of an electronic circuit or at the level of assembly code may ascribe different goals to the system, there may be subsystems with their own little control loops, etc.
And yet, when I talk about the utility function I hard-coded into the robot, there is no ambiguity about which thing I am talking about. “The utility function I hard-coded into the robot” is a concept within my own world-model. That world-model specifies the relevant level of abstraction at which the concept lives. And it seems pretty clear that “the utility function I hard-coded into the robot” would correspond to some unambiguous thing in the real world—although specifying exactly what that thing is, is an instance of the pointers problem.
Does that make sense? Am I still missing something here?
More generally, solutions to these sorts of problems come down to the fact that we can make a new abstract layer out of atomic parts, and use error-correction to make the abstraction as non-leaky as possible.
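As a toy illustration of that error-correction point (my own sketch, not something from the original discussion): you can build a far more reliable abstract bit out of unreliable low-level parts by adding redundancy and majority-voting, and the abstraction only "leaks" when many parts fail at once.

```python
import random

def noisy_cell(bit: int, flip_prob: float = 0.1) -> int:
    """A low-level 'atomic' part: stores a bit, but sometimes flips it."""
    return bit ^ (random.random() < flip_prob)

def abstract_bit(bit: int, n_copies: int = 101) -> int:
    """The abstract layer: a repetition code plus majority vote.
    It reads back wrong only if a majority of the underlying parts fail at once."""
    copies = [noisy_cell(bit) for _ in range(n_copies)]
    return int(sum(copies) > n_copies / 2)

# Each part is wrong ~10% of the time; the abstract bit is wrong only when
# 51+ of 101 parts fail together, which essentially never happens.
trials = 10_000
errors = sum(abstract_bit(1) != 1 for _ in range(trials))
print(f"abstract-bit error rate over {trials} trials: {errors / trials}")
```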
On what human preferences are, I’ll state my answer below:
To answer the question of whether we have good reason to expect utility maximization as a description of what real AI looks like, my modern answer is that this has happened at least 2 times, which provides limited but useful evidence on whether utility maximization will appear.
To answer the question of what counts as human preferences, I think all of the proposed answers have some correctness: human preferences range over both world states and universe histories, and plausibly also include some values that aren’t reducible to the utility-function view.
An important point about utility functions/coherence arguments is that, for the purposes of coherence theorems, we only care about revealed preferences/behaviors: we only need to observe an agent’s behavior to check whether it has a coherent utility function, not whether a utility function is truly implemented inside its head:
https://www.lesswrong.com/posts/DXxEp3QWzeiyPMM3y/a-simple-toy-coherence-theorem#Coherence_Is_About_Revealed_Preferences
https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems#ddSmggkynaAHmFuHi
This matters for debates like whether humans are coherent or not.
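To make the revealed-preference point concrete, here is a minimal toy sketch of my own (not from the linked posts): given only an agent's observed binary choices, we can check whether they are consistent with some utility function by looking for preference cycles, without ever inspecting the agent's internals.

```python
from itertools import permutations

def coherent(choices: list[tuple[str, str]]) -> bool:
    """choices[i] = (picked, rejected) from one observed binary choice.
    The revealed preferences admit a utility function iff some strict ranking
    of the options agrees with every observed choice (i.e. there is no cycle)."""
    options = {x for pair in choices for x in pair}
    for order in permutations(options):  # brute force is fine at toy sizes
        rank = {x: i for i, x in enumerate(order)}
        if all(rank[a] < rank[b] for a, b in choices):
            return True
    return False

print(coherent([("apple", "banana"), ("banana", "cherry")]))   # True
print(coherent([("apple", "banana"), ("banana", "cherry"),
                ("cherry", "apple")]))                         # False: a money-pumpable cycle
```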
To answer @Wei Dai’s question here:
On second thought, even if you assume the latter, the humans you’re learning from will themselves have problems with distributional shifts. If you give someone a different set of life experiences, they’re going to end up a different person with different values, so it seems impossible to learn a complete and consistent utility function by just placing someone in various virtual environments with fake memories of how they got there and observing what they do. Will this issue be addressed in the sequence?
I think the ultimate answer to this is to reject the assumption that values are fixed for all time, no matter which arbitrary environment is used, and instead focus on learned utility functions.
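As a toy sketch of what I mean by a learned utility function (my own illustration, with made-up details): the utility is itself a model whose parameters keep updating on new experience, rather than a frozen formula over outcomes, so the values the agent acts on can shift as its data shifts.

```python
import numpy as np

class LearnedUtility:
    """Utility as a learned linear function of state features, not a fixed formula."""
    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = np.zeros(n_features)
        self.lr = lr

    def utility(self, features: np.ndarray) -> float:
        return float(self.w @ features)

    def update(self, features: np.ndarray, feedback: float) -> None:
        # Online regression step: the value function drifts with the data it sees.
        error = feedback - self.utility(features)
        self.w += self.lr * error * features

u = LearnedUtility(n_features=2)
state = np.array([1.0, 0.0])
print(u.utility(state))          # 0.0 before any experience
u.update(state, feedback=1.0)
print(u.utility(state))          # moves toward the feedback it was given
```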
I think Wei Dai is pointing at a pretty real and deep problem in how LWers think about utility functions/values, downstream of making AIXI the dominant model of what AI would look like if it achieved ASI: AIXI has totally fixed values, in the form of a reward function it optimizes for all time. But contra Wei Dai and Rohin Shah, I don’t think this dooms the ambitious value learning agenda based on utility functions, since not all utility functions are non-responsive to the environment.
https://www.lesswrong.com/posts/RorXWkriXwErvJtvn/agi-will-have-learnt-utility-functions
On this:
How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so, how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?
This sounds like you are claiming that there are good reasons to believe that the pointers problem is fundamentally impossible to solve, at least in the general case.
I’ll say a few things that are relevant here:
I don’t particularly think this matters for AI x-risk in general, and thus I will mostly punt on this question.
I think AI progress is weak evidence that something like the pointers problem is possible to solve in theory, but not that strong.
I don’t particularly think we need to either solve it or prove it impossible today, and we can defer to our future selves on this question.
https://www.lesswrong.com/posts/Mha5GA5BfWcpf2jHC/potential-bottlenecks-to-taking-over-the-world#XFSZNWrHANdXhoTcT
I feel similarly for realism about rationality, but I’d drop my second point.
and ultimately an increased focus on the basic observation that full value alignment simply is not required for a good AI outcome (or at the very least to prevent AI takeover).
I definitely agree that full value alignment is not required for humans to thrive in a world where AIs control the economy, and this was not appreciated well enough by a lot of doomy people, primarily because they over-assumed that values are fragile and assumed AIs would basically instantly take over by FOOMing from today’s intelligence to superintelligence. While I do genuinely think that human values are simpler than Yudkowsky thought, the observation that full alignment is not required for AI safety is an underrated insight.
This insight is compatible with a world where human values are genuinely simpler than we thought.
Wei Dai (yet again) and Stuart Armstrong explained how there doesn’t seem to be a principled basis to expect “beliefs” and “values” to ultimately make sense as distinct and coherent concepts that carve reality at the joints, and also how inferring a human’s preferences merely from their actions is impossible unless you make specific assumptions about their rationality and epistemic beliefs about the state of the world, respectively. Paul Christiano went in detail on why this means that even the “easy” goal inference problem, meaning the (entirely fantastical and unrealistically optimistic set-up) in which we have access to an infinite amount of compute and to the “complete human policy (a lookup table of what a human would do after making any sequence of observations)”, and we must then come up with “any reasonable representation of any reasonable approximation to what that human wants,” is in fact tremendously hard.
Re the difference between beliefs and values: for AIXI-like/fixed-value agents this is pretty easy, in that a value isn’t updatable by Bayesian reasoning about the world, and in particular such an agent doesn’t update its value system in response to moral arguments. Arbitrarily competent/compute-rich agents can have very different values from you, but not arbitrarily different beliefs, unless the difference is caused by the two of you being in different universes/worlds/situations.
For changeable-value agents like us, which (as I pointed out above) are closer to learned utility functions, this list might help:
You don’t favor shorter long-list definitions of goodness over longer ones. The criteria for choosing the list have little to do with its length, and more with what a human brain emulation with such-and-such modifications to make it believe only and all relevant true empirical facts would decide once it had reached reflective moral equilibrium.
Agents who have a different “long list” definition cannot be moved by the fact that you’ve declared your particular long list “true goodness”.
There would be no reason to expect alien races to have discovered the same long list defining “true goodness” as you.
An alien with a different “long list” than you, upon learning the causal reasons for the particular long list you have, is not going to change their long list to be more like yours.
You don’t need to use probabilities and update your long list in response to evidence; quite the opposite, you want it to be changed only in specific circumstances that are set by you. (edited from original)
On the easy goal inference problem: if the no-free-lunch result on value learning is proved the way no-free-lunch theorems are usually proved in machine learning, then it’s not a blocker under the infinite-compute affordance, since you can consider all 2^S options (where S is the relevant set) for what the human’s values are via brute-force/uniform search, and stop once you have exhausted all possible inputs/options.
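As a sketch of that brute-force move (my own toy illustration, under the unrealistic infinite-compute assumption): enumerate every candidate value function over a finite set of situations and keep only the ones consistent with the behavior observed so far; with no further assumptions, all this can ever do is memorize the lookup table.

```python
from itertools import product

situations = ["A", "B", "C"]        # toy observation space
observed = {"A": 1, "B": 0}         # behavior seen so far (1 = approve, 0 = disapprove)

# Enumerate all 2^|situations| candidate value functions (the brute-force search).
candidates = [dict(zip(situations, bits))
              for bits in product([0, 1], repeat=len(situations))]

# Keep only the candidates consistent with the observed behavior.
consistent = [v for v in candidates if all(v[s] == observed[s] for s in observed)]

print(len(candidates), "candidates;", len(consistent), "still consistent")
print({v["C"] for v in consistent})  # {0, 1}: the unobserved case stays undetermined
```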
I’d almost call no-free-lunch theorems inapproximability results combined with computational complexity results: if you can make no assumptions, no algorithm performs better than brute-force/uniform search over all possible inputs/options, and in the general case you either perfectly learn a lookup table or you don’t learn the thing you want to learn at all.
I also agree with davidad here:
https://www.lesswrong.com/posts/yTvBSFrXhZfL8vr5a/worst-case-thinking-in-ai-alignment#N3avtTM3ESH4KHmfN
However, the issue is that in the general case there is no complexity bound on how complicated someone’s values can be, so I was definitely relying on infinite compute here, which isn’t available in our actual situation.
Maybe beliefs and values can be more or less unified in special cases, though I doubt that will happen.
For the differences between humans, I’ll answer that question below:
Joe Carlsmith again, in his outstanding description of “An even deeper atheism”, questioned “how different, exactly, are human hearts from each other? And in particular: are they sufficiently different that, when they foom, and even “on reflection,” they don’t end up pointing in exactly the same direction?”, explaining that optimism about this type of “human alignment” is contingent on the “claim that most humans will converge, on reflection, to sufficiently similar values that their utility functions won’t be “fragile” relative to each other.” But Joe then keyed in on the crucial point that “while it’s true that humans have various important similarities to each other (bodies, genes, cognitive architectures, acculturation processes) that do not apply to the AI case, nothing has yet been said to show that these similarities are enough to overcome the “extremal Goodhart” argument for value fragility”, mentioning that when we “systematize” and “amp them up to foom”, human desires decohere significantly. (This is the same point that Scott Alexander made in his classic post on the tails coming apart.) Ultimately, the essay concludes by claiming that it’s perfectly plausible for most humans to be “paperclippers relative to each other [in the supposed reflective limit]”, which is a position of “yet-deeper atheism” that goes beyond Eliezer’s unjustified humanistic trust in human hearts.
I basically agree with this. One of the more important effects of AI very deep into takeoff is that we will start realizing that a lot of human alignment relied on people being dependent on each other, and on each person being dependent on society, which is why societal coercion like laws/police mostly works. AI more or less breaks that, and there is no reason to assume that a lot of people wouldn’t be paperclippers relative to each other if they didn’t need society.
To be clear, I still expect some level of cooperation, due to the existence of very altruistic people. But the reduction of positive-sum trades between different values, combined with the fact that a lot of our value systems only tolerate other value systems in contexts where we need other people, will make our future surprisingly dark compared to what people usually expect, because “most humans [are] paperclippers relative to each other [in the supposed reflective limit]”.
One reason I think Eliezer got this wrong is, as you stated, that he puts too much trust in human hearts. Another reason is that he treated the risk of an AI that kills everyone due to misalignment as a very high-probability threat, and incorrectly assumed that it doesn’t matter which humans, and which human values, get control over AI, because of the assumed psychological unity of humankind in values.
In essence, I think politics matters way more than Eliezer thinks it does for how valuable the future is to you, and political fights, while not great, are unfortunately more necessary than you think, so I disagree with this quote’s attitude:
Now: let’s be clear, the AI risk folks have heard this sort of question before. “Ah, but aligned with whom?” Very deep. And the Yudkowskians respond with frustration. “I just told you that we’re all about to be killed, and your mind goes to monkey politics? You’re fighting over the poisoned banana!”
To address computationalism and indexical values for a bit, here’s my answer:
In any case, the rather abstract “beliefs, memories and values” you solely purport to value fit the category of professed ego-syntonic morals much more so than the category of what actually motivates and generates human behavior, as Steven Byrnes explained in an expectedly outstanding way:
An important observation here is that professed goals and values, much more than actions, tend to be disproportionately determined by whether things are ego-syntonic or -dystonic. Consider: If I say something out loud (or to myself) (e.g. “I’m gonna quit smoking” or “I care about my family”), the actual immediate thought in my head was mainly “I’m going to perform this particular speech act”. It’s the valence of that thought which determines whether we speak those words or not. And the self-reflective aspects of that thought are very salient, because speaking entails thinking about how your words will be received by the listener. By contrast, the contents of that proclamation—actually quitting smoking, or actually caring about my family—are both less salient and less immediate, taking place in some indeterminate future (see time-discounting). So the net valence of the speech act probably contains a large valence contribution from the self-reflective aspects of quitting smoking, and a small valence contribution from the more direct sensory and other consequences of quitting smoking, or caring about my family. And this is true even if we are 100% sincere in our intention to follow through with what we say. (See also Approving reinforces low-effort behaviors, a blog post making a similar point as this paragraph.)
[...]
According to this definition, “values” are likely to consist of very nice-sounding, socially-approved, and ego-syntonic things like “taking care of my family and friends”, “making the world a better place”, and so on.
Also according to this definition, “values” can potentially have precious little influence on someone’s behavior. In this (extremely common) case, I would say “I guess this person’s desires are different from his values. Oh well, no surprise there.”
Indeed, I think it’s totally normal for someone whose “values” include “being a good friend” will actually be a bad friend. So does this “value” have any implications at all? Yes!! I would expect that, in this situation, the person would either feel bad about the fact that they were a bad friend, or deny that they were a bad friend, or fail to think about the question at all, or come up with some other excuse for their behavior. If none of those things happened, then (and only then) would I say that “being a good friend” is not in fact one of their “values”, and if they stated otherwise, then they were lying or confused.
I agree this is a plausible motivation, but one confound here is that all discussions of uploading, and of whether it preserves you, are fundamentally stalled by the fact that we don’t have anything close to the classical uploading machines. You have to discuss things pretty abstractly, and we don’t have good terminology for this, which is why I’d prefer to punt on this discussion until we have the technology.
Steve also argues, in my view correctly, that “all valence ultimately flows, directly or indirectly, from innate drives”, which are entirely centered on (indexical, selfish) subjective experience such as pain, hunger, status drive, emotions etc. I see no clear causal mechanism through which something like that could ever make a human (copy) stop valuing its qualia in favor of the abstract concepts you purport to defend.
I disagree with the universal quantifier here, and think that non-innate values can also contribute to valence.
I agree innate drives are a useful starting point, but I don’t buy that innate drives are the complete source of what you value, so I do think that non-indexical values can exist in humans.
More importantly, you can make the valuing of experiences like status or emotions non-indexical if you modify the mind such that it always values a certain experience equally, no matter which copy has it. More generally, one of the changes I expect around identity and values among a lot of uploaded humans is that they will treat their values much less indexically, and treat their identity as closer to an isomorphism/equivalence class of programs, like their source code, rather than thinking in an instance-focused way.
A big reason for this is model merging, which is applicable to current AIs and could plausibly be used on uploaded humans as well, and which lets you unify goals across copies in a satisfying manner. (This is one of the reasons why AIs will in practice be closer to a single big being than to billions of little beings, even if you could split them up into billions of little instances: this sort of model merging wouldn’t work on AIs that had strongly indexical goals like ours, so the technology will incentivize non-indexical goals.)
More below:
https://minihf.com/posts/2024-11-30-predictable-updates-about-identity/
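For concreteness, here is a minimal sketch of what naive weight-space model merging looks like (my own illustration; real merging methods are more sophisticated than plain interpolation). The point is just that two diverged copies can be folded back into a single set of weights, which has no clean analogue for strongly indexical, instance-focused goals.

```python
import numpy as np

def merge_models(params_a: dict, params_b: dict, alpha: float = 0.5) -> dict:
    """Naive model merging: element-wise interpolation of two models' parameters.
    Only meaningful when both models share an architecture (and, in practice,
    a common training origin)."""
    assert params_a.keys() == params_b.keys()
    return {name: alpha * params_a[name] + (1 - alpha) * params_b[name]
            for name in params_a}

# Two "copies" whose weights have drifted apart after divergent fine-tuning.
copy_1 = {"w": np.array([0.2, 1.0]), "b": np.array([0.0])}
copy_2 = {"w": np.array([0.4, 0.8]), "b": np.array([0.2])}
print(merge_models(copy_1, copy_2))  # {'w': array([0.3, 0.9]), 'b': array([0.1])}
```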
I also expect uploads to have more plasticity than current human brains.
As a general matter, accepting physicalism as correct would naturally lead one to the conclusion that what runs on top of the physical substrate works on the basis of… what is physically there (which, to the best of our current understanding, can be represented through Quantum Mechanical probability amplitudes), not what conclusions you draw from a mathematical model that abstracts away quantum randomness in favor of a classical picture, the entire brain structure in favor of (a slightly augmented version of) its connectome, and the entire chemical make-up of it in favor of its electrical connections. As I have mentioned, that is a mere model that represents a very lossy compression of what is going on; it is not the same as the real thing, and conflating the two is an error that has been going on here for far too long. Of course, it very well might be the case that Rob and the computationalists are right about these issues, but the explanation up to now should make it clear why it is on them to provide evidence for their conclusion.
I have a number of responses:
Quantum physics can be represented by a computation, since almost everything is representable by a computation (as shown in the link below), because the computationalist ontology is very, very expressive. One reason philosophical debates on computationalism go nowhere is that people don’t realize how expressive the computationalist framework is; but because of this very expressivity, the computationalist ontology often buys you no predictions unless you are more specific.
More here:
http://www.amirrorclear.net/academic/ideas/simulation/index.html
More importantly, classical computers can always faithfully simulate a quantum system such as a human given enough time, because quantum computers are no stronger than classical ones in what they can compute (only, at most, in how fast), so in this regard the map and territory match well enough.
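As a tiny illustration of that computability point (my own sketch): a classical program can track a quantum state exactly by storing and updating its amplitudes; the cost is that the state vector grows exponentially with system size, not that it becomes uncomputable.

```python
import numpy as np

# State vector of one qubit, starting in |0>.
state = np.array([1.0, 0.0], dtype=complex)

# Apply a Hadamard gate and read off measurement probabilities,
# all with ordinary classical arithmetic.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
state = H @ state
print(np.abs(state) ** 2)   # [0.5 0.5]
```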
So a computationalist view of how things work, including human minds, is entirely compatible with physicalism/believing in Quantum Mechanics.
Finally, to get to the actual crux: while I do think the classical picture of the brain as a classical connectome is a lossy model, I don’t think it’s so lossy as to ruin the chances of human uploading being possible. Indeed, I’d argue, given recent AI evidence, that a surprisingly large amount of what makes the human brain special can be replicated by very different entities/substrates, which is indirect/weak empirical evidence in favor of uploading being possible.
The source is below:
https://minihf.com/posts/2024-11-30-predictable-updates-about-identity/
This also answers this portion below:
More specifically, is a real-world being actually the same as the abstract computation its mind embodies? Rejections of souls and dualism, alongside arguments for physicalism, do not prove the computationalist thesis to be correct, as physicalism-without-computationalism is not only possible but also (as the very name implies) a priori far more faithful to the standard physicalist worldview.
Which is that the answer is in a sense trivially yes: a real-world being is the same as at least one abstract computation, solely because the algorithmic ontology is strictly more expressive than the physicalist ontology, and physicalist worldviews are also compatible with computationalist worldviews.
On human preferences:
One key thing that helps us out: in an MWI/infinite-universe scenario, assuming our locally observable/affectable universe isn’t special in how atoms clump up into bigger structures (which is pretty likely), every combination of atoms that is possible according to the laws of physics will be realized somewhere. Thus we can sensibly define the notion of possible worlds even in a deterministic universe/MWI multiverse, since all combinations allowed by the laws of physics occur, and you can distinguish the possible worlds from a very fine-grained perspective (if you could generate the entire universe yourself), so in theory possible worlds work out.
Thus, in theory we don’t need to translate between the map and the territory, but in practice we would like ways of translating human preferences, which live in the map, into preferences that correspond to the territory of reality.
On human values, I’d predict that current human values include both indexical parts referring to particular contexts and some abstract values that are not indexical, i.e. essentially context-free/invariant to copying scenarios. Justice/freedom seems likely to be one such value (for a non-trivial number of humans).
However, I don’t know the structural assumptions that are required in order to make the question “what does the human actually want?” a well defined question under realistic constraints on compute.
If we allow unrealistic models of computation, like a Turing machine with infinitely many states or a Blum-Shub-Smale machine, then it’s easy both to make the question well-defined and to make it answerable by these machines even under the no-free-lunch theorems, because there are at most 2^N possibilities (where N here is the set of all natural numbers, including 0), which can all be handled by the models of computation above.
On agents:
I definitely could agree with something like a claim that humans are closer to control processes than agents, or at least that the basic paradigm shouldn’t be agents but something else, but for our purposes, I don’t think we need a clean mathematical operationalization of what a powerful agent is in order for alignment to succeed in a practical sense.
And that is the end of the very long comment. I am fine with no response or with an incomplete response, but these are my thoughts on the very interesting questions you raise.
I want to echo Jonas’s statement and say that this was an enjoyable and thought-provoking comment to read. I appreciate the deep engagement with the questions I posed and the work that went into everything you wrote. Strong-upvoted!
I will not write a point-by-point response right now, but perhaps I will sometime soon, depending on when I get some free time. We could maybe do a dialogue about this at some point too, if you’re willing, but I’m not sure when I would be up for that just yet.
I am willing to do a dialogue, if you are interested @sunwillrise.
Randomly read this comment and I really enjoyed it. Turn it into a post? (I understand how annoying structuring complex thoughts coherently can be, but maybe do a dialogue or something? I liked this.)
I largely agree about a lot of the things missing in people’s views of utility functions, and I think you expressed some of that in a pretty good, deeper way.
When we get into acausality and Everett branches, I think we’re going a bit off-track. I think computational intractability and observer bias are interesting things to bring up, but I always find they never lead anywhere. Quantum Mechanics is fundamentally observer invariant, so positing something like MWI is a philosophical stance (one supported by Occam’s razor), but it is still observer dependent. What if there are no observers?
(Pointing at Physics as Information Processing)
Do you have any specific reason why you’re going into QMech when talking about brain-like AGI stuff?
Randomly read this comment and I really enjoyed it. Turn it into a post? (I understand how annoying structuring complex thoughts coherently can be, but maybe do a dialogue or something? I liked this.)
Maybe I should try a dialogue with someone else on this, because I don’t think any of my points are very extendible to a full post without someone helping me.
Do you have any specific reason why you’re going into QMech when talking about brain-like AGI stuff?
To be frank, this was mostly about clarifying the philosophy around computationalism/human values in general. I didn’t go that deep into QMech for brain-like AGI, and I don’t expect it to be immediately useful for my pursuits; the only role for QMech here is in clarifying some confusions people have, and it wasn’t even that necessary to make my points.
When we get into acausality and Everett branches, I think we’re going a bit off-track. I think computational intractability and observer bias are interesting things to bring up, but I always find they never lead anywhere. Quantum Mechanics is fundamentally observer invariant, so positing something like MWI is a philosophical stance (one supported by Occam’s razor), but it is still observer dependent. What if there are no observers?
Okay, the thing I think you are pointing to is that the same outcomes/rules can be generated from ontologically distinct interpretations. For our purposes, an observer is basically anything that interacts with anything else, whether it’s a human or a particle, so saying there are no observers corresponds to saying that there is nothing in the universe, including the forces, and in particular that dark energy is exactly 0.
The answer is that it would be a very different universe than our universe is today.
It’s looking like the values of humans are far, far simpler than a lot of evopsych literature and Yudkowsky.
I’ve missed this. Any particular link to get me started reading about this update? Shard theory seems to imply complex values in individual humans. Though certainly less fragile than Yudkowsky proposed.
Note, this is outside of Shard Theory’s scope, and I wasn’t appealing to Shard Theory here.
So the links that personally led me to make these updates are here:
This summary of Matthew Barnett’s post:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7
And 2 links from Beren about alignment:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/