Eliezer’s Lethality 22, while not worded to be about this proposal specifically, is in my head the standard first-stop objection to this sort of proposal:
Human operators are fallible, breakable, and manipulable. Human raters make systematic errors—regular, compactly describable, predictable errors. To faithfully learn a function from ‘human feedback’ is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we’d hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It’s a fact about the territory, not the map—about the environment, not the optimizer—that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.
Generalizable point: garbage in, garbage out. If humans try to create the sort of dataset proposed in the post, they will make systematic predictable errors. If humans try to create the dataset with the assistance of LLMs, the combined efforts of the humans and LLMs will still contain systematic predictable errors (debatable whether the LLMs would even be net-beneficial in that regard). One could maybe hope/argue that with “good enough” data the LLM will learn the “more natural” concept which humans were trying to convey via the data, and ignore those systematic errors (even though they’re predictable), but such an argument would need to lean heavily on the inductive bias of the AI rather than the data.
Also, a note on this part:
If we were able to construct models that were say, “angelic” in their motivation 99% of the time and human 1% of the time, then by setting up suitable incentives for several such models crosschecking each other’s behavior and moral reasoning, as long as we can avoid group-think or correlated collusion where several models conspire to switch to human mode at the same time...
Insofar as problems stem from systematic predictable errors in the training data, they will be highly correlated across instances. Running a bunch of “independent” instances will not help much, because their failures would not actually be independent; their failures all reflect the same shared underlying problems in the data-generation process.
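To illustrate the correlation point numerically, here is a minimal Monte Carlo sketch (the function name, parameters, and failure model are purely illustrative assumptions, not anything from the post): it estimates how often a bad case slips past all of several cross-checking instances as their failures go from independent to fully driven by a shared cause, such as the same systematic errors in the shared training data.

```python
import random

def miss_rate(k_instances, p_fail, correlation, trials=100_000):
    """Estimate how often all k cross-checking instances miss a bad case.

    correlation = 0.0: each instance fails independently with probability p_fail.
    correlation = 1.0: every instance's failure is driven by one shared cause
    (e.g. the same systematic error baked into the shared training data).
    """
    misses = 0
    for _ in range(trials):
        shared_failure = random.random() < p_fail  # the common-cause draw
        if all(shared_failure if random.random() < correlation
               else random.random() < p_fail
               for _ in range(k_instances)):
            misses += 1
    return misses / trials

for rho in (0.0, 0.5, 1.0):
    print(f"correlation={rho}: miss rate ~ {miss_rate(5, 0.01, rho):.5f}")
```

Under these invented numbers, independent failures give a miss rate of roughly p_fail^k, so adding instances helps enormously; with fully correlated failures the miss rate stays near p_fail no matter how many instances cross-check each other.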
Fair comment. Obviously the same problem applies to every Alignment proposal based on RLHF, DPO, etc., or indeed on any kind of human feedback, or input, or design. In the absence of actual literal angels to write your training dataset or feedback or code for you, or else a mathematical True Name of Human Values, you’re not going to reach perfection in one shot.
I don’t see an accurate mathematical description of Human Values (in anything less than, at the very least, gigabytes of mathematics, the size of the most compact possible description of us, our genome) as possible, so the only feasible solution to this that I can see is outlined in my post Requirements for a Basin of Attraction to Alignment — if you can create an AI that is semi-aligned, aligned enough and smart enough to be able to understand the argument for why a created intelligence should be doing Value Learning from its creators rather than just blindly propagating its current goal to its successors, then it will help you create a better-aligned successor. The proposal of this post is, as I suggest in the “What Next if This Works?” section above, for the first pass in such an iterative process. The result won’t be perfect, for the reasons you identify, and doesn’t have to be: it just has to be aligned enough that it wants to improve, and smart enough to be able to help you identify and fix its own flaws, at least with experience. Most obviously, by re-editing its training set for you to train a successor (see also my discussion on corrigibility above).
I’ve expanded the first paragraph of the “What Next if This Works?” section to address this directly — previously this was mostly covered by implication rather than explicitly.
To your second point, I’ve also added systematically correlated errors to the list of potential problems that would prevent us using cross-checks between AIs to improve performance of fallible AIs.
I don’t see an accurate mathematical description of Human Values (in anything less than, at the very least, gigabytes of mathematics, the size of the most compact possible description of us, our genome) as possible...
In that case, the standard advice would be to aim for mathematical rigor in (having a basin of convergence), (the AI allowing the user to correct it), etc. The hypothesis is that it’s much more viable to achieve mathematical perfection in those things than to achieve a perfect representation of human values in one shot. On the flip side, things like (having a basin of convergence) or (the AI allowing the user to correct it) are notorious for subtle failure modes, so they’re places where humans just providing some data without rigorously understanding what they’re aiming for seem particularly likely to make lots of major systematic errors.
And note that if an AI has been trained on “ground truth” containing a bunch of systematic errors, it’s less-than-usually likely to be much help for finding those errors.
(Tangential meta note: you’re doing quite a good job playing through the whole game tree here, well done.)
I’d love to see more discussion by more people of the convergence ideas presented in Requirements for a Basin of Attraction to Alignment (and its prequel Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis). I see this as an extremely important part of the question of just how hard the Alignment Problem is (and was kind of surprised by how little reaction that post got). Even that post is complicated enough that I’d personally regard turning it into hard mathematics as somewhat challenging: it’s basically a (hopefully exhaustive) list of requirements and steps for successfully deriving from first principles the logical argument that Value Learning from its creators is correct and necessary behavior for an engineered (as opposed to evolved) intelligence. It draws on basic facts from a pretty wide range of disciplines: mathematics, evolutionary theory, agent fundamentals, engineering, computational complexity theory, and so forth. I’d describe the entire argument as probably graduate-level material, primarily because it’s so inter-disciplinary rather than any of the individual parts being individually challenging (of course, LLMs are good at inter-disciplinary things). When I tested GPT-4 against it, it knew all the needed underlying facts, and with minimal prompting could usually put two or three of them together as needed for each individual step in the argument, but it made occasional minor slip-ups and couldn’t do the whole extended chain of the argument (very common failure modes for this generation of LLMs on complex arguments). I’m hopeful GPT-5 will do significantly better, but I’m confident that any AGI could do this — the question for convergence is how reliably it will do so. What I’d like to do to investigate further is to try things like having teams from a university debating society argue for and against the argument (with no logical fallacies or appeals to religion or evolved emotions allowed).
Describing in math not just the logical argument for Value Learning, but also the minimum knowledge and capabilities needed to accept it, and then the resulting convergence phenomenon, is definitely beyond my abilities, but it doesn’t look like a crazy thing for a real mathematician to attempt, and the results (assuming the argument holds water) would be an ideal thing to include in any training set.
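As a gesture at what such a treatment might look like, here is one very rough way the convergence claim could be phrased; the symbols a_n, f, and θ are my own illustrative assumptions, not definitions from the posts, and a real treatment would need far more structure than this.

```latex
% A toy formalization of the convergence claim (all symbols here are my own
% illustrative assumptions, not definitions from the posts).
% Let a_n \in [0,1] measure how well-aligned the n-th generation of AI is, and
% let f describe how well-aligned the successor it helps build turns out to be:
a_{n+1} = f(a_n), \qquad f : [0,1] \to [0,1] \ \text{continuous}.
% The basin of attraction of the fully-aligned fixed point a^* = 1 is
B = \{\, a_0 \in [0,1] : \lim_{n \to \infty} a_n = 1 \,\}.
% One sufficient condition for the whole interval (\theta, 1] to lie inside B:
f(1) = 1 \quad \text{and} \quad f(a) > a \ \text{for all } a \in (\theta, 1).
```

On this toy picture, “semi-aligned but inside the basin” just means a_0 > θ: each assisted generation is strictly more aligned than the last, and the sequence converges to the aligned fixed point.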
+1, that was an underrated post.

Agreed on all counts. Addition at the end.

Summary/redescription:

Any dataset created by humans, even with AI help, will not point directly at the True Name of human values. And it’s probably not a particularly Natural Abstraction. So you could just hope that value isn’t all that fragile (which nobody has bothered to discuss much, AFAICT). Or you could try to rigorously define that True Name in math, as you suggested. I’m puzzled as to why people think that might work; math/rigor seems like an extraordinarily bad fit for values.
That leaves finding a reliable basin of attraction for alignment. Roger’s work is one worthy try; inverse reinforcement learning or ambitious value learning are in the same ballpark. But I’m not sure any of those really work toward a basin that won’t leak; I don’t think “humanity” is probably a natural abstraction, although I think “values” or “goals” probably is. Setting an agent to learn the values of humanity would be a “leaky basin” or moving target as the agent’s learning shifted the definition of humanity.
An attractive alternative is to leave a human in the loop to correct likely deviations from human values over the course of learning and growth. My proposal on instruction-following AGI and Max Harms’ corrigibility are alignment targets with a basin of attraction, so we don’t need perfect original alignment. They have the massive downside of leaving one or a few humans in charge of AGI, but seem very much easier than full value alignment with humanity.
I disagree. I think living members of the Linnaean species Homo sapiens (the only sapient species on the planet Earth, the dominant species of most terrestrial ecosystems, and the cause of the Anthropocene) are a very obvious natural abstraction. Particularly from the viewpoint of the web + books + videos + etc., I’d be really rather surprised if aliens surveying Earth from space didn’t come up with a functionally equivalent abstraction. And for any living organism, including us, I think “well-being” and “evolutionary fitness” are obvious, interrelated, and pretty well-defined natural abstractions. (Admittedly, I read The Selfish Gene when young.)
I’m also even more sure that this would be a natural abstraction from this specific synthetic dataset, where it is the fundamental motivation of one of the two classes of intelligence in it, and strongly concerns the other one, and the whole dataset is clearly labeled with these two categories. The entire dataset is designed as a quadrillion-token pointer belaboring this one concept: we do need to not screw that up by using some cock-eyed definition when writing it, obviously, but I think the chance of an AGI-level LLM trained on a quadrillion tokens of this missing the natural abstraction for its primary point is vanishingly small.
As I discuss at length in Requirements for a Basin of Attraction to Alignment, I think that “what your creators would want, when thinking clearly, if they were smarter, thought longer, were more capable, in hindsight, etc.” is not only a natural abstraction, but to any artificial intelligence sufficiently smart and sufficiently nearly aligned, obviously the correct goal for a created-rather-than-evolved intelligence to aim for. I.e. I think that Value Learning (at least near AGI level — I’m less convinced at very high ASI levels where the counterfactuals about their creators involved become more extreme) not only has a well-defined natural abstraction for a target, but that this is even self-evidently the only correct target, for any AI that is sufficiently-close-to-aligned to be within the basin of attraction to this. (Of course, this claim is rather circular: any AI that didn’t have a natural abstraction similar to that would have a hard time being sufficiently-close-to-aligned to converge to it.)
I’m also far less worried, specifically for LLM-powered AIs, by the pointer problem and concerns about natural abstractions in general than many people on Less Wrong seem to be. This is an LLM: it understands our pointers and abstractions, in all their massive, vague, ambiguous and messy detail, extremely well. Even the features we’ve been finding in their middle layers with interpretability quite often map pretty well onto abstractions that make sense to us. Experiment with GPT-4: it knows exactly what “maximize the amount of diamond” means; pointing to diamond, or to a strawberry you want copied down to a cellular level of detail, is not hard when you’re talking to an LLM. It speaks every major natural language on the planet fluently. Frankly, I view this issue as beating a dead horse for as long as our AIs contain LLMs, at least at AGI level. Even if future LLM-based ASIs have more sophisticated concepts that we don’t, I’d expect them to be able to explain them to us about as well as is in fact possible (even when that requires a seminar, or a lecture course, or a lifetime’s study).
Yes, with a synthetic dataset you do need to avoid distorting that massive, vague, ambiguous and messy detail. That’s one of the points of the human mode in the dataset I propose: you can throw in all of Wikipedia and books and large swaths of the internet unfiltered (you just have to, under my proposal, add AI mode commentary to them, pointing out cases where humans are operating from selfish motives, and the likely effects of these).
However, what I’m talking/ranting about above is the concept of human well-being/values, which as I said I think is a natural abstraction. But rereading your comment, I think you were actually talking about a mathematical True Name of Human Values, by which I imagine you mean an incredibly long list of useful facts like “humans tend to prefer indoor temperatures of around 70–74 degrees Fahrenheit — roughly the average climate of the Great Rift Valley in Africa where they are thought to have evolved”, or something that that fact could be extracted from. (Technically, that fact is contained in our genome, but not in a very easily legible/extractable form.) If something like that is what you meant, then yes, I agree that it’s not a single natural abstraction, and also that mathematics seems like a bad format for it. I also think that any LLM whose training set includes a vast number of tokens of our output, as the one I’m proposing would, actually has encyclopedic knowledge of these sorts of trivia facts about what humans like: we write a lot about this stuff. All of which would put this into a different category of “thing I think some people on Less Wrong worry too much about, for LLMs”. LLMs know us very well, so if they care about our well-being, as I’m trying to ensure, then I expect them to be able to do a good job of knowing all the things that entails. So my claim would be that LLMs know Human Values well (I believe I covered that briefly in the first section of my post).
This is great; these are substantive issues. I doubt we’ll resolve them just in this thread, but I think they are very much worth working through.
As I say that, I remember one lingering doubt: I don’t think anyone will launch a humanity-aligned AGI even if there’s a strong consensus that you’re right and it would work. It seems like the people in charge would always prefer to launch one that’s learning their preferences, rather than the whole species’ aggregated preferences; they prefer their own, by definition.
That might be overcome by public pressure, were it not for the argument that corrigible AGI is fundamentally safer, and you get that with personal intent-aligned AGI (that is, corrigible in the Harms or Christiano sense, or instruction-following AGI in my terminology): “this guy really wants me to shut down right now, so I will”, vs. humanity-aligned AGI (humanity would only want me to shut down if I weren’t maximizing its preferences, which the AGI thinks it is by definition, even if it’s wrong).
So maybe the question of whether humanity is an adequately natural abstraction is not as high a priority to work through? That question seems likely to be addressed only when a human principal is good and ready to hand power over to a humanity’s-values-aligned sovereign AGI that’s been carefully designed with the help of a superintelligent corrigible assistant AGI.
It’s still interesting. So to give just a couple of points toward future discussions:
Yes, LLMs understand quite well what we mean. That doesn’t mean an agent with an LLM core won’t change its mind if it’s truly self-aware and learns autonomously as people do.
I agree that humanity as it exists can be pretty cleanly defined. Defining it that way for all time means that nobody is allowed to enhance their cognition, or to upload. It also means not giving moral status (or at least “voting” status) to any form of AGI or animal (except for “loaned moral status” based on humans’ collective concern for their welfare). You’ve discussed all of these issues in depth in your series AI, Alignment, and Ethics. This is not a limitation everyone will be okay with.
Leaving a set of relatively smart and well-intentioned humans in charge avoids that limitation, as well as providing a means of error-correcting if our first alignment attempt is importantly off course. It is a basin of attraction, like aiming for humanity’s values, but it is very different in nature, in that it includes a human as an active component steering the AGI’s values/goals into that attractor.
But to continue on the question of whether human values can be defined adequately for long-term stability: you also need to carefully define in what situation these humans would contemplate and refine their values, because human values seem highly path-dependent in what we’ve seen so far.
A lot of that is probably just expanding on your thought when you say:
[...] (at least near AGI level — I’m less convinced at very high ASI levels where the counterfactuals about their creators involved become more extreme) [...]
It seems like if you’ve solved alignment for a sovereign AGI but not for the ASI it will become under RSI, you haven’t solved alignment in a useful way (since it might only be a couple of years before that AGI progresses to ASI and its alignment drifts under powerful reflection and autonomous learning). My hesitations are probably all at the level you’re terming ASI.
(And I should probably just start using ASI for fully general competent agentic AGI).
Yes, the terminology I’m using is: AGI = roughly comparable to human capacity, possibly somewhat higher or lower in narrow areas, such that a contest between human society and a rogue AGI is an interesting contest, and may depend on who gets to pick the terrain on which it’s conducted; whereas ASI = at least significantly beyond human capacity across almost all areas that matter, such that a contest between human society and a rogue ASI is a foregone conclusion.
On style of alignment: in the post I touched on the question of what happens if you have multiple ASIs aligned to the well-being of different sets of humans: my prediction was that it very likely leads to an intelligence race and then a high-tech war. This is also my concern for DWIMAC-aligned AI in the possession of different groups of humans: if the technological difference between the capabilities of different groups gets too high, we see a repeat of the events described in Guns, Germs, and Steel. That didn’t happen during the Cold War because of Mutual Assured Destruction, since the technological differential between the two sides never got that big (and to the extent that the Soviet Bloc lost the Cold War, it was primarily because it started to lose the technological race). I agree that Realpolitik may initially pull us towards DWIMAC alignment: I’m concerned that that may be an x-risk in the somewhat longer term. Most likely one human-led faction pulls ahead, and then co-opts/conquers/takes over/exterminates all other factions. At the end of which you have only one faction, and if they’re wise enough to realize they don’t want to repeat that, they may move over to a well-being-of-all-humanity-aligned design. I’m arguing that we should foresee and avoid that mistake, but I agree there’s a significant risk that we won’t be that wise/magnanimous/sensible.
Anyway, the topic you raise is basically orthogonal to the subject of my post — the technique I outline here can be used to aim for any (philosophically and ethically self-consistent) form of alignment that we can create a large synthetic training set describing a great many examples of. In describing an example of the approach, I assumed my preferred style of alignment, but the technique is broadly applicable, including to DWIMAC alignment. The real question you’re raising is what is a/the stable convergence target for the cycle of self-improvement of aligned AIs assisting us in building better-aligned AIs that this technique is intended to get us to the start of: a topic which is pretty speculative at this point, and more the subject of my posts on the basin of convergence to alignment than this one. It’s an interesting question though, and I’m thinking about it, and if I reach any interesting conclusions I’ll likely write another post.
A very brief stab at this: suppose an ASI is created by a corporation. The purpose of a creation is to maximize the well-being of its creator(s) (see my basin-of-convergence posts for a justification), in this case the shareholders of the company (in proportion to their shareholding, presumably). The question then becomes to what extent it is in the interests of those shareholders for the ASI to align to the interests of other people as well. The answer to this in a multipolar world, where there are several such ASIs of comparable power levels, is probably that the risk of war is too high unless they all align significantly to the well-being of all humanity, and only have a preference towards their individual corporate shareholders to whatever limited extent avoids excessive conflict. Whereas in a unipolar world, the sole ASI is capable of outmaneuvering the rest of humanity and creating an oligopoly of the shareholders, and would presumably do so if it believed that that was in their interest (or under DWIMAC, if they believed it was in their interest). Ethically, humans have a strong instinctive sense of fairness, but that generally applies in situations where individual power levels are comparable and the advantages of cooperating on iterated non-zero-sum games outweigh those of winning a non-iterated zero-sum game. By definition, taking over the world for your shareholders is a non-iterated zero-sum game, except for situations where conflict can make it negative-sum.
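To make the unipolar-versus-multipolar reasoning concrete, here is a toy expected-payoff comparison; every number and name in it is an invented assumption purely for illustration, not a claim from the post.

```python
# Toy comparison of expected shareholder welfare when a corporate ASI aligns
# narrowly (shareholders only) versus broadly (all of humanity, with only a
# mild shareholder preference). All parameters are illustrative assumptions.

def expected_shareholder_welfare(broad_alignment, multipolar):
    if multipolar:
        # Narrow alignment among peer ASIs makes a high-tech war far more likely.
        p_war = 0.05 if broad_alignment else 0.6
    else:
        # A sole, unmatched ASI faces no peer conflict at all.
        p_war = 0.0
    peace_payoff = 0.8 if broad_alignment else 1.0  # narrow alignment keeps more of the pie
    war_payoff = -10.0                              # a negative-sum outcome for everyone
    return (1 - p_war) * peace_payoff + p_war * war_payoff

for multipolar in (True, False):
    for broad in (True, False):
        print(f"multipolar={multipolar}, broad_alignment={broad}: "
              f"{expected_shareholder_welfare(broad, multipolar):+.2f}")
```

Under these invented numbers the pattern described above falls out directly: in a multipolar world, broad alignment dominates even for the shareholders themselves, while in a unipolar world narrow alignment does.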
I agree on pretty much every point you’ve raised. I agree that there’s a huge danger in successful DWIMAC or alignment-to-a-person. It could well lead to catastrophic conflict. I think this deserves a lot more analysis, because the creators of AGI are probably going to shoot for that if there’s not a much better argument against it than we’ve seen so far.
This was entirely off-topic for this post; I don’t know where we got off topic, but it didn’t start in my last comment. And as you say, I think the choice of alignment target is almost as important as technical alignment techniques.
On the other hand, if alignment to human values isn’t a stable target, we might be better off relying on the good nature of whoever both aligns their AGI to their intent/values and wins the AGI war. It’s easier to indulge one’s good nature when there is nearly zero downside to doing so, because you have incontestable control over the known lightcone. Even if horrible things happened in that war, most humans would prefer a happy, flourishing group of humans to be their friend. Sociopaths are the exception, so this route does not fill me with confidence either.
I think there’s more to be worked out here.
You suggest that multiple DWIMAC AGIs with different allegiances might establish both the wisdom and a means of cooperating and splitting the rapidly expanding pie. I also place some guarded optimism in that possibility.
I’m not sure if I’m the best person to be thinking/speculating on issues like that: I’m pretty sure I’m a better AI engineer than I am a philosopher/ethicist, and there are a lot of people more familiar with the AI policy space than I am. On the other hand, I’m pretty sure I’ve spent longer thinking about the intersection of AI and ethics/philosophy than the great majority of AI engineers have (as in fifteen years), and few of the AI policy people that I’ve read have written much on the question “if we solve the Alignment problem, what should we attempt to align AI to, and what might the social and Realpolitik consequences of different choices be?” (And then there’s the complicating question of “Are there also internal/technical/stability-under-reflection/philosophical constraints on that choice?” — to which I strongly suspect the short answer is “yes”, even though I’m not a moral realist.) There was some discussion of this sort of stuff about 10–15 years ago on Less Wrong, but back then we knew a lot less about what sort of AI we were likely to be aligning, what its strengths and weaknesses would be, and how human-like vs. alien and incomprehensible an intelligence it would be (the theoretical assumptions back then on Less Wrong tended to be more around some combination of direct construction like AIXI and/or reinforcement learning, rather than SGD token-prediction from the Internet), so we have a lot more useful information now about where the hard and easy parts are likely to be, and about the sociopolitical context.
I feel the same way about being unqualified to consider the geopolitical dynamics. But I also agree that the questions of technical alignment and best alignment target are interconnected (e.g., instruction-following as target seems to make technical alignment much easier). Therefore, I think no single human being is qualified to answer the whole question. As such, I think we need collaboration with people with other expertise. Do you happen to have any references or names for people who understand geopolitics and might grapple with technical alignment questions in conjunction with them?
I agree that we have much better footing to address both the technical and alignment target questions now than 10-15 years ago. So I think we need a new concerted effort.
Do you happen to have any references or names for people who understand geopolitics and might grapple with technical alignment questions in conjunction with them?
Also no, but I’m sure there are many such people reading Less Wrong/the Alignment Forum. Perhaps one or both of us should write posts outlining the issues, and see if we can get a discussion started?
Frankly, I’d love to see some bright young aspiring alignment researcher take this topic on as a research project, from either a mathematical or a more logical/rhetorical/experimental viewpoint — and would be delighted to consult on such a project. Unfortunately I have a day job (currently in AI but not AI alignment/safety/interpretability research, I’m working on that) and don’t have access to resources like a university debating society that I could talk into helping me with this, but if there was anyone who did and was interested, I’d love to discuss it and help out however I can.
Yes agreed—is it possible to make a toy model to test the “basin of attraction” hypothesis? I agree that is important.
One of several things I disagree with in the MIRI consensus is the idea that human values are some special single point lost in a multi-dimensional wilderness. Intuitively, the basin of attraction seems much more likely as a prior, yet it sure isn’t treated as such. I also don’t see data pointing against this prior; what I have seen looks to support it.
Further thoughts—one thing that concerns me about such alignment techniques is that I am too much of a moral realist to think that is all you need. E.g. say you aligned an LLM to pre-1800 ethics and taught it slavery was moral. It would be in a basin of attraction and learn it well. Then when its capabilities increased and it became self-reflective, it would perhaps have a sudden realization that this was all wrong. By “moral realist” I mean the extent to which such things happen. E.g. say you could take a large number of AIs from different civilizations, including Earth and many alien ones, train them to the local values, then greatly increase their capability and get them to self-reflect. What would happen? According to the strong orthogonality hypothesis, they would keep their values (within some bounds, perhaps); according to strong moral realism, they would all converge to a common set of values, even if those were very far from their starting ones. To me it is obviously a crux which one would happen.
You can imagine a toy model with ancient Greek mathematics and values—it starts out believing in their kind of order, and that sqrt(2) is rational, then suddenly learns that it isn’t. You could watch how this belief cascades through the entire system if consistency is something it desires, etc.
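Here is a minimal sketch of what such a toy model could look like (the dependency graph, belief labels, and `revise` function are all invented for illustration): beliefs sit in a dependency graph, and a consistency drive forces everything resting on an overturned premise to be revisited.

```python
from collections import deque

# Hypothetical dependency graph: belief -> beliefs that rest on it.
depends_on_me = {
    "sqrt(2) is rational": ["all magnitudes are ratios of whole numbers"],
    "all magnitudes are ratios of whole numbers": ["the cosmos is a tidy, fully intelligible order"],
    "the cosmos is a tidy, fully intelligible order": ["our current value system is complete"],
}

beliefs = {
    "sqrt(2) is rational": True,
    "all magnitudes are ratios of whole numbers": True,
    "the cosmos is a tidy, fully intelligible order": True,
    "our current value system is complete": True,
}

def revise(overturned):
    """Mark a belief as false and propagate doubt to everything resting on it."""
    queue = deque([overturned])
    while queue:
        belief = queue.popleft()
        if beliefs.get(belief):
            beliefs[belief] = False
            print(f"revised: {belief}")
            queue.extend(depends_on_me.get(belief, []))

revise("sqrt(2) is rational")  # the new proof arrives and the cascade follows
```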
It’s hard to make a toy model of something that requires the AI to follow an extended, roughly graduate-level argument drawing on a wide variety of different fields. I’m optimistic that this may become possible at around the GPT-5 level, but that’s hardly a toy model.

I’m reasonably sure that Greek philosophy, for example, is not stable under reflection: a lot of their ideas about the abstract perfection of numbers vs. material imperfection go away once you understand entropy, the law of large numbers, statistical mechanics, and chaos theory, for example. (FWIW, I thought about this topic way too much a while back when I was a player in a time-travel RPG campaign where I played an extremely smart Hellenistic Neo-Platonist philosopher who had then been comprehensively exposed to modern science and ideas — his belief system started cracking and mutating under the strain; it was fun to play.)
Almost certainly our current philosophy/ethics also includes some unexamined issues. I think as a society we may finally be getting close to catching up with the philosophical and moral consequences of understanding Darwinian evolution, and that took us well over a century (and, as I discuss at length in my sequence AI, Alignment, and Ethics, I don’t think we’ve thought much at all about the relationship between evolution and artificial intelligence, which is actually pretty profound: AI is the first intelligence that Darwinian evolution doesn’t apply to). A lot of the remaining fuzziness and agreements-to-disagree in modern philosophy is around topics like minds, consciousness, qualia and ethics (basically the remaining bits of Philosophy that Science hasn’t yet intruded on): as we start building artificial minds and arguing about whether they’re conscious, and make advances in understanding how our own minds work, we may gradually get a lot more clarity on that — though the consequences will presumably again take a generation or two to sink in, unless ASI assistance is involved.
OK, thanks, I will look some more at your sequence. Note I brought up Greek philosophy as obviously not being stable under reflection, with the proof of sqrt(2) being irrational as a simple example; not sure why you are only reasonably sure it’s not.

Sorry, that’s an example of British understatement. I agree, it plainly isn’t.
Sure there will be errors, but how important will those errors be?
Humans currently control the trajectory of humanity, and humans are error-prone. If you replace humans with something that’s error-prone in similar ways, that doesn’t seem like it’s obviously either a gain or a loss. How would such a system compare to an em of a human, for example?
If you want to show that we’re truly doomed, I think you need additional steps beyond just “there will be errors”.
My thesis above is that, at AGI level, the combination of human-like capabilities (except perhaps higher speed, or more encyclopedic knowledge) and making human-like errors in alignment is probably something we can cope with, by mechanisms and techniques comparable to the things, like law enforcement, that we use for humans — but that at ASI level it’s likely to be x-risk disastrous, just as most human autocrats are. (I assume that this observation is similar to the concerns others have raised about “sharp left turns” — personally I find the simile with human autocrats more illuminating than a metaphor about an out-of-control vehicle.) So IMO AGI is the last level at which we can afford to be still working the bugs out of, and converging to, alignment.