This is an interesting historical perspective… But it’s not really what the fundamental case for AGI doom routes through. In particular: AGI doom is not about “AI systems”, as such.
AGI doom is, specifically, about artificial generally intelligent systems capable of autonomously optimizing the world the way humans can, and who are more powerful at this task than humans. The AGI-doom arguments do not necessarily have anything to do with the current SoTA ML models.
Case in point: A manually written FPS bot is technically “an AI system”. However, I think you’d agree that the AGI-doom arguments were never about this type of system, despite it falling under the broad umbrella of “an AI system”.
Similarly, if a given SoTA ML model architecture fails to meet the definition of “a generally intelligent system capable of autonomously optimizing the world the way humans can”, then the AGI doom is not about it. The details of its workings, therefore, have little to say, one way or another, about the AGI doom.
Why are the AGI-doom concerns extended to the current AI-capabilities research, then, if the SoTA models don’t fall under said concerns? Well, because building artificial generally intelligent systems is something the AGI labs are specifically and deliberately trying to do. Inasmuch as the SoTA models are not the generally intelligent systems that are within the remit of the AGI-doom arguments, and are instead some other type of systems, the current AGI labs view this as their failure that they’re doing their best to “fix”.
And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they’re claims that any “artificial generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent. Which, in turn, means that any system that has the set of capabilities the AI researchers ultimately want their AI models to have, would inevitably have a set of potentially omnicidal failure modes.
In other words: The set of AI systems defined by “a generally intelligent world-optimization-capable agent”, and the set of AI systems defined by “the subject of fundamental AGI-doom arguments”, is the same set of systems. You can’t have the former without the latter. And the AI industry wants the former; therefore, the arguments go, it will unleash the latter on the world.
While, yes, the current SoTA models are not subjects of the AGI-doom arguments, that doesn’t matter, because the current SoTA models are incidental research artefacts produced on the AI industry’s path to building an AGI. The AGI-doom arguments apply to the endpoint of that process, not the messy byproducts.
So any evidence we uncover about how the current models are not dangerous the way AGI-doom arguments predict AGIs to be dangerous, is just evidence that they’re not AGI yet. It’s not evidence that AGI would not be dangerous. (Again: FPS bots’ non-dangerousness isn’t evidence that AGI would be non-dangerous.)
(I’d written some more about this topic here. See also gwern’s Why Tool AIs Want to Be Agent AIs for more arguments regarding why AI research’s endpoint would be an AI agent, instead of something as harmless and compliant as the contemporary models.)
Counterarguments to AGI-doom arguments that focus on pointing to the SoTA models, as such, miss the point. Actual counterarguments would instead find some way to argue that “generally intelligent world-optimizing agents” and “subjects of AGI-doom arguments” are not the exact same type of system; that you can, in theory, have the former without the latter. I have not seen any such argument, and the mathematical noose around them is slowly tightening (uh, by which I mean: their impossibility may be formally provable).
There is a difference between the claim that powerful agents are approximately well-described as being expected utility maximizers (which may or may not be true) and the claim that AGI systems will have an explicit utility function the moment they’re turned on, and maximize that function from that moment on.
I think this is the assumption OP is pointing out: “most of the book’s discussion of AI risk frames the AI as having a certain set of goals from the moment it’s turned on, and ruthlessly pursuing those to the best of its ability”. “From the moment it’s turned on” is pretty important, because it rules out value learning as a solution.
And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they’re claims that any “artificial generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent. Which, in turn, means that any system that has the set of capabilities the AI researchers ultimately want their AI models to have, would inevitably have a set of potentially omnicidal failure modes.
If you drop the “artificially” from the claim, you are left with a claim that any “generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent. Do you endorse that claim, or do think that there is some particular reason a biological or hybrid generally intelligent system capable of autonomously optimizing the world the way a human or an organization based on humans might not be well-approximated as a game-theoretic agent?
Because humans sure don’t seem like paperclipper-style utility maximizers to me.
Do you endorse [the claim that any “generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent?]
Yes.
Because humans sure don’t seem like paperclipper-style utility maximizers to me.
Humans are indeed hybrid systems. But I would say that inasmuch as they act as generally intelligent systems capable of autonomously optimizing the world in scarily powerful ways, they do act as game-theoretic agents. E. g., people who are solely focused on resource accumulation, and don’t have self-destructive vices or any distracting values they’re not willing to sacrifice to Moloch, tend to indeed accumulate power at a steady rate. At a smaller scope, people tend to succeed at those of their long-term goals that they’ve clarified for themselves and doggedly pursue; and not succeed at them if they flip-flop between different passions on a daily basis.
I’ve been meaning to do some sort of literature review solidly backing this claim, actually, but it hasn’t been a priority for me. Hmm, maybe it’d be easy with the current AI tools...
By “hybrid system” I actually meant “system composed of multiple humans plus external structure”, sorry if that was unclear. Concretely I’m thinking of things like “companies” and “countries”.
people who are solely focused on resource accumulation, and don’t have self-destructive vices or any distracting values they’re not willing to sacrifice to Moloch, tend to indeed accumulate power at a steady rate.
I don’t see how one gets from this observation to the conclusion that humans are well-approximated as paperclipper-style agents.
I suppose it may be worth stepping back to clarify that when I say “paperclipper-style agents”, I mean “utility maximizers whose utility function is a function of the configuration of matter at some specific time in the future”. That’s a super-finicky-sounding definition but my understanding is that you have to have a definition that looks like that if you want to use coherence theorems, and otherwise you end up saying that a rock is an agent that maximizes the utility function “behave like a rock”.
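To gesture at that definition a bit more concretely (a rough formalization of my own, not something lifted from the coherence literature): such an agent picks its policy according to something like

$$\pi^* \in \arg\max_{\pi}\; \mathbb{E}\big[\,U(s_T) \mid \pi\,\big],$$

where $s_T$ is the configuration of matter at some fixed future time $T$ and $U$ depends only on $s_T$, not on how the world got there. A rock doesn’t satisfy this in any non-trivial way, which is the point of making the definition this finicky.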
It does not seem to me that very many humans are trying to maximize the resources under their control at the time of their death, nor does it seem like the majority of the resources in the world are under the control of the few people who have decided to do that. It is the case that people who care at all about obtaining resources control a significant fraction of the resources, but I don’t see a trend where the people who care maximally about controlling resources actually control a lot more resources than the people who care somewhat about controlling resources, as long as they still have time to play a round of golf or do whatever else they enjoy.
I like this exchange and the clarifications on both sides. I’ll add my response:
You’re right that coherence arguments work by assuming a goal is about the future. But preferences over a single future timeslice are too specific; the arguments still work if it’s multiple timeslices, or an integral over time, or larger time periods that are still in the future. The argument starts breaking down only when the agent has strong preferences over immediate actions, and those preferences are stronger than any preferences over the future-that-is-causally-downstream-from-those-actions. But even then it could be reasonable to model the system as a coherent agent during the times when its actions aren’t determined by near-term constraints, when longer-term goals dominate.
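To make the shape of that concrete (a rough sketch, not a precise theorem statement): the arguments still go through for goals roughly of the form

$$U(\tau) \;=\; \sum_{t > t_0} u_t(s_t) \qquad\text{or}\qquad U(\tau) \;=\; \int_{t_0}^{T} u(s_t)\,dt,$$

i.e. anything that scores future world-states, however spread out in time. They start failing when the dominant term is something like $c(a_{t_0})$, a preference over the immediate action itself rather than over anything causally downstream of it.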
(a relevant part of Eliezer’s recent thread is “then probably one of those pieces runs over enough of the world-model (or some piece of reality causally downstream of enough of the world-model) that It can always do a little better by expending one more erg of energy.”, but it should be read in context)
Another missing piece here might be: The whole point of building an intelligent agent is that you know more about the future-outcomes you want than you do about the process to get there. This is the thing that makes agents useful and valuable. And it’s the main thing that separates agents from most other computer programs.
On the other hand, it does look like the anti-corrigibility results can be overcome by sometimes having strong preferences over intermediate times (i.e. over particular ways the world should go) rather than final-outcomes. This does seem important in terms of alignment solutions. And it takes some steam out of the arguments that go “coherent therefore incorrigible” (or it at least should add some caveats). But this only helps us if we have a lot of control over the preferences&constraints of the agent, and it has a couple of stability properties.
I like this exchange and the clarifications on both sides.
Yeah, it feels like it’s getting at a crux between the “backchaining / coherence theorems / solve-for-the-equilibrium / law thinking” cluster of world models and the “OODA loop / shard theory / interpolate and extrapolate / toolbox thinking” cluster of world models.
You’re right that coherence arguments work by assuming a goal is about the future. But preferences over a single future timeslice are too specific; the arguments still work if it’s multiple timeslices, or an integral over time, or larger time periods that are still in the future. The argument starts breaking down only when the agent has strong preferences over immediate actions, and those preferences are stronger than any preferences over the future-that-is-causally-downstream-from-those-actions
Humans do seem to have strong preferences over immediate actions. For example, many people prefer not to lie, even if they think that lying will help them achieve their goals and they are confident that they will not get caught.
I expect that in multi-agent environments, there is significant pressure towards legibly having these kinds of strong preferences over immediate actions. As such, I expect that sort of structure to show up in future intelligent agents, rather than being a human-specific anomaly.
But even then it could be reasonable to model the system as a coherent agent during the times when its actions aren’t determined by near-term constraints, when longer-term goals dominate. [...] The whole point of building an intelligent agent is that you know more about the future-outcomes you want than you do about the process to get there.
I expect that agents which predictably behave in the way EY describes as “going hard” (i.e. attempting to achieve their long-term goal at any cost) will find it harder to find other agents who will cooperate with them. It’s not a binary choice between “care about process” and “care about outcomes”—it is possible and common to care about outcomes, and also to care about the process used to achieve those outcomes.
On the other hand, it does look like the anti-corrigibility results can be overcome by sometimes having strong preferences over intermediate times (i.e. over particular ways the world should go) rather than final-outcomes.
Yeah. Or strong preferences over processes (although I suppose you can frame a preference over process as a preference over there not being any intermediate time where the agent is actively executing some specific undesired behavior).
But this only helps us if we have a lot of control over the preferences&constraints of the agent,
It does seem to me that “we have a lot of control over the approaches the agent tends to take” is true and becoming more true over time.
and it has a couple of stability properties.
I doubt that systems trained with ML techniques have these properties. But I don’t think e.g. humans or organizations built out of humans + scaffolding have these properties either, and I have a sneaking suspicion that the properties in question are incompatible with competitiveness.
Humans do seem to have strong preferences over immediate actions.
I know, sometimes they do. But if they always did then they would be pretty useless. Habits are another example. Robust-to-new-obstacles behavior tends to be driven by future goals.
I expect that in multi-agent environments, there is significant pressure towards legibly having these kinds of strong preferences over immediate actions. As such, I expect that sort of structure to show up in future intelligent agents, rather than being a human-specific anomaly.
Yeah same. Although legible commitments or decision theory can serve the same purpose better, it’s probably harder to evolve because it depends on higher intelligence to be useful. The level of transparency of agents to each other and to us seems to be an important factor. Also there’s some equilibrium, e.g. in an overly honest society it pays to be a bit more dishonest, etc.
It does unfortunately seem easy and useful to learn rules like honest-to-tribe or honest-to-people-who-can-tell or honest-unless-it’s-really-important or honest-unless-I-can-definitely-get-away-with-it.
attempting to achieve their long-term goal at any cost
I think if you remove “at any cost”, it’s a more reasonable translation of “going hard”. It’s just attempting to achieve a long-term goal that is hard to achieve. I’m not sure what “at any cost” adds to it, but I keep on seeing people add it, or add monomaniacally, or ruthlessly. I think all of these are importing an intuition that shouldn’t be there. “Going hard” doesn’t mean throwing out your morality, or sacrificing things you don’t want to sacrifice. It doesn’t mean being selfish or unprincipled such that people don’t cooperate with you. That would defeat the whole point.
It’s not a binary choice between “care about process” and “care about outcomes”—it is possible and common to care about outcomes, and also to care about the process used to achieve those outcomes.
Yes!
It does seem to me that “we have a lot of control over the approaches the agent tends to take” is true and becoming more true over time.
No!
I doubt that systems trained with ML techniques have these properties. But I don’t think e.g. humans or organizations built out of humans + scaffolding have these properties either
Yeah mostly true probably.
and I have a sneaking suspicion that the properties in question are incompatible with competitiveness.
I’m talking about stability properties like “doesn’t accidentally radically change the definition of its goals when updating its world-model by making observations”. I agree properties like this don’t seem to be on the fastest path to build AGI.
[faul_sname] Humans do seem to have strong preferences over immediate actions.
[Jeremy Gillen] I know, sometimes they do. But if they always did then they would be pretty useless. Habits are another example. Robust-to-new-obstacles behavior tends to be driven by future goals.
Point of clarification: the type of strong preferences I’m referring to are more deontological-injunction shaped than they are habit-shaped. I expect that a preference not to exhibit the behavior of murdering people would not meaningfully hinder someone whose goal was to get very rich from achieving that goal. One could certainly imagine cases where the preference not to murder caused the person to be less likely to achieve their goals, but I don’t expect that it would be all that tightly binding of a constraint in practice, and so I don’t think pondering and philosophizing until they realize that they value one murder at exactly -$28,034,771.91 would meaningfully improve that person’s ability to get very rich.
I think if you remove “at any cost”, it’s a more reasonable translation of “going hard”. It’s just attempting to achieve a long-term goal that is hard to achieve.
I think there’s more to the Yudkowsky definition of “going hard” than “attempting to achieve hard long-term goals”. Take for example:
@ESYudkowsky Mossad is much more clever and powerful than novices implicitly imagine a “superintelligence” will be; in the sense that, when novices ask themselves what a “superintelligence” will be able to do, they fall well short of the actual Mossad.
@ESYudkowsky Why? Because Mossad goes hard; and people who don’t go hard themselves, have no simple mental motion they can perform—no simple switch they can access—to imagine what it is actually like to go hard; and what options become available even to a mere human when you do.
My interpretation of the specific thing that made Mossad’s actions an instance of “going hard” here was that they took actions that most people would have thought of as “off limits” in the service of achieving their goal, and that doing so actually helped them achieve it (and that it actually worked out for them—we don’t generally say that Elizabeth Holmes “went hard” with Theranos). The supply chain attack in question does demonstrate significant technical expertise, but it also demonstrates a willingness to risk provoking parties that were uninvolved in the conflict in order to achieve their goals.
Perhaps instead of “attempting to achieve the goal at any cost” it would be better to say “being willing to disregard conventions and costs imposed on uninvolved parties, if considering those things would get in the way of the pursuit of the goal”.
[faul_sname] It does seem to me that “we have a lot of control over the approaches the agent tends to take” is true and becoming more true over time.
[Jeremy Gillen] No!
I suspect we may be talking past each other here. Some of the specific things I observe:
RLHF works pretty well for getting LLMs to output text which is similar to text which has been previously rated as good, and dissimilar to text which has previously been rated as bad. It doesn’t generalize perfectly, but it does generalize well enough that you generally have to use adversarial inputs to get it to exhibit undesired behavior—we call them “jailbreaks” not “yet more instances of bomb creation instructions”.
Along those lines, RLAIF also seems to Just Work™.
And the last couple of years have been a parade of “the dumbest possible approach works great actually” results, e.g.
“Sure, fine-tuning works, but what happens if we just pick a few thousand weights and only change those weights, and leave the rest alone?” (Answer: it works great)
“I want outputs that are more like thing A and less like thing B, but I don’t want to spend a lot of compute on fine tuning. Can I just compute both sets of activations and subtract the one from the other?” (Answer: Yep! See the sketch after this list.)
“Can I ask it to write me a web application from a vague natural language description, and have it make reasonable choices about all the things I didn’t specify” (Answer: astonishing amounts of yes)
Take your pick of the top chat-tuned LLMs. If you ask it about a situation and ask what a good course of action would be, it will generally give you pretty sane answers.
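(Here’s a minimal, self-contained sketch of the mechanics behind the “subtract the activations” item above. The toy model, layer choice, and scale factor are made up for illustration; real implementations hook a transformer’s residual stream, but the mechanics are the same.)

```python
# Toy sketch of activation steering: record activations on two sets of inputs,
# take the difference of the means, and add it back in at inference time.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a network: two hidden blocks and an output layer.
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 16),
)
layer_of_interest = model[3]  # hook the activations after the second ReLU

captured = []

def capture_hook(module, inputs, output):
    captured.append(output.detach())  # record activations at this layer

# 1. Run "more like thing A" and "more like thing B" inputs, recording activations.
handle = layer_of_interest.register_forward_hook(capture_hook)
inputs_a = torch.randn(32, 16)  # stands in for prompts exemplifying behaviour A
inputs_b = torch.randn(32, 16)  # stands in for prompts exemplifying behaviour B
model(inputs_a)
model(inputs_b)
handle.remove()

# 2. The steering vector is the difference of the mean activations.
steering_vector = captured[0].mean(dim=0) - captured[1].mean(dim=0)

# 3. At inference time, add a scaled copy of that vector back into the layer.
def steering_hook(module, inputs, output):
    return output + 4.0 * steering_vector

handle = layer_of_interest.register_forward_hook(steering_hook)
steered_output = model(torch.randn(1, 16))
handle.remove()
```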
So from that, I conclude:
We have LLMs which understand human values, and can pretty effectively judge how good things are according to those values, and output those judgements in a machine-readable format
We are able to tune LLMs to generate outputs that are more like the things we rate as good and less like the things we rate as bad
Put that together and that says that, at least at the level of LLMs, we do in fact have AIs which understand human morality and care about it to the extent that “care about” is even the correct abstraction for the kind of thing they do.
I expect this to continue to be true in the future, and I expect that our toolbox will get better faster than the AIs that we’re training get more capable.
I’m talking about stability properties like “doesn’t accidentally radically change the definition of its goals when updating its world-model by making observations”.
What observations lead you to suspect that this is a likely failure mode?
Yeah I’m on board with deontological-injunction shaped constraints. See here for example.
Perhaps instead of “attempting to achieve the goal at any cost” it would be better to say “being willing to disregard conventions and costs imposed on uninvolved parties, if considering those things would get in the way of the pursuit of the goal”.
Nah I still disagree. I think part of why I’m interpreting the words differently is because I’ve seen them used in a bunch of places, e.g. the Lightcone handbook to describe the Lightcone team. And to describe the culture of some startups (in a positively valenced way).
Being willing to be creative and unconventional—sure, but this is just part of being capable and solving previously unsolved problems. But disregarding conventions that are important for cooperation that you need to achieve your goals? That’s ridiculous.
Being willing to impose costs on uninvolved parties can’t be what is implied by ‘going hard’ because that depends on the goals. An agent that cares a lot about uninvolved parties can still go hard at achieving its goals.
I suspect we may be talking past each other here.
Unfortunately we are not. I appreciate the effort you put into writing that out, but that is the pattern that I understood you were talking about, I just didn’t have time to write out why I disagreed.
I expect this to continue to be true in the future
This is the main point where I disagree. The reason I don’t buy the extrapolation is that there are some (imo fairly obvious) differences between current tech and human-level researcher intelligence, and those differences appear like they should strongly interfere with naive extrapolation from current tech. Tbh I thought things like o1 or alphaproof might cause the people who naively extrapolate from LLMs to notice some of these, because I thought they were simply overanchoring on current SoTA, and since the SoTA has changed I thought they would update fast. But it doesn’t seem to have happened much yet. I am a little confused by this.
What observations lead you to suspect that this is a likely failure mode?
I didn’t say likely, it’s more an example of an issue that comes up so far when I try to design ways to solve other problems. Maybe see here for instabilities in trained systems, or here for more about that particular problem.
I’m going to drop out of this conversation now, but it’s been good, thanks! I think there are answers to a bunch of your claims in my misalignment and catastrophe post.
I largely concur, but I think the argument is simpler and more intuitive. I want to boil this down a little and try to state it in plainer language:
Arguments for doom as a default apply to any AI that has unbounded goals and pursues those goals more competently than humans. Maximization, coherence, etc. are not central pieces.
Current AI doesn’t really have goals, so it’s not what we’re worried about. But we’ll give AI goals, because we want agents to get stuff done for us, and giving them goals seems necessary for that. All of the concerns for the doom argument will apply to real AI soon enough.
However, current AI systems may suggest a route to AGI that dodges some of the more detailed doom arguments. Their relative lack of inherent goal-directedness and relative skill at following instructions true to their intent (and the human values behind them) may be cause for guarded optimism. One of my attempts to explain this is The (partial) fallacy of dumb superintelligence.
In a different form, the doom as default argument is:
IF an agent is smarter/more competent than you and
Has goals that conflict with yours
It will outsmart you somehow, eventually (probably soon)
It will achieve its goals and you will correspondingly not achieve yours
If its goals are unbounded and it “cares” about your goals near zero,
You will lose everything
Arguments that “we’re training it on human data so it will care about our values above zero” are extremely speculative. They could be true, but betting the future of humanity on it without thinking it through seems very, very foolish.
That’s my attempt at the simplest form of the doom by default argument
Just to point out the one distinction: I make no reference to game theoretic agents or coherence theorems. I think these are unnecessary distractions from the core argument. An agent that has weird and conflicting goals (and so isn’t coherent or a perfect game-theoretic agent) will still take all of your stuff if its set of goals and values doesn’t weigh human property rights or human wellbeing very highly. That’s why we take the alignment problem to be the central problem in surviving AGI.
The other question implicit in this post was, why would we make AI less safe than current systems, which would remain pretty safe even if they were a lot smarter.
Asking why in the world humans would make AI with its own goals is like asking why in the world we’d create dynamite, much less nukes: because it will help humans accomplish their goals, until it doesn’t; and it’s as easy as calling your safe oracle AI (e.g., really good LLM) repeatedly with “what would an agent trying to accomplish X do with access to tools Y?” and passing the output to those tools. Agency is a one-line extension, and we’re not going to just not bother.
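(As a hedged illustration of how short that extension is; `query_llm` and `run_tool` below are illustrative stand-ins rather than any real API:)

```python
# Sketch of "agency is a one-line extension": wrap a (hypothetical) oracle-style
# model call in a loop that feeds its suggestions to tools.

def query_llm(prompt: str) -> str:
    # Stand-in for a call to a capable but non-agentic model.
    return "search: current price of GPUs"

def run_tool(tool_name: str, argument: str) -> str:
    # Stand-in for executing a shell command, browser action, API call, etc.
    return f"(result of {tool_name!r} on {argument!r})"

def agentic_wrapper(goal: str, tools: list[str], max_steps: int = 5) -> str:
    history = ""
    for _ in range(max_steps):
        suggestion = query_llm(
            f"What would an agent trying to accomplish {goal!r} do next, "
            f"given access to tools {tools} and this history?\n{history}"
        )
        # Naively assume the model answers in a "tool: argument" format.
        tool_name, _, argument = suggestion.partition(":")
        result = run_tool(tool_name.strip(), argument.strip())
        history += f"\nAction: {suggestion}\nResult: {result}"
    return history

print(agentic_wrapper("buy low, sell high", ["search", "trade"]))
```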
I like your comment, but I do want to comment this:
Arguments that “we’re training it on human data so it will care about our values above zero” are extremely speculative. They could be true, but betting the future of humanity on it without thinking it through seems very, very foolish.
has evidence against it, fortunately for us.
I summarize the evidence for the pretty large similarities between the human brain and current DL systems, which allows us to transport insights from AI into neuroscience and vice versa, here:
But the point here is that one of the lessons from AI that is likely to transfer over to human values is that the data matters way more than the algorithm, optimizer, architecture, or hyperparameter choices.
I don’t go as far as this link does in claiming that the “it” in AI models is the data set, but I think a weaker version of this is basically right, and thus the bitter lesson holds for human values too:
While 2024 SoTA models are not capable of autonomously optimizing the world, they are really smart, perhaps 1⁄2 or 2⁄3 of the way there, and already beginning to make big impacts on the economy. As I said in response to your original post, because we don’t have 100% confidence in the coherence arguments, we should take observations about the coherence level of 2024 systems as evidence about how coherent the 203X autonomous corporations will need to be. Evidence that 2024 systems are not dangerous is both evidence that they are not AGI and evidence that AGI need not be dangerous.
I would agree with you if the coherence arguments were specifically about autonomously optimizing the world and not about autonomously optimizing a Go game or writing 100-line programs, but this doesn’t seem to be the case.
This is just a conjecture, and there has not really been significant progress on the agent-like structure conjecture. I don’t think it’s fair to say we’re making good progress on a proof.
This might be fine if proving things about the internal structure of an agent is overkill and we just care about behavior? In this world what the believers in coherence really need to show is that almost all agents getting sufficiently high performance on sufficiently hard tasks score high on some metric of coherence. Then for the argument to carry through you need to show they are also high on some metric of incorrigibility, or fragile to value misspecification. None of the classic coherence results quite hit this.
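(One hedged way to write the shape of the claim that would be needed, as a paraphrase rather than any existing theorem:

$$\forall \pi:\quad \mathrm{Perf}_{\mathcal{D}}(\pi) \ge p \;\Longrightarrow\; \mathrm{Coh}(\pi) \ge c(p),$$

for some distribution $\mathcal{D}$ over sufficiently hard long-horizon tasks, some behavioral coherence metric $\mathrm{Coh}$, and some $c(p)$ that grows as the performance bar $p$ grows. A second step would then have to connect high $\mathrm{Coh}$ to incorrigibility or fragility to value misspecification.)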
However AFAIK @Jeremy Gillen does not think we can get an argument with exactly this structure (the main argument in his writeup is a bit different), and Eliezer has historically and recently made the argument that EU maximization is simple and natural. So maybe you do need this argument that an EU maximization algorithm is simpler than other algorithms, which seems like it needs some clever way to formalize it, because proving things about the space of all simple programs seems too hard.
I think you are mischaracterizing my beliefs here.
“almost all agents getting sufficiently high performance on sufficiently hard tasks score high on some metric of coherence.”
This seems right to me. Maybe see my comment further up, I think it’s relevant to arguments we’ve had before.
This might be fine if proving things about the internal structure of an agent is overkill and we just care about behavior?
We can’t say much about the detailed internal structure of an agent, because there’s always a lot of ways to implement an algorithm. But we do only care about (generalizing) behavior, so we only need some very abstract properties relevant to that.
And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they’re claims that any “artificial generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent. Which, in turn, means that any system that has the set of capabilities the AI researchers ultimately want their AI models to have, would inevitably have a set of potentially omnicidal failure modes.
This is my crux with people who have 90+% P(doom): will vNM expected utility maximization be a good approximation of the behavior of TAI? You argue that it will, but I expect that it won’t.
My thinking related to this crux is informed less by the behaviors of current AI systems (although they still influence it to some extent) than by the failure of the agent foundations agenda. The dream 10 years ago was that if we started by modeling AGI as an vNM expected utility maximizer, and then gradually added more and more details to our model to account for differences between the idealized model and real-world AI systems, we would end up with an accurate theoretical system for predicting the behaviors AGI would exhibit. It would be a similar process to how physicists start with an idealized problem setup and add in details like friction or relativistic corrections.
But that isn’t what ended up happening. Agent foundations researchers ended up getting stuck on the cluster of problems collectively described as embedded agency, unable to square the dualistic assumptions of expected utility theory and Bayesianism with the embedded structure of real-world AI systems. The sub-problems of embedded agency are many and too varied to allow one elegant theorem to fix everything. Instead, they point to a fundamental flaw in the expected utility maximizer model, suggesting that it isn’t as widely applicable as early AI safety researchers thought.
The failure of the agent foundations agenda has led me to believe that expected utility maximization is only a good approximation for mostly-unembedded systems, and that an accurate theoretical model of advanced AI behavior (if such a thing is possible) would require a fundamentally different, less dualistic set of concepts. Coherence theorems and decision-theoretic arguments still rely on the old, unembedded assumptions and therefore don’t provide an accurate predictive model.
I agree that the agent-foundations research has been somewhat misaimed from the start, but I buy this explanation of John’s regarding where it went wrong and how to fix it. Basically, what we need to figure out is a theory of embedded world-modeling, which would capture the aspect of reality where the universe naturally decomposes into hierarchically arranged sparsely interacting subsystems. Our agent would then be a perfect game-theoretic agent, but defined over that abstract (and lazy) world-model, rather than over the world directly.
This would take care of agents needing to be “bigger” than the universe, counterfactuals, the “outside-view” problem, the realizability and the self-reference problems, the problem of hypothesis spaces, and basically everything else that’s problematic about embedded agency.
A theory of embedded world-modeling would be an improvement over current predictive models of advanced AI behavior, but it wouldn’t be the whole story. Game theory makes dualistic assumptions too (e.g., by treating the decision process as not having side effects), so we would also have to rewrite it into an embedded model of motivation.
Cartesian frames are one of the few lines of agent foundations research in the past few years that seem promising, due to allowing for greater flexibility in defining agent-environment boundaries. Preferably, we would have a model that lets us avoid having to postulate an agent-environment boundary at all. Combining a successor to Cartesian frames with an embedded theory of motivation, likely some form of active inference, might give us an accurate overarching theory of embedded behavior.
It turns out that in an idealized model of intelligent AI, we can remove the dualistic assumptions of game theory by instead positing a reflective oracle. The reflective oracle is allowed randomness in the territory (not just uncertainty in the map) to prevent paradoxes, and its randomized answers are exactly the Nash equilibria of game theory: there is a one-to-one correspondence between reflective oracles and Nash equilibria.
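(For reference, a rough statement of the definition as I understand it from the reflective-oracle literature, ignoring non-halting subtleties: for a probabilistic oracle machine $M$ and rational $p$,

$$O(M, p) = \begin{cases} 1 & \text{if } \Pr[M^{O} = 1] > p,\\ 0 & \text{if } \Pr[M^{O} = 1] < p,\\ \text{possibly randomized} & \text{if } \Pr[M^{O} = 1] = p, \end{cases}$$

and it’s exactly the freedom to randomize in the boundary case that lets consistent answers exist for machines that query the oracle about each other, which is where the correspondence with Nash equilibria comes from.)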
Of course, whether it can transfer to our reality at all is pretty sketchy at best, but at least there is a solution at all:
The reflective oracle model doesn’t have all the properties I’m looking for—it still has the problem of treating utility as the optimization target rather than as a functional component of an iterative behavior reinforcement process. It also treats the utilities of different world-states as known ahead of time, rather than as the result of a search process, and assumes that computation is cost-free. To get a fully embedded theory of motivation, I expect that you would need something fundamentally different from classical game theory. For example, it probably wouldn’t use utility functions.
Re treating utility as the optimization target, I think this isn’t properly speaking an embedded agency problem, but rather an empirical problem of what the first AIs that automate everything will look like algorithmically: there are algorithms that can be embedded in reality and that do optimize the utility/reward, like MCTS, while TurnTrout limits the post to the model-free policy-gradient case, like PPO and REINFORCE.
TurnTrout is correct to point out that not all RL algorithms optimize for the reward, and reward isn’t what the agent optimizes for by definition, but I think that it’s too limited in describing when RL does optimize for the utility/reward.
So I think the biggest difference between @TurnTrout and people like @gwern et al is whether or not model-based RL that does plan or model-free RL policy gradient algorithms come to dominate AI progress over the next decade.
Agreed that treating the utilities of different world states as known, and treating computation as free, makes it a very unrealistic model for human beings. Something like the reflective oracle model is only a possibility if we warped the laws of physics severely enough that we don’t have to care about the cost of computation at all, which is what lets us go from treating utilities as unknown to known in one step, and this is an actual reason why I don’t expect the reflective oracle model to transfer to reality at all.
find some way to argue that “generally intelligent world-optimizing agents” and “subjects of AGI-doom arguments” are not the exact same type of system
We could maybe weaken this requirement? Perhaps it would suffice to show/argue that it’s feasible[1] to build any kind of “acute-risk-period-ending AI”[2] that is not a “subject of AGI-doom arguments”?
If I became convinced that it’s feasible to build such a “pivotal AI” that is not “subject to AGI doom arguments”, I think that would shift a bunch of my probability mass from “we die due to unaligned AI” to “we die-or-worse due to misaligned humans controlling ASI” and “utopia”.
My short answer is that the argument would be that human values are quite simple and most likely a reasonably natural abstraction, and that the felt complexity comes from counting both the complexity of the generators and of the data, which people wouldn’t do for AI capabilities, meaning the bitter lesson holds for human values and morals as well.
Also, how an AI ends up aligned depends far more on the data it is given, and our control over synthetic data means we can get AIs that follow human values before they get capable enough to take over everything. Evolutionary psychology mispredicted this and the above point pretty hard, making it lose many Bayes points compared to the Universal Learning Machine/Blank Slate hypotheses.
Alignment generalizes further than capabilities, for pretty deep reasons (contra Nate Soares): basically, it’s way easier to have an AI care about human values than to get it to be capable in real-world domains, combined with verification being easier than generation.
Finally, there is evidence that AIs are far more robust to errors than people thought 15-20 years ago.
In essence, it’s a negation of the following:
Fragility and Complexity of Value
Pretty much all of evolutionary psychology literature.
Mm, there are two somewhat different definitions of what counts as “a natural abstraction”:
I would agree that human values are likely a natural abstraction in the sense that if you point an abstraction-learning algorithm at the dataset of modern humans doing things, “human values” and perhaps even “eudaimonia” would fall out as a natural principal component of that dataset’s decomposition.
What I wouldn’t agree with is that human values are a natural abstraction in the sense that a mind pointed at the dataset of this universe doing things, or at the dataset of animals doing things, or even at the dataset of prehistoric or medieval humans doing things, would learn modern human values.
Let’s step back a bit.
Suppose we have a system Alpha and a system Beta, with Beta embedded in Alpha. Alpha starts out with a set of natural abstractions/subsystems. Beta, if it’s an embedded agent, learns these abstractions, and then starts executing actions within Alpha that alter its embedding environment. Over the course of that, Beta creates new subsystems, corresponding to new abstractions.
As concrete examples, you can imagine:
The lifeless universe as Alpha (with abstractions like “stars”, “gasses”, “seas”), and the biosphere as Beta (creating abstractions like “organisms” and “ecosystems” and “predator” and “prey”).
The biosphere as Alpha (with abstractions like “food” and “species”) and the human civilization as Beta (with abstractions like “luxury” and “love” and “culture”).
Notice one important fact: the abstractions Beta creates are not, in general, easy to predict from the abstractions already in Alpha. “A multicellular organism” or “an immune-system virus” do not naturally fall out of descriptions of geological formations and atmospheric conditions. They’re highly contingent abstractions, ones that are very sensitive to the exact conditions in which they formed. (Biochemistry, the broad biosphere the system is embedded in...)
Similarly, things like “culture” or “eudaimonia” or “personal identity”, the way humans understand them, don’t easily fall out of even the abstractions present in the biosphere. They’re highly contingent on the particulars of how human minds and bodies are structured, how they exchange information, et cetera.
In particular: humans, despite being dropped into an abstraction-rich environment, did not learn values that just mirror some abstraction present in the environment. We’re not wrapper-minds single-mindedly pursuing procreation, or the eradication of predators, or the maximization of the number of stars. Similarly, animals don’t learn values like “compress gasses”.
What Beta creates are altogether new abstractions defined in terms of complicated mixes of Alpha’s abstractions. And if Beta is the sort of system that learns values, it learns values that wildly mix the abstractions present in Beta. These new abstractions are indeed then just some new natural abstraction. But they’re not necessarily “simple” in terms of Alpha’s abstractions.
And now we come to the question of what values an AGI would learn. I would posit that, on the current ML paradigm, the setup is the basic Alpha-and-Beta setup, with the human civilization being Alpha and the AGI being Beta.
Yes, there are some natural abstractions in Alpha, like “eudaimonia”. But to think that the AGI would just naturally latch onto that single natural abstraction, and define its entire value system over it, is analogous to thinking that animals would explicitly optimize for gas-compression, or humans for predator-elimination or procreation.
I instead strongly expect that the story would just repeat. The training process (or whatever process spits out the AGI) would end up creating some extremely specific conditions in which the AGI is learning the values. Its values would then necessarily be some complicated functions over weird mixes of the abstractions-natural-to-the-dataset-it’s-trained-on, with their specifics being highly contingent on some invisible-to-us details of that process.
It would not be just “eudaimonia”, it’d be some weird nonlinear function of eudaimonia and a random grab-bag of other things, including the “Beta-specific” abstractions that formed within the AGI over the course of training. And the output would not necessarily have anything to do with “eudaimonia” in any recognizable way, the way “avoid predators” is unrecognizable in terms of “rocks” and “aerodynamics”, and “human values” are unrecognizable in terms of “avoid predators” or “maximize children”.
I feel like the difference between the Alpha-and-Beta examples and my examples is mediated by your examples having basically no control over Beta’s data at all, and my examples having far more control over what data is learned by the AI.
I think the key crux is whether we have much more control over AI data sources than evolution.
If I agreed with you that we would have essentially no control on what data the AI has, I’d be a lot more worried, but I don’t think this is true, and I expect future AIs, including AGIs, to be a lot more built than grown, and for a lot of their data to be very carefully controlled via synthetic data, for simple capabilities reasons, but this can also be used for alignment strategies.
I think another disagreement is I basically don’t buy the evolution analogy for DL, and I think there are some deep disanalogies (the big one for now is again how much more control over data sources than evolution, and this is only set to increase with synthetic data).
So I basically don’t expect this to happen:
I instead strongly expect that the story would just repeat. The training process (or whatever process spits out the AGI) would end up creating some extremely specific conditions in which the AGI is learning the values. Its values would then necessarily be some complicated functions over weird mixes of the abstractions-natural-to-the-dataset-it’s-trained-on, with their specifics being highly contingent on some invisible-to-us details of that process.
Pretty much all of your examples rely on Alpha being unable to control the data learnt by Beta, and if this isn’t the case, your examples break down.
I don’t think the way you split things up into Alpha and Beta quite carves things at the joints. If you take an individual human as Beta, then stuff like “eudaimonia” is in Alpha—it’s a concept in the cultural environment that we get exposed to and sometimes come to value. The vast majority of an individual human’s values are not new abstractions that we develop over the course of our training process (for most people at least).
Basically people tend to value stuff they perceive in the biophysical environment and stuff they learn about through the social environment.
So that reduces the complexity of the problem—it’s not a matter of designing a learning algorithm that both derives and comes to value human abstractions from observations of gas particles or whatever. That’s not what humans do either.
Okay then, why aren’t we star-maximizers or number-of-nation-states maximizers? Obviously it’s not just a matter of learning about the concept. The details of how we get values hooked up to an AGI’s motivations will depend on the particular AGI design, but will probably involve reward, prompting, scaffolding, or the like.
Eh, the way I phrased that statement, I’d actually meant that an AGI aligned to human values would also be a subject of AGI-doom arguments, in the sense that it’d exhibit instrumental convergence, power-seeking, et cetera. It wouldn’t do that in the domains where that’d be at odds with its values – for example, in cases where that’d be violating human agency – but that’s true of all other AGIs as well. (A paperclip-maximizer wouldn’t erase its memory of what “a paperclip” is to free up space for combat plans.)
In particular, that statement certainly wasn’t intended as a claim that an aligned AGI is impossible. Just that its internal structure would likely be that of an embedded agent, and that if the free parameter of its values were changed, it’d be an extinction threat.
And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they’re claims that any “artificial generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent.
These “coherence theorems” rest on very shaky ground. See this post. I don’t think it is appropriate to merely gesture at the existence of such theorems, as if they are widely accepted as unproblematic, without pointing to a specific one. There are other, informal, ways to make a similar point, but acting as if there are well-established, substantial proofs of the matter is not justified.
That post is clickbait. It only argues that the incompleteness money pump doesn’t work. The reasons the incompleteness money pump does work are well summarized at a high level here or here (more specifically here and here, if we get into details).
Admittedly, I can’t judge all the technical details. But I notice that neither So8res nor Wentworth have engaged with the EJT post (neither directly in the comments nor in their posts you have linked), despite being published later. And EJT’s engagement with the Wentworth post didn’t elicit much of a reaction either. So from an outside view, the viability of coherence arguments seems questionable.
My third link is down the thread from your link. I agree that from an outside view it’s difficult to work out who is right. Unfortunately in this case one has to actually work through the details.
This is an interesting historical perspective… But it’s not really what the fundamental case for AGI doom routes through. In particular: AGI doom is not about “AI systems”, as such.
AGI doom is, specifically, about artificial generally intelligent systems capable of autonomously optimizing the world the way humans can, and who are more powerful at this task than humans. The AGI-doom arguments do not necessarily have anything to do with the current SoTA ML models.
Case in point: A manually written FPS bot is technically “an AI system”. However, I think you’d agree that the AGI-doom arguments were never about this type of system, despite it falling under the broad umbrella of “an AI system”.
Similarly, if a given SoTA ML model architecture fails to meet the definition of “a generally intelligent system capable of autonomously optimizing the world the way humans can”, then the AGI doom is not about it. The details of its workings, therefore, have little to say, one way or another, about the AGI doom.
Why are the AGI-doom concerns extended to the current AI-capabilities research, then, if the SoTA models don’t fall under said concerns? Well, because building artificial generally intelligent systems is something the AGI labs are specifically and deliberately trying to do. Inasmuch as the SoTA models are not the generally intelligent systems that are within the remit of the AGI-doom arguments, and are instead some other type of systems, the current AGI labs view this as their failure that they’re doing their best to “fix”.
And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they’re claims that any “artificial generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent. Which, in turn, means that any system that has the set of capabilities the AI researchers ultimately want their AI models to have, would inevitably have a set of potentially omnicidal failure modes.
In other words: The set of AI systems defined by “a generally intelligent world-optimization-capable agent”, and the set of AI systems defined by “the subject of fundamental AGI-doom arguments”, is the same set of systems. You can’t have the former without the latter. And the AI industry wants the former; therefore, the arguments go, it will unleash the latter on the world.
While, yes, the current SoTA models are not subjects of the AGI doom arguments, that doesn’t matter, because the current SoTA models are incidental research artefacts that are produced on AI industry’s path to building an AGI. The AGI-doom arguments apply to the endpoint of that process, not the messy byproducts.
So any evidence we uncover about how the current models are not dangerous the way AGI-doom arguments predict AGIs to be dangerous, is just evidence that they’re not AGI yet. It’s not evidence that AGI would not be dangerous. (Again: FPS bots’ non-dangerousness isn’t evidence that AGI would be non-dangerous.)
(I’d written some more about this topic here. See also gwern’s Why Tool AIs Want to Be Agent AIs for more arguments regarding why AI research’s endpoint would be an AI agent, instead of something as harmless and compliant as the contemporary models.)
Counterarguments to AGI-doom arguments that focus on pointing to the SoTA models, as such, miss the point. Actual counterarguments would instead find some way to argue that “generally intelligent world-optimizing agents” and “subjects of AGI-doom arguments” are not the exact same type of system; that you can, in theory, have the former without the latter. I have not seen any such argument, and the mathematical noose around them is slowly tightening (uh, by which I mean: their impossibility may be formally provable).
There is a difference between the claim that powerful agents are approximately well-described as being expected utility maximizers (which may or may not be true) and the claim that AGI systems will have an explicit utility function the moment they’re turned on, and maximize that function from that moment on.
I think this is the assumption OP is pointing out: “most of the book’s discussion of AI risk frames the AI as having a certain set of goals from the moment it’s turned on, and ruthlessly pursuing those to the best of its ability”. “From the moment it’s turned on” is pretty important, because it rules out value learning as a solution
If you drop the “artificially” from the claim, you are left with a claim that any “generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent. Do you endorse that claim, or do think that there is some particular reason a biological or hybrid generally intelligent system capable of autonomously optimizing the world the way a human or an organization based on humans might not be well-approximated as a game-theoretic agent?
Because humans sure don’t seem like paperclipper-style utility maximizers to me.
Yes.
Humans are indeed hybrid systems. But I would say that inasmuch as they act as generally intelligent systems capable of autonomously optimizing the world in scarily powerful ways, they do act as game-theoretic agents. E. g., people who are solely focused on resource accumulation, and don’t have self-destructive vices or any distracting values they’re not willing to sacrifice to Moloch, tend to indeed accumulate power at a steady rate. At a smaller scope, people tend to succeed at those of their long-term goals that they’ve clarified for themselves and doggedly pursue; and not succeed at them if they flip-flop between different passions on a daily basis.
I’ve been meaning to do some sort of literature review solidly backing this claim, actually, but it hasn’t been a priority for me. Hmm, maybe it’d be easy with the current AI tools...
By “hybrid system” I actually meant “system composed of multiple humans plus external structure”, sorry if that was unclear. Concretely I’m thinking of things like “companies” and “countries”.
I don’t see how one gets from this observation to the conclusion that humans are well-approximated as paperclipper-style agents.
I suppose it may be worth stepping back to clarify that when I say “paperclipper-style agents”, I mean “utility maximizers whose utility function is a function of the configuration of matter at some specific time in the future”. That’s a super-finicky-sounding definition but my understanding is that you have to have a definition that looks like that if you want to use coherence theorems, and otherwise you end up saying that a rock is an agent that maximizes the utility function “behave like a rock”.
It does not seem to me that very many humans are trying to maximize the resources under their control at the time of their death, nor does it seem like the majority of the resources in the world are under the control of the few people who have decided to do that. It is the case that people who care at all about obtaining resources control a significant fraction of the resources, but I don’t see a trend where the people who care maximally about controlling resources actually control a lot more resources than the people who care somewhat about controlling resources, as long as they still have time to play a round of golf or do whatever else they enjoy.
I like this exchange and the clarifications on both sides. I’ll add my response:
You’re right that coherence arguments work by assuming a goal is about the future. But preferences over a single future timeslice is too specific, the arguments still work if it’s multiple timeslices, or an integral over time, or larger time periods that are still in the future. The argument starts breaking down only when it has strong preferences over immediate actions, and those preferences are stronger than any preferences over the future-that-is-causally-downstream-from-those-actions. But even then it could be reasonable to model the system as a coherent agent during the times when its actions aren’t determined by near-term constraints, when longer-term goals dominate.
(a relevant part of Eliezer’s recent thread is “then probably one of those pieces runs over enough of the world-model (or some piece of reality causally downstream of enough of the world-model) that It can always do a little better by expending one more erg of energy.”, but it should be read in context)
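To make that timeslice point concrete (notation is mine, just a sketch): the narrowest version scores only the world-state at one future time, but the coherence arguments go through just as well for goals over several timeslices or over a whole stretch of the future,

$$U(s_T), \qquad U(s_{T_1}, \dots, s_{T_k}), \qquad \sum_{t > t_0} \gamma^{t}\, u(s_t),$$

and they only start breaking down once preferences over the immediate action outweigh all preferences over the states causally downstream of it.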
Another missing piece here might be: The whole point of building an intelligent agent is that you know more about the future-outcomes you want than you do about the process to get there. This is the thing that makes agents useful and valuable. And it’s the main thing that separates agents from most other computer programs.
On the other hand, it does look like the anti-corrigibility results can be overcome by sometimes having strong preferences over intermediate times (i.e. over particular ways the world should go) rather than final outcomes. This does seem important in terms of alignment solutions. And it takes some steam out of the arguments that go “coherent therefore incorrigible” (or it at least should add some caveats). But this only helps us if we have a lot of control over the preferences and constraints of the agent, and it has a couple of stability properties.
Yeah, it feels like it’s getting at a crux between the “backchaining / coherence theorems / solve-for-the-equilibrium / law thinking” cluster of world models and the “OODA loop / shard theory / interpolate and extrapolate / toolbox thinking” cluster of world models.
Humans do seem to have strong preferences over immediate actions. For example, many people prefer not to lie, even if they think that lying will help them achieve their goals and they are confident that they will not get caught.
I expect that in multi-agent environments, there is significant pressure towards legibly having these kinds of strong preferences over immediate actions. As such, I expect that that structure of thing will show up in future intelligent agents, rather than being a human-specific anomaly.
I expect that agents which predictably behave in the way EY describes as “going hard” (i.e. attempting to achieve their long-term goal at any cost) will find it harder to find other agents who will cooperate with them. It’s not a binary choice between “care about process” and “care about outcomes”—it is possible and common to care about outcomes, and also to care about the process used to achieve those outcomes.
Yeah. Or strong preferences over processes (although I suppose you can frame a preference over process as a preference over there not being any intermediate time where the agent is actively executing some specific undesired behavior).
It does seem to me that “we have a lot of control over the approaches the agent tends to take” is true and becoming more true over time.
I doubt that systems trained with ML techniques have these properties. But I don’t think e.g. humans or organizations built out of humans + scaffolding have these properties either, and I have a sneaking suspicion that the properties in question are incompatible with competitiveness.
I know, sometimes they do. But if they always did then they would be pretty useless. Habits are another example. Robust-to-new-obstacles behavior tends to be driven by future goals.
Yeah, same. Although legible commitments or decision theory can serve the same purpose better, they’re probably harder to evolve because they depend on higher intelligence to be useful. The level of transparency of agents to each other and to us seems to be an important factor. Also there’s some equilibrium, e.g. in an overly honest society it pays to be a bit more dishonest, etc.
It does unfortunately seem easy and useful to learn rules like honest-to-tribe or honest-to-people-who-can-tell or honest-unless-it’s-really-important or honest-unless-I-can-definitely-get-away-with-it.
I think if you remove “at any cost”, it’s a more reasonable translation of “going hard”. It’s just attempting to achieve a long-term goal that is hard to achieve. I’m not sure what “at any cost” adds to it, but I keep on seeing people add it, or add monomaniacally, or ruthlessly. I think all of these are importing an intuition that shouldn’t be there. “Going hard” doesn’t mean throwing out your morality, or sacrificing things you don’t want to sacrifice. It doesn’t mean being selfish or unprincipled such that people don’t cooperate with you. That would defeat the whole point.
Yes!
No!
Yeah mostly true probably.
I’m talking about stability properties like “doesn’t accidentally radically change the definition of its goals when updating its world-model by making observations”. I agree properties like this don’t seem to be on the fastest path to build AGI.
Point of clarification: the type of strong preferences I’m referring to are more deontological-injunction shaped than they are habit-shaped. I expect that a preference not to exhibit the behavior of murdering people would not meaningfully hinder someone whose goal was to get very rich from achieving that goal. One could certainly imagine cases where the preference not to murder caused the person to be less likely to achieve their goals, but I don’t expect that it would be all that tightly binding of a constraint in practice, and so I don’t think pondering and philosophizing until they realize that they value one murder at exactly -$28,034,771.91 would meaningfully improve that person’s ability to get very rich.
I think there’s more to the Yudkowsky definition of “going hard” than “attempting to achieve hard long-term goals”. Take for example:
My interpretation of the specific thing that made Mossad’s actions an instance of “going hard” here was that they took actions that most people would have thought of as “off limits” in the service of achieving their goal, and that doing so actually helped them achieve it (and that it actually worked out for them—we don’t generally say that Elizabeth Holmes “went hard” with Theranos). The supply chain attack in question does demonstrate significant technical expertise, but it also demonstrates a willingness to risk provoking parties that were uninvolved in the conflict in order to achieve their goals.
Perhaps instead of “attempting to achieve the goal at any cost” it would be better to say “being willing to disregard conventions and costs imposed on uninvolved parties, if considering those things would get in the way of the pursuit of the goal”.
I suspect we may be talking past each other here. Some of the specific things I observe:
RLHF works pretty well for getting LLMs to output text which is similar to text which has been previously rated as good, and dissimilar to text which has previously been rated as bad. It doesn’t generalize perfectly, but it does generalize well enough that you generally have to use adversarial inputs to get it to exhibit undesired behavior—we call them “jailbreaks” not “yet more instances of bomb creation instructions”.
Along those lines, RLAIF also seems to Just Work™.
And the last couple of years have been a parade of “the dumbest possible approach works great actually” results, e.g.
“Sure, fine-tuning works, but what happens if we just pick a few thousand weights and only change those weights, and leave the rest alone?” (Answer: it works great)
“I want outputs that are more like thing A and less like thing B, but I don’t want to spend a lot of compute on fine-tuning. Can I just compute both sets of activations and subtract the one from the other?” (Answer: Yep! See the sketch after this list.)
“Can I ask it to write me a web application from a vague natural language description, and have it make reasonable choices about all the things I didn’t specify” (Answer: astonishing amounts of yes)
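Here is the sketch mentioned above: a minimal version of the “subtract the activations” trick (activation steering), assuming a HuggingFace causal LM. The model name, layer index, steering scale, and prompts are all placeholder choices of mine, not anything from the results being described.

```python
# Minimal activation-steering sketch: build a steering vector from the
# difference of activations on two contrastive prompts, then add it to the
# residual stream during generation. All specific choices are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
layer_idx = 6                            # some middle layer, picked arbitrarily
layer = model.transformer.h[layer_idx]

def mean_hidden(prompt: str) -> torch.Tensor:
    # Mean hidden state at the chosen layer for a prompt.
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer i lives at index i+1.
    return out.hidden_states[layer_idx + 1].mean(dim=1)

# "More like thing A, less like thing B": the difference of the two activations.
steer = mean_hidden("I love talking about weddings.") - mean_hidden("I hate talking about weddings.")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    return (output[0] + 4.0 * steer,) + output[1:]

handle = layer.register_forward_hook(add_steering)
prompt = tok("I went to the park and", return_tensors="pt")
print(tok.decode(model.generate(**prompt, max_new_tokens=30)[0]))
handle.remove()
```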
Take your pick of the top chat-tuned LLMs. If you ask it about a situation and ask what a good course of action would be, it will generally give you pretty sane answers.
So from that, I conclude:
We have LLMs which understand human values, and can pretty effectively judge how good things are according to those values, and output those judgements in a machine-readable format
We are able to tune LLMs to generate outputs that are more like the things we rate as good and less like the things we rate as bad
Put that together and that says that, at least at the level of LLMs, we do in fact have AIs which understand human morality and care about it to the extent that “care about” is even the correct abstraction for the kind of thing they do.
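As a toy illustration of the “machine-readable judgements” point, one can literally just ask a chat model for a structured verdict. This is only a sketch; the model name and rubric below are made up by me for illustration, not something from this thread.

```python
# Toy "LLM as value judge" sketch: ask a chat model to score an action by
# ordinary human moral standards and return the verdict as JSON.
import json
from openai import OpenAI

client = OpenAI()

def judge(action_description: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any reasonably capable chat model
        messages=[
            {"role": "system",
             "content": ("Rate the described action from -10 (clearly wrong) to 10 "
                         "(clearly good) by ordinary human moral standards. "
                         'Reply with JSON: {"score": <int>, "reason": "<one sentence>"}')},
            {"role": "user", "content": action_description},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

print(judge("Returning a lost wallet to its owner with the cash still inside."))
```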
I expect this to continue to be true in the future, and I expect that our toolbox will get better faster than the AIs that we’re training get more capable.
What observations lead you to suspect that this is a likely failure mode?
Yeah I’m on board with deontological-injunction shaped constraints. See here for example.
Nah I still disagree. I think part of why I’m interpreting the words differently is because I’ve seen them used in a bunch of places e.g. the lightcone handbook to describe the lightcone team. And to describe the culture of some startups (in a positively valenced way).
Being willing to be creative and unconventional—sure, but this is just part of being capable and solving previously unsolved problems. But disregarding conventions that are important for cooperation that you need to achieve your goals? That’s ridiculous.
Being willing to impose costs on uninvolved parties can’t be what is implied by ‘going hard’ because that depends on the goals. An agent that cares a lot about uninvolved parties can still go hard at achieving its goals.
Unfortunately we are not. I appreciate the effort you put into writing that out, but that is the pattern that I understood you were talking about, I just didn’t have time to write out why I disagreed.
This is the main point where I disagree. The reason I don’t buy the extrapolation is that there are some (imo fairly obvious) differences between current tech and human-level researcher intelligence, and those differences appear like they should strongly interfere with naive extrapolation from current tech. Tbh I thought things like o1 or alphaproof might cause the people who naively extrapolate from LLMs to notice some of these, because I thought they were simply overanchoring on current SoTA, and since the SoTA has changed I thought they would update fast. But it doesn’t seem to have happened much yet. I am a little confused by this.
I didn’t say likely; it’s more an example of an issue that keeps coming up when I try to design ways to solve other problems. Maybe see here for instabilities in trained systems, or here for more about that particular problem.
I’m going to drop out of this conversation now, but it’s been good, thanks! I think there are answers to a bunch of your claims in my misalignment and catastrophe post.
I largely concur, but I think the argument is simpler and more intuitive. I want to boil this down a little and try to state it in plainer language:
Arguments for doom as a default apply to any AI that has unbounded goals and pursues those goals more competently than humans. Maximization, coherence, etc. are not central pieces.
Current AI doesn’t really have goals, so it’s not what we’re worried about. But we’ll give AI goals, because we want agents to get stuff done for us, and giving them goals seems necessary for that. All of the concerns for the doom argument will apply to real AI soon enough.
However, current AI systems may suggest a route to AGI that dodges some of the more detailed doom arguments. Their relative lack of inherent goal-directedness and relative skill at following instructions true to their intent (and the human values behind them) may be cause for guarded optimism. One of my attempts to explain this is The (partial) fallacy of dumb superintelligence.
In a different form, the doom as default argument is:
IF an agent is smarter/more competent than you and
Has goals that conflict with yours
It will outsmart you somehow, eventually (probably soon)
It will achieve its goals and you will correspondingly not achieve yours
If its goals are unbounded and it “cares” about your goals near zero,
You will lose everything
Arguments that “we’re training it on human data so it will care about our values above zero” are extremely speculative. They could be true, but betting the future of humanity on it without thinking it through seems very, very foolish.
That’s my attempt at the simplest form of the doom-by-default argument.
Just to point out the one distinction: I make no reference to game-theoretic agents or coherence theorems. I think these are unnecessary distractions from the core argument. An agent that has weird and conflicting goals (and so isn’t coherent or a perfect game-theoretic agent) will still take all of your stuff if its set of goals and values doesn’t weigh human property rights or human wellbeing very highly. That’s why we take the alignment problem to be the central problem in surviving AGI.
The other question implicit in this post was: why would we make AI less safe than current systems, which would remain pretty safe even if they were a lot smarter?
Asking why in the world humans would make AI with its own goals is like asking why in the world we’d create dynamite, much less nukes: because it will help humans accomplish their goals, until it doesn’t; and it’s as easy as calling your safe oracle AI (e.g., really good LLM) repeatedly with “what would an agent trying to accomplish X do with access to tools Y?” and passing the output to those tools. Agency is a one-line extension, and we’re not going to just not bother.
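To illustrate how short that extension really is, here’s a minimal sketch of the wrap-an-oracle-in-a-loop pattern being described; llm() and the tools dict are stand-ins I’m inventing for illustration, not any particular product’s API:

```python
# Minimal "oracle + loop + tools = agent" sketch.
def llm(prompt: str) -> str:
    """Placeholder for a call to your 'safe oracle' model."""
    raise NotImplementedError  # fill in with a real model call

def agent(goal: str, tools: dict, steps: int = 10) -> str:
    history = ""
    for _ in range(steps):
        suggestion = llm(
            f"What would an agent trying to accomplish {goal!r} do next, "
            f"given access to tools {list(tools)} and this history?\n{history}"
        )
        # Assume the oracle answers in a simple "tool: argument" format.
        tool_name, _, arg = suggestion.partition(":")
        result = tools.get(tool_name.strip(), lambda a: "unknown tool")(arg.strip())
        history += f"\n> {suggestion}\n{result}"
    return history
```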
I like your comment, but I do want to comment on this:
has evidence against it, fortunately for us.
I summarize the evidence for the pretty large similarities between the human brain and current DL systems, which allow us to transport insights from AI into neuroscience and vice versa, here:
https://x.com/SharmakeFarah14/status/1837528997556568523
But the point here is that one of the lessons from AI that is likely to transfer over to human values is that the data matters way more than the algorithm, optimizer, architecture, or hyperparameter choices.
I don’t go as far as this link does in claiming that “the it in AI models is the dataset”, but I think a weaker version of this is basically right, and thus the bitter lesson holds for human values too:
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
Other than that quote, I basically agree with the rest of your helpful comment here.
While 2024 SoTA models are not capable of autonomously optimizing the world, they are really smart, perhaps 1⁄2 or 2⁄3 of the way there, and already beginning to make big impacts on the economy. As I said in response to your original post, because we don’t have 100% confidence in the coherence arguments, we should take observations about the coherence level of 2024 systems as evidence about how coherent the 203X autonomous corporations will need to be. Evidence that 2024 systems are not dangerous is both evidence that they are not AGI and evidence that AGI need not be dangerous.
I would agree with you if the coherence arguments were specifically about autonomously optimizing the world and not about autonomously optimizing a Go game or writing 100-line programs, but this doesn’t seem to be the case.
This is just a conjecture, and there has not really been significant progress on the agent-like structure conjecture. I don’t think it’s fair to say we’re making good progress on a proof.
This might be fine if proving things about the internal structure of an agent is overkill and we just care about behavior? In this world what the believers in coherence really need to show is that almost all agents getting sufficiently high performance on sufficiently hard tasks score high on some metric of coherence. Then for the argument to carry through you need to show they are also high on some metric of incorrigibility, or fragile to value misspecification. None of the classic coherence results quite hit this.
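Schematically, the claim they’d need looks something like this (the symbols are mine and purely illustrative): for sufficiently hard task families $\mathcal{T}$ and for almost all policies $\pi$,

$$\mathrm{Perf}_{\mathcal{T}}(\pi) \ge p \;\Rightarrow\; \mathrm{Coh}(\pi) \ge c(p) \qquad \text{and} \qquad \mathrm{Coh}(\pi) \ge c(p) \;\Rightarrow\; \mathrm{Incorr}(\pi) \ge i(p),$$

with $c(p)$ and $i(p)$ growing as the performance threshold $p$ does.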
However AFAIK @Jeremy Gillen does not think we can get an argument with exactly this structure (the main argument in his writeup is a bit different), and Eliezer has historically and recently made the argument that EU maximization is simple and natural. So maybe you do need this argument that an EU maximization algorithm is simpler than other algorithms, which seems like it needs some clever way to formalize it, because proving things about the space of all simple programs seems too hard.
I think you are mischaracterizing my beliefs here.
This seems right to me. Maybe see my comment further up, I think it’s relevant to arguments we’ve had before.
We can’t say much about the detailed internal structure of an agent, because there’s always a lot of ways to implement an algorithm. But we do only care about (generalizing) behavior, so we only need some very abstract properties relevant to that.
This is my crux with people who have 90+% P(doom): will vNM expected utility maximization be a good approximation of the behavior of TAI? You argue that it will, but I expect that it won’t.
My thinking related to this crux is informed less by the behaviors of current AI systems (although they still influence it to some extent) than by the failure of the agent foundations agenda. The dream 10 years ago was that if we started by modeling AGI as an vNM expected utility maximizer, and then gradually added more and more details to our model to account for differences between the idealized model and real-world AI systems, we would end up with an accurate theoretical system for predicting the behaviors AGI would exhibit. It would be a similar process to how physicists start with an idealized problem setup and add in details like friction or relativistic corrections.
But that isn’t what ended up happening. Agent foundations researchers ended up getting stuck on the cluster of problems collectively described as embedded agency, unable to square the dualistic assumptions of expected utility theory and Bayesianism with the embedded structure of real-world AI systems. The sub-problems of embedded agency are many and too varied to allow one elegant theorem to fix everything. Instead, they point to a fundamental flaw in the expected utility maximizer model, suggesting that it isn’t as widely applicable as early AI safety researchers thought.
The failure of the agent foundations agenda has led me to believe that expected utility maximization is only a good approximation for mostly-unembedded systems, and that an accurate theoretical model of advanced AI behavior (if such a thing is possible) would require a fundamentally different, less dualistic set of concepts. Coherence theorems and decision-theoretic arguments still rely on the old, unembedded assumptions and therefore don’t provide an accurate predictive model.
I agree that the agent-foundations research has been somewhat misaimed from the start, but I buy this explanation of John’s regarding where it went wrong and how to fix it. Basically, what we need to figure out is a theory of embedded world-modeling, which would capture the aspect of reality where the universe naturally decomposes into hierarchically arranged sparsely interacting subsystems. Our agent would then be a perfect game-theoretic agent, but defined over that abstract (and lazy) world-model, rather than over the world directly.
This would take care of agents needing to be “bigger” than the universe, counterfactuals, the “outside-view” problem, the realizability and the self-reference problems, the problem of hypothesis spaces, and basically everything else that’s problematic about embedded agency.
A theory of embedded world-modeling would be an improvement over current predictive models of advanced AI behavior, but it wouldn’t be the whole story. Game theory makes dualistic assumptions too (e.g., by treating the decision process as not having side effects), so we would also have to rewrite it into an embedded model of motivation.
Cartesian frames are one of the few lines of agent foundations research in the past few years that seem promising, due to allowing for greater flexibility in defining agent-environment boundaries. Preferably, we would have a model that lets us avoid having to postulate an agent-environment boundary at all. Combining a successor to Cartesian frames with an embedded theory of motivation, likely some form of active inference, might give us an accurate overarching theory of embedded behavior.
It turns out that, in an idealized model of intelligent AI, we can remove the dualistic assumptions of game theory by instead positing a reflective oracle. The reflective oracle is allowed randomness in the territory (not just uncertainty in the map) in order to prevent paradoxes, and in particular the reflective oracle’s randomized answers are exactly the Nash equilibria of game theory: there is a one-to-one correspondence between reflective oracles and Nash equilibria.
Of course, whether it can transfer to our reality at all is pretty sketchy at best, but at least there is a solution at all:
https://arxiv.org/abs/1508.04145
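(For concreteness, the defining property of a reflective oracle $O$ in that paper, restated from memory so the details may be slightly off: for any probabilistic oracle machine $M$ and rational $p \in [0,1]$,

$$\Pr[M^{O}() = 1] > p \;\Rightarrow\; O(M, p) = 1 \qquad \text{and} \qquad \Pr[M^{O}() = 1] < p \;\Rightarrow\; O(M, p) = 0,$$

with $O$ free to answer randomly when $\Pr[M^{O}() = 1] = p$ exactly; that licensed randomness is what blocks the usual self-reference paradoxes.)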
The reflective oracle model doesn’t have all the properties I’m looking for—it still has the problem of treating utility as the optimization target rather than as a functional component of an iterative behavior reinforcement process. It also treats the utilities of different world-states as known ahead of time, rather than as the result of a search process, and assumes that computation is cost-free. To get a fully embedded theory of motivation, I expect that you would need something fundamentally different from classical game theory. For example, it probably wouldn’t use utility functions.
Re treating utility as the optimization target: I think this isn’t, properly speaking, an embedded-agency problem, but rather an empirical question of what the first AIs that automate everything will look like algorithmically, since there are algorithms embeddable in reality that do optimize the utility/reward, like MCTS, while TurnTrout limits his post to the model-free policy-gradient case, like PPO and REINFORCE.
TurnTrout is correct to point out that not all RL algorithms optimize for the reward, and that reward isn’t what the agent optimizes for by definition, but I think the post is too limited in describing when RL does optimize for the utility/reward.
So I think the biggest difference between @TurnTrout and people like @gwern et al is whether model-based RL that does planning, or model-free policy-gradient RL, comes to dominate AI progress over the next decade.
Agreed that treating the utilities of different world-states as known, and treating computation as cost-free, makes it a very unrealistic model for human beings. Something like the reflective oracle model would only be a possibility if we warped the laws of physics severely enough that we didn’t have to care about the cost of computation at all (which would let us go from treating utilities as unknown to known in one step), and that is exactly why I don’t expect the reflective oracle model to transfer to reality at all.
We could maybe weaken this requirement? Perhaps it would suffice to show/argue that it’s feasible[1] to build any kind of “acute-risk-period-ending AI”[2] that is not a “subject of AGI-doom arguments”?
I’d be (very) curious to see such arguments. [3]
[1] within time constraints, before anyone else builds a “subject of AGI-doom arguments”
[2] or, “AIs that implement humanity’s CEV”
[3] If I became convinced that it’s feasible to build such a “pivotal AI” that is not “subject to AGI doom arguments”, I think that would shift a bunch of my probability mass from “we die due to unaligned AI” to “we die-or-worse due to misaligned humans controlling ASI” and “utopia”.
My short answer is that the argument would consist of: human values are quite simple and are most likely a reasonably natural abstraction, and the felt complexity comes from adding up both the complexity of the generators and the complexity of the data, which people wouldn’t do for AI capabilities; meaning the bitter lesson holds for human values and morals as well.
Also, the way an AI is aligned depends far more on the data it is given, and our control over synthetic data means we can get AIs that follow human values before they get too capable to take over everything. Evolutionary psychology mispredicted this and the above point pretty hard, making it lose many Bayes points compared to the Universal Learning Machine/Blank Slate hypotheses.
Alignment generalizes further than capabilities, for pretty deep reasons, contra Nate Soares: basically, it’s way easier to have an AI care about human values than it is to get it to be capable in real-world domains, combined with verification being easier than generation.
Finally, there is evidence that AIs are far more robust to errors than people thought 15-20 years ago.
In essence, it’s a negation of the following:
Fragility and Complexity of Value
Pretty much all of evolutionary psychology literature.
Capabilities generalizing further than alignment.
The Sharp Left Turn.
Mm, there are two somewhat different definitions of what counts as “a natural abstraction”:
I would agree that human values are likely a natural abstraction in the sense that if you point an abstraction-learning algorithm at the dataset of modern humans doing things, “human values” and perhaps even “eudaimonia” would fall out as a natural principal component of that dataset’s decomposition.
What I wouldn’t agree with is that human values are a natural abstraction in the sense that a mind pointed at the dataset of this universe doing things, or at the dataset of animals doing things, or even at the dataset of prehistoric or medieval humans doing things, would learn modern human values.
Let’s step back a bit.
Suppose we have a system Alpha and a system Beta, with Beta embedded in Alpha. Alpha starts out with a set of natural abstractions/subsystems. Beta, if it’s an embedded agent, learns these abstractions, and then starts executing actions within Alpha that alter its embedding environment. Over the course of that, Beta creates new subsystems, corresponding to new abstractions.
As concrete examples, you can imagine:
The lifeless universe as Alpha (with abstractions like “stars”, “gasses”, “seas”), and the biosphere as Beta (creating abstractions like “organisms” and “ecosystems” and “predator” and “prey”).
The biosphere as Alpha (with abstractions like “food” and “species”) and the human civilization as Beta (with abstractions like “luxury” and “love” and “culture”).
Notice one important fact: the abstractions Beta creates are not, in general, easy to predict from the abstractions already in Alpha. “A multicellular organism” or “an immune-system virus” do not naturally fall out of descriptions of geological formations and atmospheric conditions. They’re highly contingent abstractions, ones that are very sensitive to the exact conditions in which they formed. (Biochemistry, the broader biosphere the system is embedded in...)
Similarly, things like “culture” or “eudaimonia” or “personal identity”, the way humans understand them, don’t easily fall out of even the abstractions present in the biosphere. They’re highly contingent on the particulars of how human minds and bodies are structured, how they exchange information, et cetera.
In particular: humans, despite being dropped into an abstraction-rich environment, did not learn values that just mirror some abstraction present in the environment. We’re not wrapper-minds single-mindedly pursuing procreation, or the eradication of predators, or the maximization of the number of stars. Similarly, animals don’t learn values like “compress gasses”.
What Beta creates are altogether new abstractions defined in terms of complicated mixes of Alpha’s abstractions. And if Beta is the sort of system that learns values, it learns values that wildly mix the abstractions present in Beta. These new abstractions are indeed then just some new natural abstraction. But they’re not necessarily “simple” in terms of Alpha’s abstractions.
And now we come to the question of what values an AGI would learn. I would posit that, on the current ML paradigm, the setup is the basic Alpha-and-Beta setup, with the human civilization being Alpha and the AGI being Beta.
Yes, there are some natural abstractions in Alpha, like “eudaimonia”. But to think that the AGI would just naturally latch onto that single natural abstraction, and define its entire value system over it, is analogous to thinking that animals would explicitly optimize for gas-compression, or humans for predator-elimination or procreation.
I instead strongly expect that the story would just repeat. The training process (or whatever process spits out the AGI) would end up creating some extremely specific conditions in which the AGI is learning the values. Its values would then necessarily be some complicated functions over weird mixes of the abstractions-natural-to-the-dataset-it’s-trained-on, with their specifics being highly contingent on some invisible-to-us details of that process.
It would not be just “eudaimonia”, it’d be some weird nonlinear function of eudaimonia and a random grab-bag of other things, including the “Beta-specific” abstractions that formed within the AGI over the course of training. And the output would not necessarily have anything to do with “eudaimonia” in any recognizable way, the way “avoid predators” is unrecognizable in terms of “rocks” and “aerodynamics”, and “human values” are unrecognizable in terms of “avoid predators” or “maximize children”.
I feel like the difference between your Alpha-and-Beta examples and my examples is mediated by your examples having basically no control over Beta’s data at all, while my examples have far more control over what data is learned by the AI.
I think the key crux is whether we have much more control over AI data sources than evolution.
If I agreed with you that we would have essentially no control over what data the AI gets, I’d be a lot more worried, but I don’t think this is true. I expect future AIs, including AGIs, to be a lot more built than grown, and a lot of their data to be very carefully controlled via synthetic data, for simple capabilities reasons, though this can also be used for alignment strategies.
I think another disagreement is that I basically don’t buy the evolution analogy for DL; I think there are some deep disanalogies (the big one for now is, again, how much more control we have over data sources than evolution did, and this is only set to increase with synthetic data).
So I basically don’t expect this to happen:
Pretty much all of your examples rely on the Alpha being unable to control the data learnt by Beta, and if this isn’t the case, your examples break down.
I don’t think the way you split things up into Alpha and Beta quite carves things at the joints. If you take an individual human as Beta, then stuff like “eudaimonia” is in Alpha—it’s a concept in the cultural environment that we get exposed to and sometimes come to value. The vast majority of an individual human’s values are not new abstractions that we develop over the course of our training process (for most people at least).
Basically people tend to value stuff they perceive in the biophysical environment and stuff they learn about through the social environment.
So that reduces the complexity of the problem—it’s not a matter of designing a learning algorithm that both derives and comes to value human abstractions from observations of gas particles or whatever. That’s not what humans do either.
Okay then, why aren’t we star-maximizers or number-of-nation-states maximizers? Obviously it’s not just a matter of learning about the concept. The details of how we get values hooked up to an AGI’s motivations will depend on the particular AGI design, but will probably involve reward, prompting, scaffolding, or the like.
Eh, the way I phrased that statement, I’d actually meant that an AGI aligned to human values would also be a subject of AGI-doom arguments, in the sense that it’d exhibit instrumental convergence, power-seeking, et cetera. It wouldn’t do that in the domains where that’d be at odds with its values – for example, in cases where that’d be violating human agency – but that’s true of all other AGIs as well. (A paperclip-maximizer wouldn’t erase its memory of what “a paperclip” is to free up space for combat plans.)
In particular, that statement certainly wasn’t intended as a claim that an aligned AGI is impossible. Just that its internal structure would likely be that of an embedded agent, and that if the free parameter of its values were changed, it’d be an extinction threat.
These “coherence theorems” rest on very shaky ground. See this post. I don’t think it is appropriate to merely gesture at the existence of such theorems, as if they are widely accepted as unproblematic, without pointing to a specific one. There are other, informal, ways to make a similar point, but acting as if there are well-established, substantial proofs of the matter is not justified.
That post is clickbait. It only argues that the incompleteness money pump doesn’t work. The reasons the incompleteness money pump does work are well summarized at a high level here or here (more specifically here and here, if we get into details).
Admittedly, I can’t judge all the technical details. But I notice that neither So8res nor Wentworth have engaged with the EJT post (neither directly in the comments nor in the posts you have linked), despite it being published later. And EJT’s engagement with the Wentworth post didn’t elicit much of a reaction either. So from an outside view, the viability of coherence arguments seems questionable.
My third link is down the thread from your link. I agree that from an outside view it’s difficult to work out who is right. Unfortunately in this case one has to actually work through the details.