I like this exchange and the clarifications on both sides.
Yeah, it feels like it’s getting at a crux between the “backchaining / coherence theorems / solve-for-the-equilibrium / law thinking” cluster of world models and the “OODA loop / shard theory / interpolate and extrapolate / toolbox thinking” cluster of world models.
You’re right that coherence arguments work by assuming a goal is about the future. But preferences over a single future timeslice is too specific: the arguments still work if it’s multiple timeslices, or an integral over time, or larger time periods that are still in the future. The argument starts breaking down only when the agent has strong preferences over immediate actions, and those preferences are stronger than any preferences over the future-that-is-causally-downstream-from-those-actions.
Humans do seem to have strong preferences over immediate actions. For example, many people prefer not to lie, even if they think that lying will help them achieve their goals and they are confident that they will not get caught.
I expect that in multi-agent environments, there is significant pressure towards legibly having these kinds of strong preferences over immediate actions. As such, I expect that that structure of thing will show up in future intelligent agents, rather than being a human-specific anomaly.
But even then it could be reasonable to model the system as a coherent agent during the times when its actions aren’t determined by near-term constraints, when longer-term goals dominate. [...] The whole point of building an intelligent agent is that you know more about the future-outcomes you want than you do about the process to get there.
I expect that agents which predictably behave in the way EY describes as “going hard” (i.e. attempting to achieve their long-term goal at any cost) will find it harder to find other agents who will cooperate with them. It’s not a binary choice between “care about process” and “care about outcomes”—it is possible and common to care about outcomes, and also to care about the process used to achieve those outcomes.
On the other hand, it does look like the anti-corrigibility results can be overcome by sometimes having strong preferences over intermediate times (i.e. over particular ways the world should go) rather than final-outcomes.
Yeah. Or strong preferences over processes (although I suppose you can frame a preference over process as a preference over there not being any intermediate time where the agent is actively executing some specific undesired behavior).
But this only helps us if we have a lot of control over the preferences&constraints of the agent,
It does seem to me that “we have a lot of control over the approaches the agent tends to take” is true and becoming more true over time.
and it has a couple of stability properties.
I doubt that systems trained with ML techniques have these properties. But I don’t think e.g. humans or organizations built out of humans + scaffolding have these properties either, and I have a sneaking suspicion that the properties in question are incompatible with competitiveness.
Humans do seem to have strong preferences over immediate actions.
I know, sometimes they do. But if they always did then they would be pretty useless. Habits are another example. Robust-to-new-obstacles behavior tends to be driven by future goals.
I expect that in multi-agent environments, there is significant pressure towards legibly having these kinds of strong preferences over immediate actions. As such, I expect that that structure of thing will show up in future intelligent agents, rather than being a human-specific anomaly.
Yeah same. Although legible commitments or decision theory can serve the same purpose better, it’s probably harder to evolve because it depends on higher intelligence to be useful. The level of transparency of agents to each other and to us seems to be an important factor. Also there’s some equilibrium, e.g. in an overly honest society it pays to be a bit more dishonest, etc.
It does unfortunately seem easy and useful to learn rules like honest-to-tribe or honest-to-people-who-can-tell or honest-unless-it’s-really-important or honest-unless-I-can-definitely-get-away-with-it.
attempting to achieve their long-term goal at any cost
I think if you remove “at any cost”, it’s a more reasonable translation of “going hard”. It’s just attempting to achieve a long-term goal that is hard to achieve. I’m not sure what “at any cost” adds to it, but I keep on seeing people add it, or add monomaniacally, or ruthlessly. I think all of these are importing an intuition that shouldn’t be there. “Going hard” doesn’t mean throwing out your morality, or sacrificing things you don’t want to sacrifice. It doesn’t mean being selfish or unprincipled such that people don’t cooperate with you. That would defeat the whole point.
It’s not a binary choice between “care about process” and “care about outcomes”—it is possible and common to care about outcomes, and also to care about the process used to achieve those outcomes.
Yes!
It does seem to me that “we have a lot of control over the approaches the agent tends to take” is true and becoming more true over time.
No!
I doubt that systems trained with ML techniques have these properties. But I don’t think e.g. humans or organizations built out of humans + scaffolding have these properties either
Yeah mostly true probably.
and I have a sneaking suspicion that the properties in question are incompatible with competitiveness.
I’m talking about stability properties like “doesn’t accidentally radically change the definition of its goals when updating its world-model by making observations”. I agree properties like this don’t seem to be on the fastest path to build AGI.
[faul_sname] Humans do seem to have strong preferences over immediate actions.
[Jeremy Gillen] I know, sometimes they do. But if they always did then they would be pretty useless. Habits are another example. Robust-to-new-obstacles behavior tends to be driven by future goals.
Point of clarification: the type of strong preferences I’m referring to are more deontological-injunction shaped than they are habit-shaped. I expect that a preference not to exhibit the behavior of murdering people would not meaningfully hinder someone whose goal was to get very rich from achieving that goal. One could certainly imagine cases where the preference not to murder caused the person to be less likely to achieve their goals, but I don’t expect that it would be all that tightly binding of a constraint in practice, and so I don’t think pondering and philosophizing until they realize that they value one murder at exactly -$28,034,771.91 would meaningfully improve that person’s ability to get very rich.
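A toy sketch of the “not tightly binding” point, under loudly stated assumptions: the plans, payoffs, and the `violates` flag below are all hypothetical and chosen only for illustration. It compares treating the injunction as a hard filter with putting a price on the violation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    payoff: float    # hypothetical expected wealth, arbitrary units
    violates: bool   # True if the plan breaks the injunction


# Made-up plans; none of these numbers come from the discussion.
PLANS = [
    Plan("start a company", 100.0, False),
    Plan("index funds", 40.0, False),
    Plan("forbidden shortcut", 101.0, True),
]


def best_with_injunction(plans):
    """Treat the injunction as a hard filter, then maximize payoff."""
    allowed = [p for p in plans if not p.violates]
    return max(allowed, key=lambda p: p.payoff)


def best_with_price(plans, price):
    """Treat the violation as a cost subtracted from the payoff."""
    return max(plans, key=lambda p: p.payoff - (price if p.violates else 0.0))


unconstrained = max(PLANS, key=lambda p: p.payoff)
cost_of_injunction = unconstrained.payoff - best_with_injunction(PLANS).payoff
```

In this toy setup the injunction costs one unit out of roughly a hundred, and any price above that gap selects the same plan as the hard filter, so pinning down the exact price changes nothing about the choice.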
I think if you remove “at any cost”, it’s a more reasonable translation of “going hard”. It’s just attempting to achieve a long-term goal that is hard to achieve.
I think there’s more to the Yudkowsky definition of “going hard” than “attempting to achieve hard long-term goals”. Take for example:
@ESYudkowsky Mossad is much more clever and powerful than novices implicitly imagine a “superintelligence” will be; in the sense that, when novices ask themselves what a “superintelligence” will be able to do, they fall well short of the actual Mossad.
@ESYudkowsky Why? Because Mossad goes hard; and people who don’t go hard themselves, have no simple mental motion they can perform—no simple switch they can access—to imagine what it is actually like to go hard; and what options become available even to a mere human when you do.
My interpretation of the specific thing that made Mossad’s actions an instance of “going hard” here was that they took actions that most people would have thought of as “off limits” in the service of achieving their goal (and that doing so actually helped them achieve their goal, and that it actually worked out for them; we don’t generally say that Elizabeth Holmes “went hard” with Theranos). The supply chain attack in question does demonstrate significant technical expertise, but it also demonstrates a willingness to risk provoking parties that were uninvolved in the conflict in order to achieve their goals.
Perhaps instead of “attempting to achieve the goal at any cost” it would be better to say “being willing to disregard conventions and costs imposed on uninvolved parties, if considering those things would get in the way of the pursuit of the goal”.
[faul_sname] It does seem to me that “we have a lot of control over the approaches the agent tends to take” is true and becoming more true over time.
[Jeremy Gillen] No!
I suspect we may be talking past each other here. Some of the specific things I observe:
RLHF works pretty well for getting LLMs to output text which is similar to text which has been previously rated as good, and dissimilar to text which has previously been rated as bad. It doesn’t generalize perfectly, but it does generalize well enough that you generally have to use adversarial inputs to get it to exhibit undesired behavior—we call them “jailbreaks” not “yet more instances of bomb creation instructions”.
Along those lines, RLAIF also seems to Just Work™.
And the last couple of years have been a parade of “the dumbest possible approach works great actually” results, e.g.
“Sure, fine-tuning works, but what happens if we just pick a few thousand weights and only change those weights, and leave the rest alone?” (Answer: it works great)
“I want outputs that are more like thing A and less like thing B, but I don’t want to spend a lot of compute on fine tuning. Can I just compute both sets of activations and subtract the one from the other?” (Answer: Yep!)
“Can I ask it to write me a web application from a vague natural language description, and have it make reasonable choices about all the things I didn’t specify?” (Answer: astonishing amounts of yes)
Take your pick of the top chat-tuned LLMs. If you ask it about a situation and ask what a good course of action would be, it will generally give you pretty sane answers.
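For concreteness, the “compute both sets of activations and subtract the one from the other” trick can be sketched on a toy network. This is only a minimal illustration under assumptions: the one-layer tanh “model”, the random weights, and the made-up A-like and B-like inputs are all hypothetical stand-ins, not a real LLM or any particular library’s API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "model": one tanh hidden layer. Not a real LLM.
N_INPUT, N_HIDDEN = 4, 8
W_in = rng.normal(size=(N_INPUT, N_HIDDEN))


def hidden(x):
    """Hidden activations for inputs of shape (n, N_INPUT)."""
    return np.tanh(x @ W_in)


def steering_vector(examples_a, examples_b):
    """Mean activation on A-like inputs minus mean activation on B-like inputs."""
    return hidden(examples_a).mean(axis=0) - hidden(examples_b).mean(axis=0)


def steered_hidden(x, vec, strength=1.0):
    """Activations with the steering vector added; no weights are changed."""
    return hidden(x) + strength * vec


# Made-up stand-ins for "thing A" and "thing B" inputs.
examples_a = rng.normal(loc=1.0, size=(32, N_INPUT))
examples_b = rng.normal(loc=-1.0, size=(32, N_INPUT))
vec = steering_vector(examples_a, examples_b)

x_new = rng.normal(size=(5, N_INPUT))
base = hidden(x_new)
steered = steered_hidden(x_new, vec, strength=2.0)

# Projection onto the A-minus-B direction rises by strength * ||vec||^2.
shift = (steered - base) @ vec
```

The point of the sketch is that nothing is trained: the only intervention is adding a fixed vector to the activations at inference time, which moves every input the same distance along the A-minus-B direction.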
So from that, I conclude:
We have LLMs which understand human values, and can pretty effectively judge how good things are according to those values, and output those judgements in a machine-readable format
We are able to tune LLMs to generate outputs that are more like the things we rate as good and less like the things we rate as bad
Put that together and that says that, at least at the level of LLMs, we do in fact have AIs which understand human morality and care about it to the extent that “care about” is even the correct abstraction for the kind of thing they do.
I expect this to continue to be true in the future, and I expect that our toolbox will get better faster than the AIs that we’re training get more capable.
I’m talking about stability properties like “doesn’t accidentally radically change the definition of its goals when updating its world-model by making observations”.
What observations lead you to suspect that this is a likely failure mode?
Yeah I’m on board with deontological-injunction shaped constraints. See here for example.
Perhaps instead of “attempting to achieve the goal at any cost” it would be better to say “being willing to disregard conventions and costs imposed on uninvolved parties, if considering those things would get in the way of the pursuit of the goal”.
Nah I still disagree. I think part of why I’m interpreting the words differently is that I’ve seen them used in a bunch of places, e.g. in the Lightcone handbook to describe the Lightcone team, and to describe the culture of some startups (in a positively valenced way).
Being willing to be creative and unconventional—sure, but this is just part of being capable and solving previously unsolved problems. But disregarding conventions that are important for cooperation that you need to achieve your goals? That’s ridiculous.
Being willing to impose costs on uninvolved parties can’t be what is implied by ‘going hard’ because that depends on the goals. An agent that cares a lot about uninvolved parties can still go hard at achieving its goals.
I suspect we may be talking past each other here.
Unfortunately we are not. I appreciate the effort you put into writing that out, but that is the pattern that I understood you were talking about; I just didn’t have time to write out why I disagreed.
I expect this to continue to be true in the future
This is the main point where I disagree. The reason I don’t buy the extrapolation is that there are some (imo fairly obvious) differences between current tech and human-level researcher intelligence, and those differences look like they should strongly interfere with naive extrapolation from current tech. Tbh I thought things like o1 or AlphaProof might cause the people who naively extrapolate from LLMs to notice some of these, because I thought they were simply overanchoring on current SoTA, and since the SoTA has changed I thought they would update fast. But it doesn’t seem to have happened much yet. I am a little confused by this.
What observations lead you to suspect that this is a likely failure mode?
I didn’t say likely; it’s more an example of an issue that has come up when I try to design ways to solve other problems. Maybe see here for instabilities in trained systems, or here for more about that particular problem.
I’m going to drop out of this conversation now, but it’s been good, thanks! I think there are answers to a bunch of your claims in my misalignment and catastrophe post.