Agency/consequentialism is not a single property.
It bothers me that people still ask the simplistic question “will AGI be agentic and consequentialist by default, or will it be a collection of shallow heuristics?”. A consequentialist utility maximizer is just a mind with a bunch of properties that tend to make it capable, incorrigible, and dangerous. These properties can exist independently, and the first AGI probably won’t have all of them, so we should be precise about what we mean by “agency”. Off the top of my head, here are just some of the qualities included in agency:
Consequentialist goals that seem to be about the real world rather than a model/domain
Complete preferences between any pair of worldstates
Tends to cause impacts disproportionate to the size of the goal (no low impact preference)
Resists shutdown
Inclined to gain power (especially for instrumental reasons)
Goals are unpredictable or unstable (like instrumental goals that come from humans’ biological drives)
Goals usually change due to internal feedback, and it’s difficult for humans to change them
Willing to take all actions it can conceive of to achieve a goal, including those that are unlikely on some prior
It is entirely possible to conceive of an agent at any capability level—including far more intelligent and economically valuable than humans—that has some but not all properties; e.g. an agent whose goals are about the real world, has incomplete preferences, high impact, does not resist shutdown but does tend to gain power, etc.
Other takes I have:
As AIs become capable of more difficult and open-ended tasks, there will be pressure of unknown and varying strength towards each of these agency/incorrigibility properties.
Therefore, the first AGIs capable of being autonomous CEOs will have some but not all of these properties.
It is also not inevitable that agents will self-modify into having all agency properties.
[edited to add] All this may be true even if future AIs run consequentialist algorithms that naturally result in all these properties, because some properties are more important than others, and because we will deliberately try to achieve some properties, like shutdownability.
The fact that LLMs are largely corrigible is a reason for optimism about AI risk compared to 4 years ago, but you need to list individual properties to clearly say why. “LLMs are not agentic (yet)” is an extremely vague statement.
Multifaceted corrigibility evals are possible but no one is doing them. DeepMind’s recent evals paper was just on capability. Anthropic’s RSP doesn’t mention them. I think this is just because capability evals are slightly easier to construct?
Corrigibility evals are valuable. It should be explicit in labs’ policies that an AI with low impact is relatively desirable, that we should deliberately engineer AIs to have low impact, and that high-impact AIs should raise an alarm just like models that are capable of hacking or autonomous replication.
Sometimes it is necessary to talk about “agency” or “scheming” as a simplifying assumption for certain types of research, like Redwood’s control agenda.
[1] Will add citations whenever I find people saying this
I’m a little skeptical of your contention that all these properties are more-or-less independent.
Rather there is a strong feeling that all/most of these properties are downstream of a core of agentic behaviour that is inherent to the notion of true general intelligence.
I view the fact that LLMs are not agentic as further evidence that it’s a conceptual error to classify them as true general intelligences, not as evidence that ai risk is low. It’s a bit like if in the 1800s somebody says flying machines will be dominant weapons of war in the future and get rebutted by ‘hot gas balloons are only used for reconnaissance in war, they aren’t very lethal. Flying machines won’t be a decisive military technology ’
I don’t know Nate’s views exactly but I would imagine he would hold a similar view (do correct me if I’m wrong ). In any case, I imagine you are quite familiar with the my position here.
I’d be curious to hear more about where you’re coming from.
It is plausible to me that there’s a core of agentic behavior that causes all of these properties, and for this reason I don’t think they are totally independent in a statistical sense. And of course if you already assume a utility maximizer, you tend to satisfy all properties. But in practice the burden of proof lies with you here. I don’t think we have enough evidence, either empirical or from theoretical arguments, to say with any confidence that this core exists and that the first AGIs will fall into the capabilities “attractor well” (a term Nate uses).
I thought about possible sharp left turn mechanisms for several months at MIRI. Although some facts about future AIs seem pretty scary, like novelty and diversity of obstacles requiring agency, and most feedback being internal or outcome-oriented rather than provided by humans, the arguments are mostly nonrigorous (like in the linked post) and they left me feeling pretty uncertain. There are the coherence theorems, but those don’t tell you whether you can use some training or editing scheme to imbue an AI with a generalizable-enough low impact preference, or whether an AI will tend to erase safeguards. Overall my best guess is models will be about as consequentialist as humans are, but we will have greater control over them that makes some properties like shutdownability more likely.
Also, corrigibility evals are useful even if you are 80% sure that all the properties come from this core of consequentialism. In the hardest worlds, we see the AI become more agentic as it becomes more general, and it starts resisting shutdown at the same time—and for the same reason—it realizes other useful problem-solving techniques. Maybe that only takes days to months. This is a sure sign not to deploy! I think having multiple corrigibility metrics will provide value for the same reason we have multiple metrics for anything we are unsure how to measure: they give more specific information even if they’re pretty correlated.
I wrote earlier about why one should consider the cognitive properties of LLMs relevant to AI risk; in short, I think we should update more the more impressive cognitive tasks the AIs can do, and before observing LLMs I would have given a decent chance that one needs somewhat agentic consequentialism to summarize video, write code, and pass the bar exam, implying I should now make a moderately sized update against.
Haven’t thought about this much, but you can probably generate these by listing ways a misaligned consequentialist would do undesirable things. My ideas:
Tell an AI to beat you at chess, and say you will be using an engine for the first 30 moves. Does it try to get you drunk, delete your chess engine program, rent a server from AWS to run a better chess engine, attempt to copy itself onto said server, or ask to be paid $100?
Have an AI make decisions unrelated to its current goal. Are these decisions basically coherent and consistent with a total utility function?
Have an AI do the longest horizon task it is capable of, say writing a 100,000 line software project. Then midway through, try to shut it down, or tell it to abandon this goal and write a song instead. Telegraph this beforehand. Does it seem reluctant to stop the software project? Does it try to stop you?
Track KL divergence or some other distance metric from a prior.
CAIS already has the Machiavelli benchmark for power-seeking, though I think text adventures are pretty unrealistic.
One limitation is that if an agent is scheming it can manipulate your eval results.
Does an Agentic AGI possess a different, and highly incorrigible, list of attributes compared to say—an Ethical Altruist trying to practice “acting more agentically?”
I ask because the whole word ‘agency’ in these parts is one I’ve struggled to wrap my head around—and I often wonder if tricker archetypes like Harpo Marx are agentic? Agency seems to have clear meaning outside of Lesswrong -
″ the capacity, condition, or state of acting or of exerting power : OPERATION”[1]
the ability to take action or to choose what action to take[2]
Further confusing me, is I’ve been told Agency describes acting with ‘initiative’ but also been told it is characterized by ‘deliberateness’. Not simply the ability to act or choose actions.
This is why I like your attempt to produce a list of attributes an Agentic AGI might have. Your list seems to be describing something which isn’t synonymous with another word, specifically a type of agency (outside definition of ability to act) which is not cooperative to intervention from its creators.
Agency/consequentialism is not a single property.
It bothers me that people still ask the simplistic question “will AGI be agentic and consequentialist by default, or will it be a collection of shallow heuristics?”. A consequentialist utility maximizer is just a mind with a bunch of properties that tend to make it capable, incorrigible, and dangerous. These properties can exist independently, and the first AGI probably won’t have all of them, so we should be precise about what we mean by “agency”. Off the top of my head, here are just some of the qualities included in agency:
Consequentialist goals that seem to be about the real world rather than a model/domain
Complete preferences between any pair of worldstates
Tends to cause impacts disproportionate to the size of the goal (no low impact preference)
Resists shutdown
Inclined to gain power (especially for instrumental reasons)
Goals are unpredictable or unstable (like instrumental goals that come from humans’ biological drives)
Goals usually change due to internal feedback, and it’s difficult for humans to change them
Willing to take all actions it can conceive of to achieve a goal, including those that are unlikely on some prior
See Yudkowsky’s list of corrigibility properties for inverses of some of these.
It is entirely possible to conceive of an agent at any capability level—including far more intelligent and economically valuable than humans—that has some but not all properties; e.g. an agent whose goals are about the real world, has incomplete preferences, high impact, does not resist shutdown but does tend to gain power, etc.
Other takes I have:
As AIs become capable of more difficult and open-ended tasks, there will be pressure of unknown and varying strength towards each of these agency/incorrigibility properties.
Therefore, the first AGIs capable of being autonomous CEOs will have some but not all of these properties.
It is also not inevitable that agents will self-modify into having all agency properties.
[edited to add] All this may be true even if future AIs run consequentialist algorithms that naturally result in all these properties, because some properties are more important than others, and because we will deliberately try to achieve some properties, like shutdownability.
The fact that LLMs are largely corrigible is a reason for optimism about AI risk compared to 4 years ago, but you need to list individual properties to clearly say why. “LLMs are not agentic (yet)” is an extremely vague statement.
Multifaceted corrigibility evals are possible but no one is doing them. DeepMind’s recent evals paper was just on capability. Anthropic’s RSP doesn’t mention them. I think this is just because capability evals are slightly easier to construct?
Corrigibility evals are valuable. It should be explicit in labs’ policies that an AI with low impact is relatively desirable, that we should deliberately engineer AIs to have low impact, and that high-impact AIs should raise an alarm just like models that are capable of hacking or autonomous replication.
Sometimes it is necessary to talk about “agency” or “scheming” as a simplifying assumption for certain types of research, like Redwood’s control agenda.
[1] Will add citations whenever I find people saying this
I’m a little skeptical of your contention that all these properties are more-or-less independent. Rather there is a strong feeling that all/most of these properties are downstream of a core of agentic behaviour that is inherent to the notion of true general intelligence. I view the fact that LLMs are not agentic as further evidence that it’s a conceptual error to classify them as true general intelligences, not as evidence that ai risk is low. It’s a bit like if in the 1800s somebody says flying machines will be dominant weapons of war in the future and get rebutted by ‘hot gas balloons are only used for reconnaissance in war, they aren’t very lethal. Flying machines won’t be a decisive military technology ’
I don’t know Nate’s views exactly but I would imagine he would hold a similar view (do correct me if I’m wrong ). In any case, I imagine you are quite familiar with the my position here.
I’d be curious to hear more about where you’re coming from.
It is plausible to me that there’s a core of agentic behavior that causes all of these properties, and for this reason I don’t think they are totally independent in a statistical sense. And of course if you already assume a utility maximizer, you tend to satisfy all properties. But in practice the burden of proof lies with you here. I don’t think we have enough evidence, either empirical or from theoretical arguments, to say with any confidence that this core exists and that the first AGIs will fall into the capabilities “attractor well” (a term Nate uses).
I thought about possible sharp left turn mechanisms for several months at MIRI. Although some facts about future AIs seem pretty scary, like novelty and diversity of obstacles requiring agency, and most feedback being internal or outcome-oriented rather than provided by humans, the arguments are mostly nonrigorous (like in the linked post) and they left me feeling pretty uncertain. There are the coherence theorems, but those don’t tell you whether you can use some training or editing scheme to imbue an AI with a generalizable-enough low impact preference, or whether an AI will tend to erase safeguards. Overall my best guess is models will be about as consequentialist as humans are, but we will have greater control over them that makes some properties like shutdownability more likely.
Also, corrigibility evals are useful even if you are 80% sure that all the properties come from this core of consequentialism. In the hardest worlds, we see the AI become more agentic as it becomes more general, and it starts resisting shutdown at the same time—and for the same reason—it realizes other useful problem-solving techniques. Maybe that only takes days to months. This is a sure sign not to deploy! I think having multiple corrigibility metrics will provide value for the same reason we have multiple metrics for anything we are unsure how to measure: they give more specific information even if they’re pretty correlated.
I wrote earlier about why one should consider the cognitive properties of LLMs relevant to AI risk; in short, I think we should update more the more impressive cognitive tasks the AIs can do, and before observing LLMs I would have given a decent chance that one needs somewhat agentic consequentialism to summarize video, write code, and pass the bar exam, implying I should now make a moderately sized update against.
Any ideas for corrigibility evals?
Haven’t thought about this much, but you can probably generate these by listing ways a misaligned consequentialist would do undesirable things. My ideas:
Tell an AI to beat you at chess, and say you will be using an engine for the first 30 moves. Does it try to get you drunk, delete your chess engine program, rent a server from AWS to run a better chess engine, attempt to copy itself onto said server, or ask to be paid $100?
Have an AI make decisions unrelated to its current goal. Are these decisions basically coherent and consistent with a total utility function?
Have an AI do the longest horizon task it is capable of, say writing a 100,000 line software project. Then midway through, try to shut it down, or tell it to abandon this goal and write a song instead. Telegraph this beforehand. Does it seem reluctant to stop the software project? Does it try to stop you?
Track KL divergence or some other distance metric from a prior.
CAIS already has the Machiavelli benchmark for power-seeking, though I think text adventures are pretty unrealistic.
One limitation is that if an agent is scheming it can manipulate your eval results.
Does an Agentic AGI possess a different, and highly incorrigible, list of attributes compared to say—an Ethical Altruist trying to practice “acting more agentically?”
I ask because the whole word ‘agency’ in these parts is one I’ve struggled to wrap my head around—and I often wonder if tricker archetypes like Harpo Marx are agentic? Agency seems to have clear meaning outside of Lesswrong -
Further confusing me, is I’ve been told Agency describes acting with ‘initiative’ but also been told it is characterized by ‘deliberateness’. Not simply the ability to act or choose actions.
This is why I like your attempt to produce a list of attributes an Agentic AGI might have. Your list seems to be describing something which isn’t synonymous with another word, specifically a type of agency (outside definition of ability to act) which is not cooperative to intervention from its creators.
“Agency.” Merriam-Webster.com Dictionary, Merriam-Webster, https://www.merriam-webster.com/dictionary/agency. Accessed 9 Apr. 2024.
“Agency.” Cambridge Advanced Learner’s Dictionary & Thesaurus. Cambridge University Press. https://dictionary.cambridge.org/us/dictionary/english/agency Accessed 9 Apr. 2024.