Many people believe that understanding “agency” is crucial for alignment, but as far as I know, there isn’t a canonical list of reasons why we care about agency. Please describe any reasons why we might care about the concept of agency for understanding alignment below. If you have multiple reasons, please list them in separate answers below.
Please also try to be as specific as possible about what our goal is in the scenario. For example:
We want to know what an agent is so that we can determine whether or not a given AI is a dangerous agent
Whilst useful, this isn’t quite as good as:
We have an AI which may or may not have goals aligned with us and we want to know what will happen if these goals aren’t aligned. In particular, we are worried that the AI may develop instrumental incentives to seek power and we want to use interpretability tools to help us figure out how worried we should be for a particular system.
We can imagine a scale of such systems according to how they behave in novel situations:
On the low end of this scale, we have a system that only follows extremely broad heuristics that it learned during training
On the high end of this scale, we have a system that uses general reasoning capabilities to discover new and innovative strategies even if it has never used anything like this strategy before, or seen it in the training data
We can think of such a scale as a concretisation of agency. This scale provides an indication of both:
How dangerous a system is likely to be: though a system with excellent heuristics and very little agency could be dangerous as well
What kind of precautions we might want to take: for example, how much can we rely on our off-switch to disable the agent
It’s plausible that interpretability tools might be able to give us some idea of where an agent is on this scale just by looking at the weights. So having a clear definition of this scale could help by clarifying what we are looking for.
In a few days, I’ll add any use cases I’m aware of myself that either haven’t been covered or that I don’t think have been adequately explained by different answers.
Agency may be a convergent property of most AI systems (or at least, of many systems people are likely to try to build), once those systems reach a certain capability level. The simplest and most useful way to predict the behavior of such systems may therefore be to model them as agents.
Perhaps we can avoid the problems posed by agency by building only tool AI. In that case, we probably still need a deep understanding of agency to make sure we avoid building an agent by accident. Instrumental convergence may imply that all sufficiently powerful AI systems eventually start looking like agents. Though when a particular system is best modeled as an agent may depend on the particulars of that system, and we may want to push that point out as far as possible.
Boiling this down to a single specific reason why we should care about agency: the concept of agency is likely to be key for creating simple, predictively accurate models of many kinds of powerful AI systems, regardless of whether the builders of those systems:
deeply understand the concepts of agency (or alignment) themselves
are deliberately trying to build an agent, or deliberately trying to avoid that, or just trying to build the most powerful system as fast as possible, without explicitly trying to avoid or target agency at all. (We seem to be in a world in which different people are trying all three of these things simultaneously.)
A few arguments or stubs of arguments for why the bolded claim is correct and important:
Agency is already a useful model of humans and human behavior in many situations.
Agency is already a useful model of some current AI systems: MuZero, Dreamer, and Stockfish, in the domains of their respective game worlds. It might soon be a useful model of constructions like Auto-GPT, in the domain of the real world.
The hypothesis that agency is instrumentally convergent means that it will be important in understanding all AI systems above a certain capability level.
This answer is a little bit confusing to me. You say that “agency” may be an important concept even if we don’t have a deep understanding of what it entails. But how about a simple understanding?
I thought that when people spoke about “agency” and AI, they meant something like “a capacity to set their own final goals”, but then you claim that Stockfish could best be understood by using the concept of “agency”. I don’t see how.
I myself kind of agree with the sentiment in the original post that “agency” is a superfluous concept, but I want to understand the opposite view.
To clarify, I wasn’t arguing that it was a superfluous concept, just trying to collect all the possible reasons together and nudge people towards making the use cases/applications more precise.
I didn’t claim that Stockfish was best understood by using the concept of agency. I claimed agency was one useful model. Consider the first diagram in this post on embedded agency: you can regard Stockfish as Alexi playing a chess video game. By modeling Stockfish as an agent in this situation, you can abstract its internal workings somewhat and predict that it will beat you at chess, even if you can’t predict the exact moves it makes, or why it makes those moves.
> I thought that when people spoke about “agency” and AI, they meant something like “a capacity to set their own final goals”
I think this is not what is meant by agency. Do I have the capacity to set (i.e. change) my own final goals? Maybe so, but I sure don’t want to. My final goals are probably complex and I may be uncertain about what they are, but one day I might be able to extrapolate them into something concrete and coherent.
I would say that an agent is something which is usefully modeled as having any kind of goals at all, regardless of what those goals are. As a system’s ability to achieve those goals increases, the agent model becomes more useful and predictive relative to other models based on (for example) the system’s internal workings. If I want to know whether Kasparov is going to beat me at chess, a detailed model from neuroscience and a scan of his brain are less useful than modeling him as an agent whose goal is to win the game. Similarly, for MuZero, a mechanistic understanding of the artificial neural networks it is composed of is probably less useful than modeling it as a capable Go agent, when predicting what moves it will make.
Is there a fixed, concrete, definition of “convergent property” or “instrumentally convergent” that most folks can agree on?
From what I see, it’s more loosely and vaguely defined than “agency” itself, so it’s not really dispelling the confusion anyone may have.
I don’t know if there’s a standard definition or reference for instrumental convergence other than the LW tag, but convergence in general is a pretty well-known phenomenon.
For example, many biological mechanisms which evolved independently end up looking remarkably similar, because that just happens to be the locally-optimal way of doing things if you’re in the design space of iterative mutation of DNA.
Similarly, in all sorts of engineering fields, methods or tools or mechanisms are often re-invented independently, but end up converging on very functionally or even visually similar designs, because they are trying to accomplish or optimize for roughly the same thing, and there are only so many ways of doing so optimally, given enough constraints.
Instrumental convergence in the context of agency and AI is just that principle applied to strategic thinking and mind design specifically. In that context, it’s more of a hypothesis, since we don’t actually have more than one example of a human-level intelligence being developed. But even if mind designs don’t converge on some kind of optimal form, artificial minds could still be efficient w.r.t. humans, which would have many of the same implications.
This is one of the answers: https://www.alignmentforum.org/posts/FWvzwCDRgcjb9sigb/why-agent-foundations-an-overly-abstract-explanation
Summary: John describes the problems of inner and outer alignment. He also describes the concept of True Names: mathematical formalisations that hold up under optimisation pressure. He suggests that having a “True Name” for optimisers would be useful if we wanted to inspect a trained system for an inner optimiser and not risk missing something.
He further suggests that the concept of agency breaks down into lower-level components like “optimisation”, “goals”, “world models”, etc. It would be possible to make further arguments about how these lower-level concepts are important for AI safety.
Also important to note:
i.e. if we were to pin down something we actually care about, that’d be “a system exhibiting consequentialism”, because those are the kinds of systems that will end up shaping our lightcone and more. Consequentialism is convergent in an optimization process, i.e. the “deep structure of optimization”. Terms like “goals” or “agency” are shadows of consequentialism, finite approximations of this deep structure.
And by virtue of being finite approximations (e.g. they’re embedded), these “agents” have a bunch of convergent properties that make it easy for us to reason about the “deep structure” itself, like e.g. modularity, having a world-model, etc. (check johnswentworth’s comment).
Edit: Also the following quote
Could you please clarify in which sense you use the word “agency”?
One sense is the technical capacity to set oneself problems and choose things to pursue, instead of sitting idle waiting for commands. For example, the difference between AutoGPT and ChatGPT. (Except that AutoGPT doesn’t choose the highest-level problems for itself, though it could trivially be engineered to do so as well.)
I think that the presence or absence of this kind of agency is not relevant to technical alignment:
First, we also want to protect against AI misuse (by humans), and having “non-agentic” but extremely smart AI doesn’t solve that problem. If you try to extrapolate GPT capabilities, it’s obvious that a superintelligent GPT-n (or the systems surrounding it, e.g., content filters, though without loss of generality we can consider it a single AI system) should be extremely well aligned with humanity, so that it itself chooses to solve certain problems and refuse to solve others, and also chooses the specific way of solving this or that problem, even though these problems were given to it by the users.
Second, even “non-agentic” systems like ChatGPT tend to create effectively agentic entities on cultural-techno-evolutionary timescales: see “Why Simulator AIs want to be Active Inference AIs”.
The second meaning of “agency” is synonymous with “resourcefulness” or “intelligence” (in some sense): the ability to overcome obstacles and not back down in the face of challenges. I don’t see how this meaning of “agency” is directly relevant to the alignment question.
The third possible meaning of “agency” is having some intrinsic opinion, volition, tendencies, values, or emotions. The opposite is being completely “neutral”, in some sense. I think complete neutrality just doesn’t physically exist. Every intelligence is biased, and things like inductive biases are not categorically distinct from moral biases, they actually lie on a continuum.
For this notion of agency, of course, we actually care about which exact opinions, tendencies, biases, and values AIs have: that’s the essence of alignment. But I suspect you didn’t have this meaning in mind.
“Could you please clarify in which sense you use the word “agency”?” I’m pretty confused by hearing you ask this question, because my whole point with this question was to clarify what is meant by “agency”.
It’s a bit like if I asked “What do we mean by subjective and objective?” and you replied “Could you please clarify ‘subjective’ and ‘objective’?”; that would seem rather strange to me.
The first sense seems relevant to alignment in that the kinds of worries we might have, and the kinds of things that would reassure us regarding those worries, would seem very different between AutoGPT and ChatGPT, even though we can of course bootstrap an AutoGPT with ChatGPT. The way I see it, “X directly poses threat Y” and “X can be bootstrapped into a system that poses threat Y” are distinct threats, even if we can sometimes collapse this distinction.
The second meaning of agency seems relevant as well. Regarding safety properties, there’s a big difference between a system that has just learned a few heuristics for power-seeking behavior in training and a system that can adapt on the fly to take advantage of any weaknesses in our security during deployment, even if it’s never done anything remotely like that before.
Let me ask a different question: why should I care about alignment for systems that are not agentic in the context of x-risk?
Could you clarify why you’re asking this question? Is it because you’re suggesting that defining “agency” would give us a way to restrict the kinds of systems that we need to consider?
I see some boundary cases: for example, I could theoretically imagine gradient descent creating something that mostly just uses heuristics and isn’t a full agent, but which nonetheless poses an x-risk. That said, this seems somewhat unlikely to me.
AI misuse
(Perhaps unavoidable) agent-like entities in the space of technology, culture, and memes, as I elaborated a bit more in my answer. These are egregores, religions, evolutionary lineages of technology, self-replicating AutoGPT-like viral entities, self-replicating prompts, etc.
To protect from both, even “non-agentic” AI must be aligned.