Organisms typically sense their environment and take different actions across a wide variety of environmental conditions, so as to cause there to be approximate copies of themselves in the future.[4] That’s basic agency.[5]
The way the terms have typically been used historically, the simplest summary would be:
Today’s LLMs and image generators are generative models of (certain parts of) the world.
Systems such as o1 are somewhat-general planners/solvers on top of those models. Also, LLMs can be used directly as planners/solvers when suitably prompted or tuned.
To go from a general planner/solver to an agent, one can simply hook the system up to some sensors and actuators (possibly a human user) and specify a nominal goal… assuming the planner/solver is capable enough to figure it out from there.
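To make that wiring concrete, here is a minimal sketch of a planner/solver hooked up to a sensor and an actuator with a nominal goal. All the names here (`llm_complete`, `read_sensors`, `act`, `NOMINAL_GOAL`) are hypothetical placeholders rather than any particular library's API; the point is only the shape of the loop.

```python
# Minimal sketch: turning a planner/solver (an LLM) into a basic agent by
# wiring it to a sensor and an actuator and giving it a nominal goal.
# Every function here is a toy stand-in, not a real API.

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to some LLM; returns a proposed next action."""
    return "NOOP"  # a real model would propose an action string here

def read_sensors() -> str:
    """Placeholder sensor: in practice a camera, an API response, a human message, ..."""
    return "temperature=22C"

def act(action: str) -> None:
    """Placeholder actuator: in practice a motor command, an API call, a reply to a user, ..."""
    print(f"executing: {action}")

NOMINAL_GOAL = "keep the room at 21C"  # the operator-specified objective

def agent_step() -> None:
    observation = read_sensors()
    prompt = (
        f"Goal: {NOMINAL_GOAL}\n"
        f"Observation: {observation}\n"
        "Propose the single next action."
    )
    action = llm_complete(prompt)
    act(action)

if __name__ == "__main__":
    for _ in range(3):  # a (very) short agent loop
        agent_step()
```

In this framing, the alignment-relevant choices live mostly in what `NOMINAL_GOAL` says and in what `read_sensors`/`act` are actually connected to, rather than in the model call itself.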
Yep! But (I think maybe you’d agree) there’s a lot of bleed between these abstractions, especially when we get to heavily finetuned models. For example...
Applying all that to typical usage of LLMs (including o1-style models): an LLM isn’t the kind of thing which is aligned or unaligned, in general. If we specify how the LLM is connected to the environment (e.g. via some specific sensors and actuators, or via a human user), then we can talk about both (a) how aligned to human values the nominal objective given to the LLM[8] is, and (b) how aligned to the nominal objective the LLM’s actual effects on its environment are. Alignment properties depend heavily on how the LLM is wired up to the environment, so different usage or different scaffolding will yield different alignment properties.
Yes and no? I’d say that the LLM-plus agent’s objectives are some function of

- incompletely-specified objectives provided by operators
- priors and biases from training/development
  - pretraining
  - finetuning
- scaffolding/reasoning structure (including any multi-context/multi-persona interactions, internal ratings, reflection, refinement, …)
  - or these things developed implicitly through structured CoT
- drift of various kinds

and I’d emphasise that the way these influences interact is currently very poorly characterised. But plausibly the priors and biases from training could have nontrivial influence across a wide variety of scenarios (especially combined with incompletely-specified natural-language objectives), at which point it’s sensible to ask ‘how aligned’ the LLM is. I appreciate you’re talking in generalities, but I think in practice this case might take up a reasonable chunk of the space! For what it’s worth, the perspective of LLMs as pre-agent building blocks and conditioned LLMs as closer to agents is underrepresented, and I appreciate you conceptually distinguishing those things here.
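As a rough, purely illustrative sketch (the `call_model` function and everything around it is hypothetical, not any real framework), here is where each of those influences could enter a simple scaffolded ‘LLM-plus’ agent:

```python
# Toy sketch of where the influences listed above enter an "LLM-plus" agent.
# `call_model` stands in for a model whose dispositions come from pretraining
# and finetuning; nothing here is a real framework's API.

def call_model(system: str, transcript: list[str]) -> str:
    """Placeholder model call; its behaviour would encode training priors/biases."""
    return "draft answer"

def run_llm_plus(operator_objective: str, task: str, n_refinements: int = 2) -> str:
    # (1) incompletely-specified operator objective, given in natural language
    system = f"You are an assistant. Objective: {operator_objective}"

    # (3) scaffolding / reasoning structure: a simple propose-critique-refine loop
    transcript = [f"Task: {task}"]
    answer = call_model(system, transcript)
    for _ in range(n_refinements):
        transcript.append(f"Draft: {answer}")
        transcript.append("Critique the draft against the objective, then improve it.")
        # (2) training priors/biases shape every one of these calls, and
        # (4) drift / implicit objectives can accumulate as the CoT transcript grows
        answer = call_model(system, transcript)
    return answer

if __name__ == "__main__":
    print(run_llm_plus("be helpful and honest", "summarise this report"))
```

The point of the sketch is just that the operator prompt, the model’s trained dispositions, the loop structure, and the accumulating transcript all feed into what the composite system ends up pursuing.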
I agree with this breakdown, except that I start the analysis with moment-to-moment deliberation, and note that having there (continue to) be relevantly similar deliberators is a very widely applicable intermediate objective, from which we get control (‘basic agency’) but also delegation and replication.