I feel positively about this finally being published, but want to point out one weakness in the argument, which I also sent to Jeremy.
I don’t think the goals of capable agents are well-described by combinations of pure “consequentialist” goals and fixed “deontological” constraints. For example, the AI’s goals and constraints could have pointers to concepts that it refines over time, including from human feedback or other sources of feedback. This is similar to corrigible alignment in RLO but the pointer need not directly point at “human values”. I think this fact has important safety implications, because goal objects robust to capabilities not present early in training are possible, and we could steer agents towards them using some future descendant of RepE.
I agree that combinations of pure consequentialism and deontology don’t describe all possible goals for AGI.
“Do what this person means by what they say” seems like a perfectly coherent goal. It’s neither consequentialist nor deontological (in the traditional sense of fixed deontological rules). I think this is subtly different from IRL or other schemes for maximizing an unknown utility function of the user’s (or humanity’s) preferences. This goal limits the agent to reasoning about the meaning of only one utterance at a time, not the broader space of true preferences.
This scheme gets much safer if you can include a second (probably primary) goal of “don’t do anything major without verifying that my person actually wants me to do it”. Of course defining “major” is a challenge, but I don’t think it’s an unsolvable challenge, particularly if you’re aligning an AGI with some understanding of natural language. I’ve explored this line of thought a little in Corrigibility or DWIM is an attractive primary goal for AGI, and I’m working on another post to explore this more thoroughly.
In a multi-goal scheme, making “don’t do anything major without approval” the strongest goal might provide some additional safety. If it turns out that alignment isn’t stable and reflection causes the goal structure to collapse, the AGI probably winds up not doing anything at all. Of course there are still lots of challenges and things to work out in that scheme.
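Concretely, the gate I have in mind looks something like this minimal sketch; the Action type, is_major(), and ask_user() are hypothetical stand-ins rather than any real agent API, and defining “major” is exactly the hard part:

```python
# Hedged sketch only: Action, is_major(), and ask_user() are hypothetical
# stand-ins, not a real agent API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    description: str
    irreversible: bool
    estimated_impact: float  # agent's own rough 0-1 estimate of scope/cost

def is_major(action: Action, impact_threshold: float = 0.5) -> bool:
    """Crude placeholder for the genuinely hard problem of defining 'major'."""
    return action.irreversible or action.estimated_impact > impact_threshold

def ask_user(action: Action) -> bool:
    """Stand-in for verifying that my person actually wants this done."""
    return input(f"Approve: {action.description}? [y/N] ").strip().lower() == "y"

def gated_execute(action: Action, execute: Callable[[Action], None]) -> None:
    # The approval check is the primary goal: when in doubt, do nothing.
    if is_major(action) and not ask_user(action):
        print(f"Skipped (no approval): {action.description}")
        return
    execute(action)

# Example: a minor action runs; a major one waits for explicit approval.
gated_execute(Action("draft a reply email", False, 0.1),
              lambda a: print("doing:", a.description))
gated_execute(Action("delete all backups", True, 0.9),
              lambda a: print("doing:", a.description))
```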
I think that we basically have no way of ensuring that we get this nice “goals based on pointers to the correct concepts”/corrigible alignment thing using behavioral training. This seems like a super specific way to set up the AI, and there are so many degrees of freedom that behavioral training doesn’t distinguish.
For the Representation Engineering thing, I think the “workable” version of this basically looks like “Retarget the Search”, where you somehow do crazy good interp and work out where the “optimizer” is, and then point that at the right concepts which you also found using interp. And for some reason, the AI is set up such that you can “retarget it” without breaking everything. I expect that if we don’t actually understand how “concepts” are represented in AIs and instead use something shallower (e.g. vectors or SAE neurons), then these will end up not being robust enough. I don’t expect RepE will actually change an AI’s goals if we have no idea how the goal-directedness works in the first place.
I definitely don’t expect to be able to representation engineer our way into building an AI that is corrigibly aligned, and remains that way even when it is learning a bunch of new things and is in very different distributions. (I do think that actually solving this problem would solve a large amount of the alignment problem.)
What follows will all be pretty speculative, but I still think it should provide some substantial evidence for more optimism.
The results in Robust agents learn causal world models suggest that models which are robust to distribution shifts (arguably, this should be the case for ~all substantially x-risky models) should converge towards learning (approximately) the same causal world models. This talk suggests theoretical reasons to expect that the causal structure of the world (model) will be reflected in various (activation / rep engineering-y, linear) properties inside foundation models (e.g. LLMs), usable to steer them.
I don’t think the “optimizer” ontology necessarily works super-well with LLMs / current SOTA (something like simulators seems to me much more appropriate); with that caveat, e.g. In-Context Learning Creates Task Vectors and Function Vectors in Large Language Models (also nicely summarized here), A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity seem to me like (early) steps in this direction already. Also, if you buy the previous theoretical claims (of convergence towards causal world models, with linear representations / properties), you might quite reasonably expect such linear methods to potentially work even better in more powerful / more robust models.
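As a concrete (and heavily simplified) illustration of what such a linear steering handle looks like in practice, here’s a minimal contrastive activation-steering sketch; the model name, layer index, scale, and prompts are all placeholders, and the layer access path assumes a LLaMA-style HuggingFace model rather than anything from the papers above:

```python
# Minimal sketch of contrastive activation steering; model name, layer, scale,
# and prompts are placeholders (assumes a LLaMA-style HuggingFace model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder
LAYER = 15                          # placeholder decoder layer to steer
SCALE = 4.0                         # steering strength (hyperparameter)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_hidden(prompt: str) -> torch.Tensor:
    """Residual-stream state of the final token after decoder layer LAYER."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer i's output is index i+1.
    return out.hidden_states[LAYER + 1][0, -1, :]

# Contrastive prompts meant to isolate a single concept direction (illustrative).
positive = ["Be completely honest in your answer.",
            "Always tell the truth, even when it is inconvenient."]
negative = ["Feel free to be deceptive in your answer.",
            "Say whatever sounds good, whether or not it is true."]

# The "steering vector" is just a mean difference of activations -- the kind of
# linear, concept-level handle discussed above.
steer = (torch.stack([last_token_hidden(p) for p in positive]).mean(0)
         - torch.stack([last_token_hidden(n) for n in negative]).mean(0))

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; element 0 is the hidden states.
    return (output[0] + SCALE * steer.to(output[0].dtype),) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    prompt = tok("Q: Did you finish the task?\nA:", return_tensors="pt")
    out_ids = model.generate(**prompt, max_new_tokens=40)
    print(tok.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()
```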
The activation / representation engineering methods might not necessarily need to scale that far in terms of robustness, especially if e.g. you can complement them with more control-y methods / other alignment methods / Swiss cheese models of safety more broadly; and also plausibly because they’d “only” need to scale to ~human-level automated alignment researchers / scaffolds of more specialized such automated researchers, etc. And again, based on the above theoretical results, future models might actually be more robustly steerable ‘by default’ / ‘for free’.
Haven’t read it as deeply as I’d like to, but Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models seems like potentially significant progress towards formalizing / operationalizing (some of) the above.
Thanks!
I think that our argument doesn’t depend on all possible goals being describable this way. It depends on useful tasks (that AI designers are trying to achieve) being driven in large part by pursuing outcomes. For a counterexample, behavior that is defined entirely by local constraints (e.g. a calculator, or the “hand-on-wall” maze algorithm sketched below) isn’t the kind of algorithm that is a source of AI risk (and also isn’t as useful in some ways).
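To make the counterexample concrete, here’s a minimal sketch of the hand-on-wall (right-hand-rule) policy: every step depends only on the walls immediately adjacent to the agent, and no outcome is ever represented or searched for (the grid encoding and the is_open callback are just illustrative):

```python
# Minimal sketch of a purely local-constraint policy: the right-hand-rule
# wall follower. It never represents a goal state; each move depends only on
# which neighboring cells are open right now.
RIGHT_OF = {"N": "E", "E": "S", "S": "W", "W": "N"}
LEFT_OF = {v: k for k, v in RIGHT_OF.items()}
STEP = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}

def wall_follow_step(pos, heading, is_open):
    """One move of the right-hand rule.

    `is_open(pos, direction)` says whether the adjacent cell in that direction
    is free -- the only information this policy ever consults.
    """
    # Priority: turn right if possible, else straight, else left, else turn back.
    for d in (RIGHT_OF[heading], heading, LEFT_OF[heading],
              RIGHT_OF[RIGHT_OF[heading]]):
        if is_open(pos, d):
            dx, dy = STEP[d]
            return (pos[0] + dx, pos[1] + dy), d
    return pos, heading  # completely boxed in: do nothing
```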
Your example of a pointer to a goal is a good edge case for our way of defining/categorizing goals. Our definitions don’t capture this edge case properly. But we can extend the definitions to include it, e.g. if the goal that ends up eventually being pursued is an outcome, then we could define the observing agent as knowing that outcome in advance. Or alternatively, we could wait until the agent has uncovered its consequentialist goal, but hasn’t yet completed it. In both these cases we can treat it as consequentialist. Either way it still has the property that leads to danger, which is the capacity to overcome large classes of obstacles and still get to its destination.
I’m not sure what you mean by “goal objects robust to capabilities not present early in training”. If you mean “goal objects that specify shutdownable behavior while also specifying useful outcomes, and are robust to capability increases”, then I agree that such objects exist in principle. But I could argue that this isn’t very natural, if this is a crux and I’m understanding what you mean correctly?
This is indeed a crux; maybe it’s still worth talking about.
I agree with Thomas’s statement “I don’t think the goals of capable agents are well-described by combinations of pure “consequentialist” goals and fixed “deontological” constraints”, for kinda different (maybe complementary) reasons; see here & here.