2-D Robustness
This is a short note on a framing that was developed in collaboration with Joar Skalse, Chris van Merwijk and Evan Hubinger while working on Risks from Learned Optimization, but which did not find a natural place in the report.
Mesa-optimisation is a kind of robustness problem, in the following sense:
Since the mesa-optimiser is selected based on performance on the base objective, we expect it (once trained) to have a good policy on the training distribution. That is, we can expect the mesa-optimiser to act in a way that results in outcomes that we want, and to do so competently.
The place where we expect trouble is off-distribution. When the mesa-optimiser is placed in a new situation, there are two distinct failure modes I want to highlight, that is, two ways of producing outcomes that score poorly on the base objective:
The mesa-optimiser fails to generalise in any way, and simply breaks, scoring poorly on the base objective.
The mesa-optimiser robustly and competently achieves an objective that is different from the base objective, thereby scoring poorly on it.
Both of these are failures of robustness, but there is an important distinction to be made between them. In the first failure mode, the agent’s capabilities fail to generalise. In the second, its capabilities generalise, but its objective does not. This second failure mode seems in general more dangerous: if an agent is sufficiently capable, it might, for example, hinder human attempts to shut it down (if its capabilities are robust enough to generalise to situations involving human attempts to shut it down). These failure modes map to what Paul Christiano calls benign and malign failures in Techniques for optimizing worst-case performance.
This distinction suggests a framing of robustness that we have found useful while writing our report: instead of treating robustness as a scalar quantity that measures the degree to which the system continues working off-distribution, we can view robustness as a 2-dimensional quantity. Its two axes are something like “capabilities” and “alignment”, and the failure modes at different points in the space look different.
Unlike the 1-d picture, the 2-d picture suggests that more robustness is not always a good thing. In particular, robustness in capabilities is only good insofar as it is matched by robust alignment between the mesa-objective and the base objective. It may be the case that for some systems, we’d rather the system get totally confused in new situations than remain competent while pursuing the wrong objective.
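One way to see the quadrants of this 2-d picture is as a simple decision table. The sketch below uses labels of my own invention, not formal definitions:

```python
# Toy sketch of the four quadrants of the 2-d robustness picture.
# The string labels are informal shorthand for the failure modes
# discussed above, not formal definitions.

def off_distribution_outcome(capabilities_generalise: bool,
                             objective_stays_aligned: bool) -> str:
    if not capabilities_generalise:
        # First failure mode: the system simply breaks.
        return "incompetent"
    if objective_stays_aligned:
        return "robustly aligned"
    # Second failure mode: competent pursuit of the wrong objective;
    # the quadrant where more capability robustness makes things worse.
    return "competently misaligned"

print(off_distribution_outcome(True, False))
```

The asymmetry between the two off-diagonal quadrants is the whole point: "incompetent" and "competently misaligned" both score poorly on the base objective, but they are very different failures.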
Of course, there is a reason why we usually think of robustness as a scalar: one can define clear metrics for how well the system generalises, in terms of the difference between performance on the base objective on- and off-distribution. In contrast, 2-d robustness does not yet have an obvious way to ground its two axes in measurable quantities. Nevertheless, as an intuitive framing I find it quite compelling, and invite you to also think in these terms.
One way to try to measure capability robustness separately from alignment robustness off the training distribution of some system would be to:
- use an inverse reinforcement learning algorithm to infer the reward function of the off-distribution behaviour,
- train a new system to do as well on that reward function as the original system, and
- measure the number of training steps the new system needs to reach this point.
This would let you make comparisons between different systems as to which was more capability robust.
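As a very rough illustration of this procedure, here is a self-contained toy version on a 3-armed bandit, where the "IRL step" is replaced by the max-entropy identity reward(a) = log π(a); every detail here is a made-up stand-in for the real machinery the proposal would need:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# "Original system": its observed off-distribution behaviour, here just a
# fixed action distribution over three bandit arms.
original_policy = [0.7, 0.2, 0.1]

# Step 1: infer a reward from behaviour (max-ent identity for a bandit).
inferred_reward = [math.log(p) for p in original_policy]

# The original system's performance under the inferred reward.
target = sum(p * r for p, r in zip(original_policy, inferred_reward))

# Step 2: train a fresh softmax policy on the inferred reward, counting
# gradient steps until it does as well as the original system.
logits = [0.0, 0.0, 0.0]
steps = 0
while steps < 10_000:
    probs = softmax(logits)
    perf = sum(p * r for p, r in zip(probs, inferred_reward))
    if perf >= target:
        break
    # Exact policy gradient of E_pi[r]: d/d logit_a = p_a * (r_a - E[r]).
    logits = [l + 0.5 * p * (r - perf)
              for l, p, r in zip(logits, probs, inferred_reward)]
    steps += 1

# Step 3: `steps` is the (toy) capability-robustness cost; lower means the
# original system's off-distribution competence was easier to reproduce.
print(steps)
```

Comparing this step count across systems, with the same IRL and training machinery held fixed, is the proposed comparison.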
Maybe there’s a version that could train the new system using behavioural cloning, but it’s less clear how you’d measure when the new system is as competent as the original agent (maybe using a discriminator?).
The reason for trying this is to have a measure of competence that is less dependent on human judgement and closer to the system’s ontology and capabilities.
Are there any examples of capability robustness failures that aren’t objective robustness failures?
Yes, the classic adversarial panda example: the model is only a classifier, with no mesa-optimizer, so we have only a capability robustness problem.
Thanks for the example, but why is this a capability robustness problem and not an objective robustness problem, if we think of the objective as ‘classify pandas accurately’?
Insofar as it’s not a capability problem, I think it’s an example of Goodharting and not inner misalignment/mesa-optimization. The given objective (“minimize cross-entropy loss”) is optimized on-distribution by incorporating non-robust features (and it also gives no incentive to be fully robust to adversarial examples, so even non-robust features that don’t really help with performance could still persist after training).
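The role of non-robust features can be seen even in a linear model. The numpy sketch below is not the panda classifier, just the same mechanism with invented numbers: many tiny weights each carry a little signal, so a perturbation that is small in every coordinate can still flip the decision.

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear "classifier" whose weights include many small, non-robust
# components: each contributes little signal but is easy to exploit.
w = rng.normal(0, 1, size=1000)

x = rng.normal(0, 1, size=1000)
clean_score = w @ x  # sign of this is the "classification"

# FGSM-style perturbation: nudge each coordinate by at most epsilon in the
# direction that flips the score. The per-coordinate change is tiny, but
# the effect accumulates across all 1000 dimensions.
epsilon = 0.25
x_adv = x - epsilon * np.sign(w) * np.sign(clean_score)
adv_score = w @ x_adv

print(clean_score, adv_score)
```

The training objective rewarded using every scrap of on-distribution signal, so the model leans on features an adversary can cheaply push around.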
You might argue that there is no alignment without capabilities, since a sufficiently dumb model “can’t know what you want”. But I think you can come up with clean examples of capability failures if you look at, say, robots that use search to plan; they often do poorly according to the manually specified reward function in new domains because optimizing the reward is too hard for their search algorithm.
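A minimal illustration of that failure mode (the landscapes and numbers are invented): a greedy hill-climbing “planner” optimizes the same kind of distance-to-goal reward in two domains, and fails in the one where the goal sits beyond a low-reward valley.

```python
# Local search as a stand-in for a gradient-based trajectory planner.
def hill_climb(reward, start, step=1, iters=100):
    x = start
    for _ in range(iters):
        best = max([x - step, x, x + step], key=reward)
        if best == x:
            break  # local optimum: the planner gives up here
        x = best
    return x

# Easy domain: unimodal reward, local search suffices.
easy_reward = lambda x: -abs(x - 50)
# Hard domain: the same optimum at x=50, but it sits beyond a wide
# low-reward valley, so greedy steps from the start never cross it.
hard_reward = lambda x: 100 - abs(x - 50) if abs(x - 50) <= 5 else -abs(x)

print(hill_climb(easy_reward, start=0))  # reaches 50
print(hill_climb(hard_reward, start=0))  # stuck at 0
```

The reward function is identical in spirit in both domains; only the difficulty of optimizing it changed, which is why this reads naturally as a capability failure.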
Of course, you can always argue that this is just an alignment failure one step up; if you perform Inverse Optimal Control on the behavior of the robot and derive a revealed reward function, you’ll find that its revealed reward differs from the reward function that was actually specified. In other words, you can invoke the fact that for biased agents, there doesn’t seem to be a super principled way of dividing up capabilities and preferences in the first place. I think this is probably going too far; there are still examples in practice (like the intractable search example) where thinking about it as a capability robustness issue is more natural than thinking about it as an objective robustness problem.
Thanks for the reply!
I’d be interested to see actual examples of this, if there are any. But also, how would this not be an objective robustness failure if we frame the objective as “maximize reward”?
Do you mean to say that its reward function will be indistinguishable from its policy?
Interesting paper, thanks! If a policy cannot be decomposed into a planning algorithm and a reward function anyway, it’s unclear to me why 2D-robustness would be a better framing of robustness than just 1D-robustness.
I have some toy examples from a paper I worked on: https://proceedings.mlr.press/v144/jain21a
But I think this is a well known issue in robotics, because SOTA trajectory planning is often gradient-based (i.e. local). You definitely see this on any “hard” robotics task where initializing a halfway decent trajectory is hard. I’ve heard from Anca Dragan (my PhD advisor) that this happens with actual self driving car planners as well.
Oops, sorry, the answer got cut off somehow. I meant to say that if you take a planner that’s suboptimal, look at the policy it outputs, and then rationalize that policy assuming that the planner is optimal, you’ll get a reward function that is different from the reward function you put in. (Basically what the Armstrong + Mindermann paper says.)
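A toy version of that point (all numbers invented): feed a known reward into a deliberately suboptimal planner, then invert the resulting policy under the assumption that the planner is perfectly Boltzmann-rational; the recovered reward is not the one we put in.

```python
import math

true_reward = [1.0, 0.0]  # two actions; action 0 is better

# Suboptimal planner: intends softmax(true_reward), but its search fails
# 30% of the time and falls back to a uniformly random action.
def suboptimal_policy(reward, failure_rate=0.3):
    exps = [math.exp(r) for r in reward]
    z = sum(exps)
    soft = [e / z for e in exps]
    n = len(reward)
    return [(1 - failure_rate) * p + failure_rate / n for p in soft]

policy = suboptimal_policy(true_reward)

# Rationalization assuming optimality: invert the softmax, i.e. treat
# reward(a) = log pi(a), normalized relative to the last action.
recovered_reward = [math.log(p) - math.log(policy[-1]) for p in policy]
normalized_true = [r - true_reward[-1] for r in true_reward]

print(normalized_true)    # [1.0, 0.0]
print(recovered_reward)   # gap shrinks below 1.0: not the reward we put in
```

The planner’s incompetence gets folded into the inferred preferences, which is exactly the ambiguity the Armstrong + Mindermann result formalizes.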
Well, the paper doesn’t show that you can’t decompose it, but merely that the naive way of decomposing observed behavior into capabilities and objectives doesn’t work without additional assumptions. But we have additional information all the time! People can tell when other people are failing due to incompetence rather than misalignment, for example. And in practice, we can often guess whether a failure is due to capability limitations or to objective robustness failures, for example by doing experiments with fine-tuning or prompting.
The reason we care about 2-d robustness is that capability failures seem much more benign than alignment failures. Besides the reasons given in the main post, we might also expect capability failures to go away with scale, while alignment failures become worse with scale. So knowing whether something is a capability robustness failure or an alignment one can inform you as to the importance and neglectedness of research directions.
Thanks—this helps.