Evan Hubinger (he/him/his) (evanjhub@gmail.com)
I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I’m joining Anthropic”
There are a couple of pieces of this that I disagree with:
I think claim 1 is wrong because even if the memory is unhelpful, an agent that uses it might still be simpler than any non-agentic alternative, so you might end up with an agent anyway. My intuition is that specifying a utility function and an optimization process is often much easier than specifying the complete details of the actual solution, and thus any sort of program-search-based optimization process (e.g., gradient descent on a neural network) has a good chance of finding an agent.
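To make the simplicity intuition concrete, here is a hypothetical toy illustration (not from the original post): the task is to produce a sorted list, and the "agent-like" program is just an objective plus a generic search process, which is much shorter to specify than writing out the full sorting mechanism by hand.

```python
# Toy illustration of the simplicity argument: a program that specifies
# only an objective (is_sorted) plus a generic optimization process
# (search over permutations) can be much shorter to describe than the
# explicit step-by-step solution it ends up implementing.
from itertools import permutations

data = [5, 3, 8, 1, 9, 2]

def is_sorted(xs):
    # The "utility function": is this candidate output in order?
    return all(a <= b for a, b in zip(xs, xs[1:]))

# The "optimization process": search candidate outputs until the
# objective is satisfied. Specifying the explicit solution instead
# (comparisons, swaps, loop invariants) takes far more description.
agent_answer = next(list(p) for p in permutations(data) if is_sorted(p))

assert agent_answer == sorted(data)
```

The analogy to gradient descent is loose (permutation search is not gradient descent), but the point is the same: the objective-plus-search program is the shorter one, so a simplicity-biased program search may prefer it.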
I think claim 3 is wrong because agenty solutions exist for all tasks, even classification tasks. For example, consider the function that spins up an agent, hands that agent the classification task, and then returns the agent's output. Unless you've done something to explicitly remove agents from your search space, this sort of solution always exists.
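A minimal sketch of that construction, with hypothetical names (`Agent`, `classify`) chosen purely for illustration: from the outside, `classify` is an ordinary classifier, but internally it delegates to a goal-directed process.

```python
# Hypothetical sketch: a "classifier" that internally spins up an agent,
# tasks it with the classification problem, and returns the agent's output.

class Agent:
    """A minimal goal-directed process: pick the label that maximizes
    an internal scoring objective."""
    def __init__(self, labels):
        self.labels = labels

    def score(self, x, label):
        # Toy objective: positive inputs favor "positive", else "negative".
        return x if label == "positive" else -x

    def act(self, x):
        # "Optimize": choose the label with the highest score on the input.
        return max(self.labels, key=lambda label: self.score(x, label))

def classify(x):
    # Externally just a function from inputs to labels...
    agent = Agent(["positive", "negative"])
    # ...but internally it delegates to an agent and returns its output.
    return agent.act(x)

assert classify(3.0) == "positive"
assert classify(-2.0) == "negative"
```

Nothing about the input-output type of a classification task rules this program out; only an explicit restriction on the search space would.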
Thus, I think claim 6 is wrong as well, since it rests on claims 1 and 3, per my complaints above.