Vlad Mikulik

Karma: 751

Risks from Learned Optimization: Conclusion and Related Work

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 7, 2019, 7:53 PM

82 points

5 comments · 6 min read · LW link

Vlad Mikulik Jun 6, 2019, 7:10 AM
LW: 4 AF: 3
AF
in reply to: Rohin Shah’s comment on: Conditions for Mesa-Optimization
Yes, it probably doesn’t apply to most objectives. Though it seems to me that the closer the task is to something distinctly human, the more probable it is that this kind of consideration can apply. E.g., making judgements in criminal court cases and writing fiction are domains where it’s not implausible to me that this could apply.

I do think this is a pretty speculative argument, even for this sequence.

Deceptive Alignment

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 5, 2019, 8:16 PM

118 points

20 comments · 17 min read · LW link

Vlad Mikulik Jun 4, 2019, 10:51 AM
LW: 4 AF: 3
AF
in reply to: Rohin Shah’s comment on: Conditions for Mesa-Optimization
The main benefit I see of hardcoding optimisation is that, assuming the system’s pieces learn as intended (without any mesa-optimisation happening in addition to the hardcoded optimisation), you get more access and control as a programmer over what the learned objective actually is. You could attempt to regress the learned objective directly to a goal you want, or attempt to enforce a certain form on it, etc. When the optimisation itself is learned*, the optimiser is more opaque, and you have fewer ways to affect what goal is learned: which weights of your enormous LSTM-based mesa-optimiser represent the objective?
This doesn’t solve the problem completely (you might still learn an objective that is very incorrect off-distribution, etc.), but could offer more control and insight into the system to the programmer.
*Of course, you can have learned optimisation where you keep track of the objective which is being optimised (like in Learning to Learn by Gradient Descent), but I’d class that more under hard-coded optimisation for the purposes of this discussion. Here I mean the kind of learned optimisation that happens where you’re not building the architecture explicitly around optimising or learning to optimise.

Vlad Mikulik Jun 4, 2019, 10:41 AM
LW: 4 AF: 3
AF
in reply to: Rohin Shah’s comment on: Conditions for Mesa-Optimization
The section on human modelling annoyingly conflates two senses of human modelling. One is the sense you talk about, the other is seen in the example:
For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.
The idea there isn’t that the algorithm simulates human judgement as an external source of information for itself, but that the actual algorithm learns to be a human-like reasoner, with human-like goals (because that’s a good way of approximating the output of human-like reasoning). In that case, the agent really is a mesa-optimiser, to the degree that a goal-directed human-like reasoner is an optimiser.
(I’m not sure to what degree it’s actually likely that a good way to approximate the behaviour of human-like reasoning is to instantiate human-like reasoning)

The Inner Alignment Problem

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 4, 2019, 1:20 AM

105 points

17 comments · 13 min read · LW link

Vlad Mikulik Jun 2, 2019, 8:03 PM
LW: 12 AF: 7
AF
in reply to: abramdemski’s comment on: Selection vs Control

to what extent are mesa-controllers with simple behavioural objectives going to be simple?

I’m not sure what “simple behavioural objective” really means. But I’d expect that for tasks requiring very simple policies, controllers would do, whereas the more complicated the policy required to solve a task, the more one would need to do some kind of search. Is this what we observe? I’m not sure. AlphaStar and OpenAI Five seem to do well enough in relatively complex domains without any explicit search built into the architecture. Are they using their recurrence to search internally? Who knows. I doubt it, but it’s not implausible.

certain kinds of mesa-controllers can be simple: the mesa-controllers which are more like my rocket example (explicit world-model; explicit representation of objective within that world model; but, optimal policy does not use any search).

The rocket example is interesting. I guess the question for me there is, what sorts of tasks admit an optimal policy that can be represented in this way? Here it also seems to me like the more complex an environment, the more implausible it seems that a powerful policy can be successfully represented with straightforward functions. E.g., let’s say we want a rocket not just to get to the target, but to self-identify a good target in an area and pick a trajectory that evades countermeasures. I would be somewhat surprised if we can still represent the best policy as a set of non-searchy functions. So I have this intuition that for complex state spaces, it’s hard to find pure controllers that do the job well.

Vlad Mikulik Jun 2, 2019, 3:38 PM
LW: 20 AF: 12
AF
on: Selection vs Control
(I am unfortunately currently bogged down with external academic pressures, and so cannot engage with this at the depth I’d like to, but here are some initial thoughts.)

I endorse this post. The distinction explained here seems interesting and fruitful.

I agree with the idea to treat selection and control as two kinds of analysis, rather than as two kinds of object – I think this loosely maps onto the distinction we make between the mesa-objective and the behavioural objective. The former takes the selection view of the learned algorithm; the latter takes the control view.

At least speaking for myself (the other authors might have different thoughts on this), the decision to talk explicitly in terms of the selection view in the mesa-optimiser post is based on an intuition that selectors, in general, have more coherently defined counterfactual behaviour. That is, given a very different input, a selector will still select an output that scores well on its mesa-objective, because that’s how selectors work. Whereas a controller, to the degree it optimises for an objective, seems more likely to just completely stop working on a different input. I have fairly low confidence in this argument, however: it seems to me that one can plausibly have pretty coherent counterfactual behaviour in a very broad distribution even without doing selection. And since it is ultimately the behaviour that does the damage, it would be good to have a working distinction that is based purely on that. We (the mesa-optimisation authors) haven’t been able to come up with one.
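
As a toy illustration of the counterfactual-behaviour intuition above (not an example from the post, and subject to the same low confidence), the sketch below contrasts a selector that searches over actions against a controller that is a fixed lookup fitted to a few familiar states. The names `TARGET`, `CANDIDATES`, `objective`, `selector` and `controller` are all hypothetical.

```python
TARGET = 5                      # hypothetical goal: make state + action equal 5
CANDIDATES = range(-100, 101)   # actions the selector is allowed to search over


def objective(state, action):
    """Score an action: higher is better, 0 means the target is hit exactly."""
    return -abs(state + action - TARGET)


def selector(state):
    """Selection: score every candidate action and return the best one."""
    return max(CANDIDATES, key=lambda action: objective(state, action))


def controller(state):
    """Control: a fixed response table fitted to states 0, 5 and 10; no search."""
    responses = {0: 5, 5: 0, 10: -5}
    return responses.get(state, 0)  # falls back to "do nothing" elsewhere


for state in (0, 10, 50):
    print(state, selector(state), controller(state))
```

On the familiar states the two agree; on the unfamiliar state 50 the selector still finds an action that scores well, while this particular controller falls back to doing nothing. A better-designed controller could of course generalise further, which is the caveat raised in the paragraph above.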

Another reason to be interested in selectors is that in RL, the learned algorithm is supposed to fill a controller role. So, restricting attention to selectors allows us to talk at least somewhat meaningfully about non-optimiser agents, which is otherwise difficult, as any learned agent is in a controller-shaped context.

In any case, I hope that more work happens on this problem, either dissolving the need to talk about optimisation, or at least making all these distinctions more precise. The vagueness of everything is currently my biggest worry about the legitimacy of mesa-optimiser concerns.

Conditions for Mesa-Optimization

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 1, 2019, 8:52 PM

84 points

48 comments · 12 min read · LW link

Risks from Learned Optimization: Introduction

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

May 31, 2019, 11:44 PM

187 points

42 comments · 12 min read · LW link · 3 reviews

Vlad Mikulik Mar 27, 2019, 12:22 AM
LW: 3 AF: 2
AF
in reply to: TurnTrout’s comment on: More realistic tales of doom
The goal that the agent is selected to score well on is not necessarily the goal that the agent is itself pursuing. So, unless the agent’s internal goal matches the goal for which it’s selected, the agent might still seek influence because its internal goal permits that. I think this is in part what Paul means by “Avoiding end-to-end optimization may help prevent the emergence of influence-seeking behaviors (by improving human understanding of and hence control over the kind of reasoning that emerges)”

Vlad Mikulik Jul 17, 2018, 5:27 AM
2 points
on: Probabilistic decision-making as an anxiety-reduction technique
I also use a simple version of this, with a key extra step at the end:

1) Have a decision you are unsure about.
2) Perform randomisation (I usually just use a coin).
3) Notice how the outcome makes you feel. If you find that you wish the coin landed the other way, override the decision and do what you secretly wanted to do all along.

You might think the third step defeats the purpose of the exercise, but so long as you actually commit to following the randomisation most of the time, it gives you direct access to very useful information. It also sets up the right incentive, wherein you never really need to work your willpower against your desires (except, I guess, the desire to deliberate more).

I mostly use this for a slightly different use case – inconsequential decisions like where to eat or small purchases, where taking a lot of time to optimise isn’t worth it. Your mileage may vary with more important decisions, but I see no reason in principle this couldn’t work.
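
The three steps above can be sketched in a few lines of Python. This is only a minimal sketch: `decide` and the `wishes_other_way` callback are hypothetical names, and the callback stands in for the introspective third step, which code obviously cannot do for you.

```python
import random


def decide(option_a, option_b, wishes_other_way, rng=random):
    """Choose between two options by coin flip, with an override.

    wishes_other_way is a hypothetical callback for step 3: it receives the
    coin's pick and returns True if you wish the coin had landed the other way.
    """
    # Step 2: randomise with a fair coin.
    if rng.random() < 0.5:
        coin_pick, other = option_a, option_b
    else:
        coin_pick, other = option_b, option_a
    # Step 3: override if the outcome reveals a preference for the other option.
    if wishes_other_way(coin_pick):
        return other
    return coin_pick


# If you secretly want Thai food, the coin never gets to impose pizza on you.
print(decide("thai", "pizza", lambda pick: pick == "pizza"))
```

When the callback never objects, the result is simply whatever the coin picked, which matches the commitment to follow the randomisation most of the time.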

Vlad Mikulik Jul 11, 2018, 10:16 PM
LW: 1 AF: 1
AF
in reply to: AlexMennen’s comment on: Clarifying Consequentialists in the Solomonoff Prior
I agree. That’s what I meant when I wrote that there will be TMs that artificially promote S itself. However, this would still mean that most of S’s mass in the prior would be due to these TMs, and not due to the natural generator of the string.

Furthermore, it’s unclear how many TMs would promote S vs S’ or other alternatives. Because of this, I don’t know whether the prior would be higher for S or S’ from this reasoning alone. Whichever is the case, the prior no longer reflects meaningful information about the universe that generates S and whose inhabitants are using the prefix to choose what to do; it’s dominated by these TMs that search for prefixes they can attempt to influence.

Vlad Mikulik Jul 11, 2018, 7:55 PM
LW: 1 AF: 1
AF
in reply to: AlexMennen’s comment on: Clarifying Consequentialists in the Solomonoff Prior
I agree that this probably happens when you set out to mess with an arbitrary particular S, i.e. try to make some S’ that shares a prefix with S as likely as S.

However, some S are special, in the sense that their prefixes are being used to make very important decisions. If you, as a malicious TM in the prior, perform an exhaustive search of universes, you can narrow down your options to only a few prefixes used to make pivotal decisions; selecting one of those to mess with is then very cheap to specify. I use S to refer to those strings that are the ‘natural’ continuation of those cheap-to-specify prefixes.

There are, it seems to me, a bunch of other equally-complex TMs that want to make other strings that share that prefix more likely, including some that promote S itself. What the resulting balance looks like is unclear to me, but what’s clear is that the prior is malign with respect to that prefix: conditioning on that prefix gives you a distribution almost entirely controlled by these malign TMs. The ‘natural’ complexity of S, or of other strings that share the prefix, plays almost no role in their priors.

The above is of course conditional on this exhaustive search being possible, which also relies on there being anyone in any universe that actually uses the prior to make decisions. Otherwise, we can’t select the prefixes that can be messed with.
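
For readers who want the claim in symbols, here is one way to write it down. The notation is an assumption made for this sketch and does not come from the thread: U is a universal monotone machine, ℓ(p) is the length of program p, the sum ranges over minimal programs whose output begins with x, and the split into "natural" and "manipulating" programs is purely illustrative.

```latex
% Standard form of the Solomonoff prior over finite strings x
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}

% Conditioning on a prefix x of S
M(S \mid x) \;=\; \frac{M(S)}{M(x)}

% Illustrative split of S's mass by the kind of program that produces it
M(S) \;=\; M_{\mathrm{nat}}(S) \;+\; M_{\mathrm{manip}}(S)
```

In this notation, the argument above says that for the special prefixes x used in pivotal decisions, the manipulating term dominates the natural one for S and for the competing continuations alike, so M(· | x) is set almost entirely by the manipulating programs.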

Vlad Mikulik Jul 11, 2018, 4:47 AM
LW: 2 AF: 2
AF
in reply to: eric_langlois’s comment on: Clarifying Consequentialists in the Solomonoff Prior
The trigger sequence is a cool idea.
I want to add that the intended generator TM also needs to specify a start-to-read time, so there is symmetry there. Whatever method a TM needs to use to select the camera start time in the intended generator for the real world samples, it can also use in the simulated world with alien life, since for the scheme to work only the difference in complexity between the two matters.
There is additional flexibility in that, unlike the intended generator, the reasoner TM can sample its universe simulation at any cheaply computable interval, giving the civilisation the option of choosing how much thinking to perform between outputs.

Clarifying Consequentialists in the Solomonoff Prior

Vlad Mikulik Jul 11, 2018, 2:35 AM

20 points

16 comments · 6 min read · LW link
