I had a similar confusion when I first read Evan’s comment. I think the thing that obscures this discussion is the extent to which the word ‘learning’ is overloaded—so I’d vote taboo the term and use more concrete language.
Vlad Mikulik
You might want to look into NMF, which, unlike PCA/SVD, doesn’t aim to create an orthogonal projection. It works well for interpretability because its components cannot cancel each other out, which makes its features more intuitive to reason about. I think it is essentially what you want, although I don’t think it will allow you to find directly the ‘larger set of almost orthogonal vectors’ you’re looking for.
Specification gaming: the flip side of AI ingenuity
I think we basically agree. I would also prefer people to think more about the middle case. Indeed, when I use the term mesa-optimiser, I usually intend to talk about the middle picture, though strictly that’s sinful as the term is tied to Optimisers.
Re: inner alignment
I think it’s basically the right term. I guess in my mind I want to say something like, “Inner Alignment is the problem of aligning objectives across the Mesa≠Base gap”, which shows how the two have slightly different shapes. But the difference isn’t really important.
Inner alignment gap? Inner objective gap?
I’m not talking about finding on optimiser-less definition of goal-directedness that would support the distinction. As you say, that is easy. I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.
As a side note I think the role of the intentional stance here is more subtle than I see it discussed. The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves. So, there is more to the concepts than just taking an interpretative stance, though of course to the extent that the concepts (even when refined by neuroscience) are pieces of a model being used to understand the world, they will form part of an interpretative stance.
I understand that, and I agree with that general principle. My comment was intended to be about where to draw the line between incorrect theory, acceptable theory, and pre-theory.
In particular, I think that while optimisation is too much theory, goal-directedness talk is not, despite being more in theory-land than empirical malign generalisation talk. We should keep thinking of worries on the level of goals, even as we’re still figuring out how to characterise goals precisely. We should also be thinking of worries on the level of what we could observe empirically.
We’re probably in agreement, but I’m not sure what exactly you mean by “retreat to malign generalisation”.
For me, mesa-optimisation’s primary claim isn’t (call it Optimisers) that agents are well-described as optimisers, which I’m happy to drop. It is the claim (call it Mesa≠Base) that whatever the right way to describe them is, in general their intrinsic goals are distinct from the reward.
That’s a specific (if informal) claim about a possible source of malign generalisation. Namely, that when intrinsic goals differ arbitrarily from the reward, then systems that competently pursue them may lead to outcomes that are arbitrarily bad according to the reward. Humans don’t pose a counterexample to that, and it seems prima facie conceptually clarifying, so I wouldn’t throw it away. I’m not sure if you propose to do that, but strictly, that’s what “retreating to malign generalisation” could mean, as malign generalisation itself makes no reference to goals.
One might argue that until we have a good model of goal-directedness, Mesa≠Base reifies goals more than is warranted, so we should drop it. But I don’t think so – so long as one accepts goals as meaningful at all, the underlying model need only admit a distinction between the goal of a system and the criterion according to which a system was selected. I find it hard to imagine a model or view that wouldn’t allow this – this makes sense even in the intentional stance, whose metaphysics for goals is pretty minimal.
It’s a shame that Mesa≠Base is so entangled with Optimisers. When I think of mesa-optimisation, I tend to think more about the former than about the latter. I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers. The Inner Alignment Problem might be it, though it feels like it’s not quite specific enough.
I’m sympathetic to what I see as the message of this post: that talk of mesa-optimisation is too specific given that the practical worry is something like malign generalisation. I agree that it makes extra assumptions on top of that basic worry, which we might not want to make. I would like to see more focus on inner alignment than on mesa-optimisation as such. I’d also like to see a broader view of possible causes for malign generalisation, which doesn’t stick so closely to the analysis in our paper. (In hindsight our analysis could also have benefitted from taking a broader view, but that wasn’t very visible at the time.)
At the same time, speaking only in terms of malign generalisation (and dropping the extra theoretical assumptions of a more specific framework) is too limiting. I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment. I’m not sure that mesa-optimisation is the right view for that, but I do think that the right view will have something to do with goal-directedness.
By that I didn’t mean to imply that we care about mesa-optimisation in particular. I think that this demo working “as intended” is a good demo of an inner alignment failure, which is exciting enough as it is. I just also want to flag that the inner alignment failure doesn’t automatically provide an example of a mesa-optimiser.
I have now seen a few suggestions for environments that demonstrate misaligned mesa-optimisation, and this is one of the best so far. It combines being simple and extensible with being compelling as a demonstration of pseudo-alignment if it works (fails?) as predicted. I think that we will want to explore more sophisticated environments with more possible proxies later, but as a first working demo this seems very promising. Perhaps one could start even without the maze, just a gridworld with keys and boxes.
I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy. There is room for philosophical disagreement there. Even with that, a working demo of this environment would in my opinion be a good thing, as we would have a concrete agent to disagree about.
Ah; this does seem to be an unfortunate confusion.
I didn’t intend to make ‘utility’ and ‘reward’ terminology – that’s what ‘mesa-‘ and ‘base’ objectives are for. I wasn’t aware of the terms being used in the technical sense as in your comment, so I wanted to use utility and reward as friendlier and familiar words for this intuition-building post. I am not currently inclined to rewrite the whole thing using different words because of this clash, but could add a footnote to clear this up. If the utility/reward distinction in your sense becomes accepted terminology, I’ll think about rewriting this.
That said, the distinctions we’re drawing appear to be similar. In your terminology, a utility-maximising agent is an agent which has an internal representation of a goal which it pursues. Whereas a reward-maximising agent does not have a rich internal goal representation but instead a kind of pointer to the external reward signal. To me this suggests your utility/reward tracks a very similar, if not the same, distinction between internal/external that I want to track, but with a difference in emphasis. When either of us says ‘utility ≠ reward’, I think we mean the same distinction, but what we want to draw from that distinction is different. Would you disagree?
Thanks for raising this. While I basically agree with evhub on this, I think it is unfortunate that the linguistic justification is messed up as it is. I’ll try to amend the post to show a bit more sensitivity to the Greek not really working like intended.
Though I also think that “the opposite of meta”-optimiser is basically the right concept, I feel quite dissatisfied with the current terminology, with respect to both the “mesa” and the “optimiser” parts. This is despite us having spent a substantial amount of time and effort on trying to get the terminology right! My takeaway is that it’s just hard to pick terms that are both non-confusing and evocative, especially when naming abstract concepts. (And I don’t think we did that badly, all things considered.)
If you have ideas on how to improve the terms, I would like to hear them!
Utility ≠ Reward
2-D Robustness
You’re completely right; I don’t think we meant to have ‘more formally’ there.
To some extent, but keep in mind that in another sense, the behavioural objective of maximising paperclips is totally consistent with playing along with the base objective for a while and then defecting. So I’m not sure the behaviour/mesa- distinction alone does the work you want it to do even in that case.
I’ve been meaning for a while to read Dennett with reference to this, and actually have a copy of Bacteria to Bach. Can you recommend some choice passages, or is it significantly better to read the entire book?
P.S. I am quite confused about DQN’s status here and don’t wish to suggest that I’m confident it’s an optimiser. Just to point out that it’s plausible we might want to call it one without calling PPO an optimiser.
P.P.S.: I forgot to mention in my previous comment that I enjoyed the objective graph stuff. I think there might be fruitful overlap between that work and the idea we’ve sketched out in our third post on a general way of understanding pseudo-alignment. Our objective graph framework is less developed than yours, so perhaps your machinery could be applied there to get a more precise analysis?
Thanks for an insightful comment. I think your points are good to bring up, and though I will offer a rebuttal I’m not convinced that I am correct about this.
What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?
Distinguishing the “this system looks like it optimises for X” from “this system internally uses an evaluation of X to make decisions” is useful from the point of view of making the abstraction more robust. The former doesn’t make clear what makes the abstraction “work”, and so when to expect it to fail. The latter will at least tell you what kind of failures to expect in the abstraction: places where the evaluation of X doesn’t connect to the rest of the system like it’s supposed to. In particular, you’re right that if the learned environment model doesn’t generalise, the mesa-objective won’t be predictive of behaviour. But that’s actually a prediction of taking this view. On the other hand, it is unclear if taking the behavioural view would predict that the system will change its behaviour off-distribution (partially, because it’s unclear what exactly grounds the similarities in behaviour on-distribution).
I think it definitely is useful to also think about the behavioural objective in the way you describe, because the later concerns we raise basically do also translate to coherent behavioural objectives. And I welcome more work trying to untangle these concepts from one another, or trying to dissolve any of them as unnecessary. I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.
Re: DQN
You’re also right to point out DQN as an interesting edge case. But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function. The Q function is regressed to the episode returns. If the learning goes well, the Q function is literally representing the agent’s objective (indeed, it’s not really selected to maximise return; its selected to be accurate at predicting return). Contrast this with e.g. policy optimisation trained agents, which are not supposed to directly represent an objective, but are supposed to score well on it. (Someone good at running RL experiments maybe should look into comparing the coherence of revealed preferences of DQN agents with PPO agents. I’d read that paper.)
I think humans are fairly weird because we were selected for an objective that is unlikely to be what we select for in our AIs.
That said, if we model AI success as driven by model size and compute (with maybe innovations in low-level architecture), then I think that the way humans represent objectives is probably fairly close to what we ought to expect.
If we model AI success as mainly innovative high-level architecture, then I think we will see more explicitly represented objectives.
My tentative sense is that for AI to be interpretable (and safer) we want it to be the latter kind, but given enough compute the former kind of AI will give better results, other things being equal.
Here, what I mean by low-level architecture is something like “we’ll use lots of LSTMs instead of lots of plain RNNs, but keep the model structure simple: plug in the inputs, pass it through some layers, and read out the action probabilities”, and high-level is something like “let’s organise the model using this enormous flowchart with all of these various pieces that each are designed to take a particular role; here’s the observation embedding, here’s the search in latent model space, here’s the …”
This is not true of RL algorithms in general—If I want, I can make weight updates after every observation. And yet, I suspect that if I meta-train a recurrent policy using such an algorithm on a distribution of bandit tasks, I will get a ‘learning-to-learn’ style policy.
So I think this is a less fundamental reason, though it is true in off-policy RL.