However, we could instead define “intent alignment” as “the optimal policy of the mesa objective would be good for humans”.
I agree that we need a notion of “intent” that doesn’t require a purely behavioral notion of a model’s objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea “what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?” I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don’t have a good reason to believe this.)
I want to be able to talk about how we can shape goals that may be messier: perhaps somewhat-competing internal representations, heuristics, or proxies that determine behavior. If we actually want to understand “intent,” we have to understand what the heck intentions and goals actually are in humans and what they might look like in advanced ML systems. That said, I do think you raise a very good point about intent alignment (that it should correspond to the model’s internal goals, objectives, intentions, etc.) and about the need to be mindful of which version we’re using in a given context.
Also, I’m iffy on including the “all inputs”/optimality condition (I believe Rohin is, too). It does have the nice property that it lets you reason without considering e.g. training setup, dataset, or architecture, but we won’t actually have infinite data and optimal models in practice. So I think it’s pretty important to model how different environments or datasets interact with the reward/objective function in producing the intentions and goals of our models.
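For concreteness, here’s one way to write down the definition being discussed (the notation here is mine, not from the post): write $O_{\mathrm{mesa}}$ for the model’s mesa-objective and $\pi^{*}_{O_{\mathrm{mesa}}}$ for a policy that is optimal for it on all inputs. Then the proposal is roughly

```latex
\text{intent aligned} \;\iff\; \pi^{*}_{O_{\mathrm{mesa}}} \text{ would be good for humans,}
```

which makes intent alignment a property of $O_{\mathrm{mesa}}$ alone. The worry above is that the policy we actually get out of training only approximates $\pi^{*}_{O_{\mathrm{mesa}}}$ on the data and environments it was trained on.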
Evan highlights the assumption that solving inner alignment will solve behavioral alignment: he thinks that the most important cases of catastrophic bad behavior are intentional (ie, come from misaligned objectives, either outer objective or inner objective).
I don’t think this is necessarily a crux between the generalization- and objective-driven approaches. If intentional behavior required a literal mesa-objective, then humans, who don’t seem to have a single explicitly represented objective, couldn’t act “intentionally.” So we obviously want a notion of intent that applies to the messier middle cases of goal representation, between a literal mesa-objective and a purely implicit behavioral objective.
but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea “what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?” I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don’t have a good reason to believe this.)
I was surprised to see you saying that Rohin (and you yourself) don’t expect mesa-optimizers to appear in practice, since I recently read the following in a comment of his on Alex Flint’s “The ground of optimization”, which seems to state pretty clearly that he does expect mesa-optimization from AGI development:
Deep learning AGI implies mesa optimization: Since deep learning is so sample inefficient, it cannot reach human levels of performance if we apply deep learning directly to each possible task T. (For example, it has to relearn how the world works separately for each task T.) As a result, if we do get AGI primarily via deep learning, it must be that we used deep learning to create a new optimizing AI system, and that system was the AGI.
Argument for mesa optimization: Due to the complexity and noise in the real world, most economically useful tasks require setting up a robust optimizing system, rather than directly creating the target configuration state. (See also the importance of feedback for more on this intuition.) It seems likely that humans will find it easier to create algorithms that then find AGIs that can create these robust optimizing systems, rather than creating an algorithm that is directly an AGI.
(The previous argument also applies: this is basically just a generalization of the previous point to arbitrary AI systems, instead of only deep learning.)
I want to note that under this approach the notions of “search” and “mesa objective” are less natural, which I see as a pro of this approach (see also here): the argument is that we’ll get a general inner optimizing AI, but it doesn’t say much about what task that AI will be optimizing for (and it could be an optimizing AI that is retargetable by human instructions).
But that comment was from 2 years ago, whereas yours is less than a year old. So perhaps he has changed his views in the meantime? I’d be curious to hear/read more about why neither of you expects mesa-optimizers to appear in practice.
I agree that we need a notion of “intent” that doesn’t require a purely behavioral notion of a model’s objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea “what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?” I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don’t have a good reason to believe this.)
For myself, my reaction is “behavioral objectives also assume a system is well-described as an EU maximizer”. In either case, you’re assuming that you can summarize a policy by a function it optimizes; the difference is whether you think the system itself thinks explicitly in those terms.
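To make “summarize a policy by a function it optimizes” concrete, here’s a deliberately toy sketch; the environment, the candidate objectives, and all the names are invented for illustration. The behavioral objective is just whichever candidate function best explains the observed behavior as if the system were maximizing it.

```python
# Toy sketch (my own construction): a "behavioral objective" is whatever
# function best summarizes a policy *as if* the policy were optimizing it.
STATES = ["low_battery", "ok_battery", "near_goal"]
ACTIONS = ["recharge", "move_toward_goal", "wait"]

# The policy we can only observe from the outside (state -> action).
def observed_policy(state: str) -> str:
    return {"low_battery": "recharge",
            "ok_battery": "move_toward_goal",
            "near_goal": "move_toward_goal"}[state]

# Made-up candidate objectives: functions scoring (state, action) pairs.
CANDIDATE_OBJECTIVES = {
    "maximize_charge": lambda s, a: 1.0 if a == "recharge" else 0.0,
    "reach_goal": lambda s, a: (1.0 if a == "move_toward_goal" else 0.0)
                  + (0.5 if (s == "low_battery" and a == "recharge") else 0.0),
    "do_nothing": lambda s, a: 1.0 if a == "wait" else 0.0,
}

def optimal_action(objective, state):
    """Greedy 'EU-maximizing' action choice under a candidate objective."""
    return max(ACTIONS, key=lambda a: objective(state, a))

def agreement(objective) -> float:
    """Fraction of states where the observed policy matches the objective's
    optimal policy, i.e. how well this objective 'summarizes' the behavior."""
    return sum(observed_policy(s) == optimal_action(objective, s)
               for s in STATES) / len(STATES)

best = max(CANDIDATE_OBJECTIVES, key=lambda n: agreement(CANDIDATE_OBJECTIVES[n]))
print(best)                                                # -> "reach_goal"
print({n: round(agreement(f), 2) for n, f in CANDIDATE_OBJECTIVES.items()})
```

Nothing here depends on whether the system internally represents anything like the winning candidate; the summary is purely behavioral (and, in this toy case, only approximate: it explains two of the three observed choices).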
I haven’t engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don’t resonate at all with other people’s complaints, it seems.
For example, I don’t put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.
In this picture, there is no clear distinction between terminal values and instrumental values. Something is “more terminal” if you treat it as more fixed (you resolve contradictions by updating the other values), and “more instrumental” if its value is more changeable based on other things.
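If it helps, here’s a very rough toy sketch of the contrast I have in mind. Everything in it (the events, the numbers, the “fixedness” knob) is invented for illustration, and I’m certainly not claiming cognition literally looks like this:

```python
# Picture 1: a centrally represented utility function over outcomes;
# expected values are *computed* from it on demand.
UTILITY = {"win": 1.0, "lose": 0.0}                        # made-up outcomes

def expected_value(outcome_probs):
    return sum(p * UTILITY[o] for o, p in outcome_probs.items())

print(expected_value({"win": 0.1, "lose": 0.9}))           # 0.1

# Picture 2: no utility function. Expected values are attached directly to
# events, and "cognition" is a stream of local updates pushing them toward
# coherence with each other (and, not shown here, with experience).
values = {"win": 0.9, "lose": 0.0, "enter_lottery": 0.5, "buy_ticket": 0.4}
# 'fixedness' stands in for the terminal/instrumental spectrum: a value
# treated as nearly fixed plays the "terminal" role, because contradictions
# get resolved by moving the other values instead.
fixedness = {"win": 0.99, "lose": 0.99, "enter_lottery": 0.5, "buy_ticket": 0.1}

def make_coherent(event, successor_probs):
    """Nudge the stored value of `event` toward the probability-weighted
    value of its successors; the more 'fixed' the value, the less it moves."""
    target = sum(p * values[succ] for succ, p in successor_probs.items())
    values[event] += (1.0 - fixedness[event]) * (target - values[event])

make_coherent("enter_lottery", {"win": 0.1, "lose": 0.9})
make_coherent("buy_ticket", {"enter_lottery": 1.0})
print(values)   # the less-fixed values get pulled toward the near-fixed ones
```

Updating on experience would just be another nudge of the same kind, aimed at observed values rather than at successor events.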
I want to be able to talk about how we can shape goals that may be messier: perhaps somewhat-competing internal representations, heuristics, or proxies that determine behavior.
(Possibly you should consider my “approximately coherent expectations” idea.)
I haven’t engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don’t resonate at all with other people’s complaints, it seems.
For example, I don’t put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.
Is this related to your post “An Orthodox Case Against Utility Functions”? It’s been on my to-read list for a while; I’ll be sure to give it a look now.
Right, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)