Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don’t want to get into exactly what “alignment” means.)
This path apparently implies building goal-oriented systems; all of the subgoals require that there actually is a mesa-objective.
I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)… I think it’s a mistake to think of only mesa-optimizers as having “intent” or being “goal-oriented” unless we start to be more inclusive about what we mean by “mesa-optimizer” and “mesa-objective.” I don’t think those terms as defined in RFLO actually capture humans, but I definitely want to say that we’re “goal-oriented” and have “intent.”
But the graph structure makes perfect sense; I am just doing the mental substitution of “intent alignment means ‘what the model is actually trying to do’ is aligned with ‘what we want it to do’.” (Similar for inner robustness.)
However, I’m not confident that the details of Evan’s locutions are quite right. For example, should alignment be tested only in terms of the very best policy?
I also don’t think optimality is a useful condition in alignment definitions. (Also, a similarly weird move is pulled with “objective robustness,” which is defined in terms of the optimal policy for a model’s behavioral objective… so you’d have to get the behavioral objective, which is specific to your actual policy, and find the actually optimal policy for that objective, to determine objective robustness?)
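To make the oddness of that two-step procedure concrete, here is a toy sketch in a one-step tabular setting. Everything in it is an assumption for illustration: the states, the actions, the `actual_policy`, and especially the `behavioral_objective` table, which stands in for whatever objective one would recover from the policy's behavior. It is not a rendering of Evan's definition, only of the shape of the procedure as I read it.

```python
# Toy one-step setting: states and actions are small integers (hypothetical).
STATES = [0, 1, 2]
ACTIONS = [0, 1]


def actual_policy(state):
    """The policy we actually have (hypothetical)."""
    return 0 if state < 2 else 1


def behavioral_objective(state, action):
    """Step 1 (assumed done): a reward table standing in for the objective
    the actual policy appears to pursue. Recovering it is itself nontrivial."""
    table = {
        (0, 0): 1.0, (0, 1): 0.0,
        (1, 0): 0.2, (1, 1): 0.9,
        (2, 0): 0.0, (2, 1): 1.0,
    }
    return table[(state, action)]


def optimal_policy_for(objective):
    """Step 2: the policy that is actually optimal for that objective."""
    return {s: max(ACTIONS, key=lambda a: objective(s, a)) for s in STATES}


optimal = optimal_policy_for(behavioral_objective)

# States where the policy we have differs from the optimal policy that the
# definition would have us evaluate instead.
disagreements = [s for s in STATES if actual_policy(s) != optimal[s]]
```

In this toy case the two policies disagree in one state, so a property stated about `optimal` is a property of a policy we never actually trained.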
I find myself thinking that objective robustness is actually what I mean by the inner alignment problem. Abergal voiced similar thoughts. But this makes it seem unfortunate that “inner alignment” refers specifically to the thing where there are mesa-optimizers. I’m not sure what to do about this.
Yeah, I also wish we could collectively agree to redefine inner alignment to be more like objective robustness (or at least to be more inclusive of the kinds of inner goals humans have). But I’ve been careful not to use the term to refer to anything except mesa-optimizers, partly to be consistent with Evan’s terminology, but primarily to avoid creating unnecessary confusion among those who strongly associate “inner alignment” with mesa-optimization (although they could also be using a much looser conception of mesa-optimization, if they consider humans to be mesa-optimizers, in which case “inner alignment” pretty much points at the thing I’d want it to point at).
I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)… I think it’s a mistake to think of only mesa-optimizers as having “intent” or being “goal-oriented” unless we start to be more inclusive about what we mean by “mesa-optimizer” and “mesa-objective.” I don’t think those terms as defined in RFLO actually capture humans, but I definitely want to say that we’re “goal-oriented” and have “intent.”
But the graph structure makes perfect sense; I am just doing the mental substitution of “intent alignment means ‘what the model is actually trying to do’ is aligned with ‘what we want it to do’.” (Similar for inner robustness.)
I too am a fan of broadening this a bit, but I am not sure how.
I didn’t really take the time to try to define “mesa-objective” here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.
I agree with your point about using “does this definition include humans” as a filter, and I think it would be easy to mess that up (and I wasn’t thinking about it explicitly until you raised the point).
However, I think possibly you want a very behavioral definition of mesa-objective. If that’s true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.
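As a minimal sketch of what “explicit representation of goal content” could mean in the simplest possible case, consider a toy agent whose goal lives in one named slot. The `ToyAgent` class and its `goal` field are hypothetical; nothing here claims a real network would expose its goal content this cleanly.

```python
from dataclasses import dataclass


@dataclass
class ToyAgent:
    # Hypothetical explicit goal slot: the position the agent moves toward.
    goal: int

    def act(self, position: int) -> int:
        """Return a step of -1, 0, or +1 toward the stored goal."""
        if position < self.goal:
            return 1
        if position > self.goal:
            return -1
        return 0


agent = ToyAgent(goal=5)
before = agent.act(position=3)  # steps right, toward 5

agent.goal = 0  # intervene on the goal content directly
after = agent.act(position=3)  # same situation, now steps left, toward 0
```

Editing the slot changes the apparent objective without touching the rest of the agent, which is the kind of intervention the objective-focused picture presupposes and a purely behavioral definition never needs.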
I didn’t really take the time to try to define “mesa-objective” here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.
This sounds reasonable and similar to the kinds of ideas for understanding agents’ goals as cognitively implemented that I’ve been exploring recently.
However, I think possibly you want a very behavioral definition of mesa-objective. If that’s true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.
The funny thing is I am actually very unsatisfied with a purely behavioral notion of a model’s objective, since a deceptive model would obviously externally appear to be a non-deceptive model in training. I just don’t think there will be one part of the network we can point to and clearly interpret as being some objective function that the rest of the system’s activity is optimizing. Even though I am partial to the generalization-focused approach (in part because it kind of widens the goalposts with the “acceptability” vs. “give the model exactly the correct goal” thing), I still would like to have a more cognitive understanding of a system’s “goals,” because that seems like one of the best ways to make good predictions about how the system will generalize under distributional shift. I’m not against assuming some kind of explicit representation of goal content within a system (for sufficiently powerful systems); I’m just against assuming that that content will look like a mesa-objective as originally defined.
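The worry about purely behavioral definitions can be illustrated with a toy sketch: two policies that agree on every training input and differ only under shift. The input ranges, the `"cooperate"`/`"defect"` labels, and both policies are hypothetical stand-ins, not a model of how deception would actually arise.

```python
# Hypothetical input ranges: what was seen in training vs. after a shift.
TRAIN_INPUTS = range(0, 10)
SHIFTED_INPUTS = range(10, 20)


def intended_policy(x):
    """Does the intended thing everywhere."""
    return "cooperate"


def deceptive_policy(x):
    """Hypothetical: behaves as intended whenever the input looks like training."""
    return "cooperate" if x < 10 else "defect"


def behaviorally_identical(p, q, inputs):
    """A purely behavioral comparison: same outputs on every given input."""
    return all(p(x) == q(x) for x in inputs)


same_in_training = behaviorally_identical(
    intended_policy, deceptive_policy, TRAIN_INPUTS
)
same_under_shift = behaviorally_identical(
    intended_policy, deceptive_policy, SHIFTED_INPUTS
)
```

Any objective read off from training behavior alone is the same for both policies here, which is why some more cognitive handle on “goals” seems needed to predict what happens under distributional shift.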
Seems fair. I’m similarly conflicted. In truth, both the generalization-focused path and the objective-focused path look a bit doomed to me.