Thanks for the post! Here is my attempt at detailed peer-review feedback. I admit that I’m more excited to do this because you’re asking for it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments).
One thing I really like is the multiple “failure” stories at the beginning. It’s usually frustrating in posts like this to see people argue against positions/arguments that are not written down anywhere. Here we can actually see the problematic arguments.
I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection. I proceeded to offer a plausibility argument for a total disconnect between the two, such that even these coarse-grained adjustments would fail.
I’m not sure if I agree that there is no connection. The mesa-objective comes from the interaction of the outer objective, the training data/environments and the bias of the learning algorithm. So in some sense there is a connection. Although I agree that for the moment we lack a formal connection, which might have been your point.
Again, this strikes me as ignoring the fundamental problem, that we have little to no idea when mesa-optimizers can arise, that we lack formal tools for the analysis of such questions, and that what formal tools we might have thought to apply, have failed to yield any such results.
Completely agreed. I always find such arguments unconvincing, not because I don’t see where the people using them are coming from, but because such impossibility results require a far better understanding of what mesa-optimizers are and do than we currently have.
The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!
Agreed too. I always find it weird when people use that argument, because it has seemed agreed upon in the field for a long time that there are probably simple goal-directed processes in the search space. For instance, I can find a post from Paul’s blog in 2012 where he writes:
This discussion has been brief and has necessarily glossed over several important difficulties. One difficulty is the danger of using computationally unbounded brute force search, given the possibility of short programs which exhibit goal-oriented behavior.
Defining Mesa-Optimization
There’s one approach that you haven’t described (although it’s a bit close to your last one) and which I am particularly excited about: finding an operationalization of goal-directedness, and just defining/redefining mesa-optimizers as learned goal-directed agents. My interpretation of RLO is that it argues that searching for simple competent programs will probably find a goal-directed system AND that this system might have a simple structure “parametrized with a goal” (so basically an inner optimizer). This last assumption was really relevant for making arguments about the sort of architecture likely to be evolved by gradient descent. But I don’t think the arguments are tight enough to convince us that learned goal-directed systems will necessarily have this kind of structure, and the sort of problems mentioned seem just as salient for other goal-directed systems.
I also believe that we’re not necessarily that far from having a usable definition of goal-directedness (based on the thing I presented at AISU, changed according to your feedback), but I know that not everyone agrees. Even if I’m wrong about being able to formalize goal-directedness, I’m pretty convinced that the cluster of intuitions around goal-directedness is what should be applied to a definition of mesa-optimization, because I expect inner alignment problems beyond inner optimizers.
The concept of generalization accuracy misses important issues. For example, a guaranteed very low frequency of errors might still allow an error to be strategically inserted at a very important time.
I really like this argument against using generalization. To be clear on whether I understand you: do you mean that even with very limited error, a mesa-optimizer/goal-directed agent could bide its time and use a single well-placed action to make a catastrophic treacherous turn?
Occam’s razor should only make you think one of the shortest hypotheses that fits your data is going to be correct, not necessarily the shortest one. So, this kind of thinking does not directly imply a lack of malign mesa-optimization in the shortest hypothesis.
A bit tangential, but this line of argument is exactly why I find research on the loss landscape of neural nets frightening for inner alignment. What people try to prove for ML purposes is that there are no (or few) “bad” (high-loss) local minima. But they’re fine with many “good” (low-loss) local minima, and usually find many of them. That is terrible news for inner alignment, because the more “good” local minima there are, the greater the risk that some of them are deceptive mesa-optimizers.
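To put a toy number on that risk (my own illustrative assumption, not anything established about real loss landscapes): if there are N roughly equally good low-loss minima and each one is independently deceptive with some small probability p, then

```latex
% Toy calculation, assuming independence (my assumption, not from the post):
% with p = 10^{-3} and N = 10^{4}, this is already about 0.99995.
\Pr[\text{at least one ``good'' minimum is deceptive}] \;=\; 1 - (1 - p)^N
```

which goes to 1 as N grows, so “many good minima” is exactly the regime where this risk piles up.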
Mutual information between predicting reality and agency may mean mesa-optimizers don’t have to spend extra bits on goal content and planning. In particular, if the reality being predicted contains goal-driven agents, then a mesa-optimizer doesn’t have to spend extra bits on these things, because it already needs to describe them in order to predict well.
I hadn’t thought of it that way, but that does capture nicely the intuition that any agent good enough for a sufficiently complex task will be able to model humans and deception, among other things. That being said, wouldn’t the mesa-optimizer still have to pay the price of maintaining two goals at all times, and of keeping track of what things mean relative to both? Or are you arguing that this mutual information means that the mesa-optimizer will already be modeling many goal-directed systems, and so can just reuse that knowledge/information?
Pure Computational Complexity
About penalizing time complexity, I really like this part of RLO (which is sadly missed by basically everyone as it’s not redescribed in the intro or conclusion):
Furthermore, in the context of machine learning, this analysis suggests that a time complexity penalty (as opposed to a description length penalty) is a double-edged sword. In the second post, we suggested that penalizing time complexity might serve to reduce the likelihood of mesa-optimization. However, the above suggests that doing so would also promote pseudo-alignment in those cases where mesa-optimizers do arise. If the cost of fully modeling the base objective in the mesa-optimizer is large, then a pseudo-aligned mesa-optimizer might be preferred simply because it reduces time complexity, even if it would underperform a robustly aligned mesa-optimizer without such a penalty.
Humans are essentially linear-time algorithms, in the sense that we take the same maximum amount of processing power (ie, that of the human brain) to produce each next output. Anything which produces linearly much output has to do so in at least linear time. So, Levin complexity can’t rule out humanlike intelligence.
I don’t understand what you’re saying. Your first sentence seems to point out that humans are constant-time, not linear time. An algorithm for a fixed size is constant time, after all. The issue here is that we don’t have a scaled version of the algorithms humans are solving (analogous to generalized games). So we can’t discuss the asymptotic complexity of human-brain algorithms. But maybe you actually have an argument related to that which I missed?
One point about the time/description complexity penalty that I feel you don’t emphasize enough: even if there were a threshold under which mesa-optimization doesn’t appear, maybe it’s just too low to be competitive. That’s my main internal reason to doubt complexity penalties as a solution to the emergence of mesa-optimizers.
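Here is a minimal sketch of that worry, with entirely hypothetical losses, complexities, and threshold; it only illustrates the shape of the trade-off, not any actual measurement:

```python
# Toy illustration (all numbers hypothetical). Suppose models below some
# complexity threshold never mesa-optimize, and we enforce that threshold
# as a hard cap. If the threshold is low, everything under it may be
# uncompetitive.
candidates = [
    # (name,                    loss, complexity)
    ("simple, safe",            0.30, 1e6),
    ("competitive, safe",       0.05, 1e9),
    ("competitive, mesa-risky", 0.04, 8e8),
]

MESA_FREE_THRESHOLD = 1e7  # hypothetical threshold below which mesa-optimizers don't appear

allowed = [m for m in candidates if m[2] <= MESA_FREE_THRESHOLD]
best = min(allowed, key=lambda m: m[1])
print(best)  # ('simple, safe', 0.3, 1000000.0) -- safe, but nowhere near competitive
```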
A Note on the Consensus Algorithm
As someone who has been unconvinced by this proposal as a solution to inner alignment, but didn’t take the time to express exactly why, I feel like you did pretty nice work here, and this is probably what I will point people to when they ask about this post.
I also believe that we’re not necessarily that far from having a usable definition of goal-directedness (based on the thing I presented at AISU, changed according to your feedback), but I know that not everyone agrees.
Even with a significantly improved definition of goal-directedness, I think we’d be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign.
But I’m happy to include your approach in the final document!
Even if I’m wrong about being able to formalize goal-directedness, I’m pretty convinced that the cluster of intuitions around goal-directedness is what should be applied to a definition of mesa-optimization, because I expect inner alignment problems beyond inner optimizers.
Can you elaborate on this?
To be clear on whether I understand you: do you mean that even with very limited error, a mesa-optimizer/goal-directed agent could bide its time and use a single well-placed action to make a catastrophic treacherous turn?
Right. Low total error for, eg, imitation learning, might be associated with catastrophic outcomes. This is partly due to the way imitation learning is readily measured in terms of predictive accuracy, when what we really care about is expected utility (although we can’t specify our utility function, which is one reason we may want to lean on imitation, of course).
But even if we measure quality-of-model in terms of expected utility, we can still have a problem, since we’re bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world.
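A rough way to write down both points (my notation, nothing from the post): suppose the imitator deviates from the human only on an event E with D(E) ≤ ε, and the utility gap when it deviates is at most Δ. Then the best guarantee is something like

```latex
% Sketch in my notation: a small-probability deviation event only bounds the
% utility loss by epsilon * Delta, which is vacuous once Delta ~ 1/epsilon.
\mathbb{E}_{x \sim D}\!\big[U(\pi)\big] \;\ge\; \mathbb{E}_{x \sim D}\!\big[U(\pi_{\text{human}})\big] \;-\; \varepsilon \cdot \Delta
```

and a treacherous turn is exactly a case where Δ is enormous; on top of that, the guarantee is only wrt D, not wrt the deployment distribution.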
That being said, wouldn’t the mesa-optimizer still have to pay the price of maintaining two goals at all times, and of keeping track of what things mean relative to both? Or are you arguing that this mutual information means that the mesa-optimizer will already be modeling many goal-directed systems, and so can just reuse that knowledge/information?
Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I’d be happy to hear it, but “systems performing complex tasks in complex environments have to pay that cost anyway” seems like a big problem for arguments of this kind. The question becomes where they put the complexity.
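To sketch the “where they put the complexity” point in description-length terms (a loose Kolmogorov-style accounting on my part, not an argument from the post): for a system that already has to model goal-directed agents in order to do its task,

```latex
% Loose sketch: the marginal description cost of being a mesa-optimizer is
% only the conditional term, which can be tiny if the world model already
% contains the machinery for representing goals and plans.
K(\text{mesa-optimizer}) \;\approx\; K(\text{world model}) \;+\; K(\text{mesa-objective} \mid \text{world model})
```

so the extra bits a penalty could bite on may be negligible.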
I don’t understand what you’re saying. Your first sentence seems to point out that humans are constant-time, not linear time. An algorithm for a fixed size is constant time, after all.
I meant time as a function of data (I’m not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you’d need O(X).
A memoryless alg could be constant time; IE, even though you have an X-long history, you just need to feed it the most recent thing, so its response time is not a function of X. Similarly with finite context windows.
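A minimal toy sketch of the distinction (hypothetical agents, just to pin down what “runtime as a function of the data” means here):

```python
# Toy illustration (my hypothetical agents, not anyone's actual model).
# A memoryless policy answers in O(1) regardless of history length X; an
# agent whose state depends on its whole history needs O(X) work to simulate
# from the raw data, even though each individual update is constant-time.

def memoryless_response(latest_obs):
    # O(1): only the most recent observation matters (cf. finite context windows)
    return len(latest_obs) % 2

def simulate_stateful_response(history):
    # O(X): replay every observation to rebuild the agent's memory state,
    # then answer in constant time from that state
    state = 0
    for obs in history:  # X constant-time updates
        state = (state * 31 + len(obs)) % 1_000_003
    return state % 2

history = [f"obs_{t}" for t in range(10_000)]
print(memoryless_response(history[-1]))     # constant time in X
print(simulate_stateful_response(history))  # linear time in X
```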
The issue here is that we don’t have a scaled version of the algorithms humans are solving (analogous to generalized games). So we can’t discuss the asymptotic complexity of human-brain algorithms.
I agree that in principle we could decode the brain’s algorithms and say “actually, that’s quadratic time” or whatever; EG, quadratic-in-size-of-working-memory or something. This would tell us something about what it would mean to scale up human intelligence. But I don’t think this detracts from the concern about algorithms which are linear-time (and even constant-time) as a function of data. The concern is essentially that there’s nothing stopping such algorithms from being faithful-enough human models, which demonstrates that they could be mesa-optimizers.
But maybe you actually have an argument related to that which I missed?
I think the crux here is what we’re measuring runtime as-a-function-of. LMK if you still think something else is going on.
About penalizing time complexity, I really like this part of RLO (which is sadly missed by basically everyone as it’s not redescribed in the intro or conclusion):
I actually struggled with where to place this in the text. I wanted to discuss the double-edged-sword thing, but, I didn’t find a place where it felt super appropriate to discuss it.
One point about the time/description complexity penalty that I feel you don’t emphasize enough: even if there were a threshold under which mesa-optimization doesn’t appear, maybe it’s just too low to be competitive. That’s my main internal reason to doubt complexity penalties as a solution to the emergence of mesa-optimizers.
Right. I just didn’t discuss this due to wanting to get this out as a quick sketch of where I’m going.
Even with a significantly improved definition of goal-directedness, I think we’d be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign.
Oh, definitely. I think a better definition of goal-directedness is a prerequisite to be able to do that, so it’s only the first step. That being said, I think I’m more optimistic than you on the result, for a couple of reasons:
One way I imagine using a definition of goal-directedness is to filter out very goal-directed systems. A good definition (if one is possible) should clarify whether weakly goal-directed systems can be competitive, as well as the consequences of different parts and aspects of goal-directedness. You can see that as a sort of analogue of the complexity penalties, although it might risk being similarly uncompetitive.
One hope with a definition we can actually toy with is to find some properties of the environments and the behavior of the systems that 1) capture a lot of the information we care about and 2) are easy to abstract. Something like what Alex has done for his POWER-seeking results, where the relevant aspects of the environment are the symmetries it contains.
Even arguing for your point, that evaluating goals and/or goal-directedness of actual NNs would be really hard, is made easier by a deconfused notion of goal-directedness.
Can you elaborate on this?
What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.
That’s one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set of worrying models in that context, with a hopefully clean definition.
But even if we measure quality-of-model in terms of expected utility, we can still have a problem, since we’re bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world.
Maybe irrelevant, but this makes me think of the problem of defining average-case complexity in complexity theory. You can prove things for some distributions over instances of the problem, but it’s really difficult to find a distribution that captures the instances you will meet in the real world. This means that you tend to be limited to worst-case reasoning.
One cool way to address that is through smoothed complexity: the complexity for an instance x is the expected complexity over the distribution on instances created by adding some Gaussian noise to x. I wonder if we can get some guarantees like that, which might improve over worst-case reasoning.
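For reference, the kind of definition I have in mind, stated loosely after Spielman and Teng: instead of the worst case over inputs, you take the worst case of the expected cost under a Gaussian perturbation of the input,

```latex
% Smoothed complexity, stated loosely (following Spielman--Teng): worst case
% over size-n instances of the expected cost after adding Gaussian noise of
% scale sigma to the instance.
C_{\text{smoothed}}(n, \sigma) \;=\; \max_{x \,:\, |x| = n} \; \mathbb{E}_{g \sim \mathcal{N}(0, \sigma^2 I)}\big[ C(x + g) \big]
```

so guarantees only have to hold “near” every instance rather than for a single hand-picked distribution.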
Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I’d be happy to hear it, but “systems performing complex tasks in complex environments have to pay that cost anyway” seems like a big problem for arguments of this kind. The question becomes where they put the complexity.
Agreed. I don’t have such a story, but I think this is a good reframing of the crux underlying this line of argument.
I meant time as a function of data (I’m not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you’d need O(X).
For whatever reason, I thought about complexity depending on the size of the brain, which is really weird. But as complexity depending on the size of the data, I guess this makes more sense? I’m not sure why piling on more data wouldn’t make the reliance on memory more difficult (so something like O(X^2) ?), but I don’t think it’s that important.
I agree that in principle we could decode the brain’s algorithms and say “actually, that’s quadratic time” or whatever; EG, quadratic-in-size-of-working-memory or something. This would tell us something about what it would mean to scale up human intelligence. But I don’t think this detracts from the concern about algorithms which are linear-time (and even constant-time) as a function of data. The concern is essentially that there’s nothing stopping such algorithms from being faithful-enough human models, which demonstrates that they could be mesa-optimizers.
Agreed that this is a pretty strong argument that complexity doesn’t preclude mesa-optimizers.
I actually struggled with where to place this in the text. I wanted to discuss the double-edged-sword thing, but, I didn’t find a place where it felt super appropriate to discuss it.
Maybe in “Why this doesn’t seem to work” for pure computational complexity?
What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.
That’s one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set of worrying models in that context, with a hopefully clean definition.
Ah, on this point, I very much agree.
I’m not sure why piling on more data wouldn’t make the reliance on memory more difficult (so something like O(X^2) ?), but I don’t think it’s that important.
I was treating the brain as fixed in size, so, having some upper bound on memory. Naturally this isn’t quite true in practice (for all we know, healthy million-year-olds might have measurably larger heads if they existed, due to slow brain growth, but either way this seems like a technicality).
I admit that I’m more excited to do this because you’re asking for it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments).
Thanks!
I’m not sure if I agree that there is no connection. The mesa-objective comes from the interaction of the outer objective, the training data/environments and the bias of the learning algorithm. So in some sense there is a connection. Although I agree that for the moment we lack a formal connection, which might have been your point.
Right. By “no connection” I specifically mean “we have no strong reason to posit any specific predictions we can make about mesa-objectives from outer objectives or other details of training”—at least not for training regimes of practical interest. (I will consider this detail for revision.)
I could have also written down my plausibility argument (that there is actually “no connection”), but probably that just distracts from the point here.
(More later!)