Even with a significantly improved definition of goal-directedness, I think we’d be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but any given environment will only ever be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign.
Oh, definitely. I think a better definition of goal-directedness is a prerequisite to be able to do that, so it’s only the first step. That being said, I think I’m more optimistic than you on the result, for a couple of reasons:
One way I imagine using a definition of goal-directedness is to filter against very goal-directed systems. A good definition (if one is possible) should clarify whether systems with low goal-directedness can be competitive, as well as the consequences of the different parts and aspects of goal-directedness. You can see this as somewhat analogous to complexity penalties (there’s a rough sketch of this below), although it might risk being similarly uncompetitive.
One hope with a definition we can actually toy with is to find some properties of the environments and the behavior of the systems that 1) capture a lot of the information we care about and 2) are easy to abstract. Something like what Alex has done for his POWER-seeking results, where the relevant aspects of the environment are the symmetries it contains.
Even arguing for your point, that evaluating goals and/or goal-directedness of actual NNs would be really hard, is made easier by a deconfused notion of goal-directedness.
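To make the “filter” idea concrete, here is a deliberately hypothetical sketch: `goal_directedness` stands in for the kind of quantitative measure a deconfused definition might eventually provide (nothing like it exists today), used the same way a complexity penalty would be during model selection.

```python
# Hypothetical sketch: penalizing goal-directedness during model selection,
# by analogy with a complexity penalty. `goal_directedness` is a placeholder
# for a measure we do not yet know how to define.

def select_model(candidates, task_loss, goal_directedness, penalty_weight=1.0):
    """Pick the candidate with the best task performance after adding
    a penalty proportional to its (hypothetical) goal-directedness score."""
    def score(model):
        return task_loss(model) + penalty_weight * goal_directedness(model)
    return min(candidates, key=score)
```

As with complexity penalties, the open question is whether anything that survives such a filter stays competitive.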
Can you elaborate on this?
What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.
That’s one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set of worrying models in that context, with a hopefully clean definition.
But even if we measure quality-of-model in terms of expected utility, we can still have a problem: we’re bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world.
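To spell the worry out (with $D_{\text{train}}$, $D_{\text{real}}$, $U$, and the policy $\pi$ as purely illustrative names): selection only sees something like

$$\mathbb{E}_{x \sim D_{\text{train}}}\big[U(\pi(x))\big],$$

which can be high even when $\mathbb{E}_{x \sim D_{\text{real}}}\big[U(\pi(x))\big]$ is catastrophically low, because nothing ties $D_{\text{train}}$ to the distribution the system actually faces.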
Maybe irrelevant, but this makes me think of the problem of defining average-case complexity in complexity theory. You can prove things for some distributions over instances of the problem, but it’s really difficult to find a distribution that captures the instances you will meet in the real world. This means that you tend to be limited to worst-case reasoning.
One cool way to address that is through smoothed complexity: the complexity for an instance x is the expected complexity over the distribution on instances created by adding some Gaussian noise to x. I wonder if we can get some guarantees like that, which might improve over worst-case reasoning.
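For reference, a rough rendering of smoothed complexity in the spirit of Spielman and Teng’s smoothed analysis (glossing over normalization details): take the worst case over base instances of the expected cost under a small Gaussian perturbation,

$$C_{\text{smoothed}}(n, \sigma) \;=\; \max_{x \in \mathbb{R}^n} \; \mathbb{E}_{g \sim \mathcal{N}(0,\, \sigma^2 I_n)}\big[\,T(x + g)\,\big],$$

which sits between average-case analysis under the “true” distribution and pure worst-case reasoning.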
Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I’d be happy to hear it, but “systems performing complex tasks in complex environments have to pay that cost anyway” seems like a big problem for arguments of this kind. The question becomes where they put the complexity.
Agreed. I don’t have such a story, but I think this is a good reframing of the crux underlying this line of argument.
I meant time as a function of data (I’m not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you’d need O(X).
For whatever reason, I was thinking of complexity as depending on the size of the brain, which is really weird. As complexity depending on the size of the data, I guess this makes more sense? I’m not sure why piling on more data wouldn’t make the reliance on memory more difficult (so something like O(X^2) ?), but I don’t think it’s that important.
I agree that in principle we could decode the brain’s algorithms and say “actually, that’s quadratic time” or whatever; EG, quadratic-in-size-of-working-memory or something. This would tell us something about what it would mean to scale up human intelligence. But I don’t think this detracts from the concern about algorithms which are linear-time (and even constant-time) as a function of data. The concern is essentially that there’s nothing stopping such algorithms from being faithful-enough human models, which demonstrates that they could be mesa-optimizers.
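As a toy illustration of the kind of algorithm at issue (not a claim about how humans actually work): a fixed-size memory updated in constant time per observation, so each reaction is O(1), but reproducing the reaction after X observations means replaying all X of them, i.e. O(X) in the data.

```python
import numpy as np

class BoundedMemoryAgent:
    """Toy agent with a fixed-size memory: O(1) work per observation."""

    def __init__(self, memory_size=128, seed=0):
        rng = np.random.default_rng(seed)
        self.memory = np.zeros(memory_size)
        # Fixed mixing matrix standing in for whatever the update rule is.
        self.mix = rng.standard_normal((memory_size, memory_size)) / np.sqrt(memory_size)

    def observe(self, x):
        # Constant-time update: the memory depends on the whole history,
        # but only through this fixed-size state.
        self.memory = np.tanh(self.mix @ self.memory + x)

    def react(self):
        # Constant-time reaction read off the current memory.
        return float(self.memory.sum())

def simulate_after(history):
    # Reproducing the agent's reaction after X observations requires
    # replaying all X of them: O(X) total, even though each step is O(1).
    agent = BoundedMemoryAgent()
    for x in history:
        agent.observe(x)
    return agent.react()
```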
Agreed that this is a pretty strong argument that complexity doesn’t preclude mesa-optimizers.
I actually struggled with where to place this in the text. I wanted to discuss the double-edged-sword thing, but I didn’t find a place where it felt super appropriate.
Maybe in “Why this doesn’t seem to work” for pure computational complexity?
What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.
That’s one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set of worrying models in that context, with a hopefully clean definition.
Ah, on this point, I very much agree.
I’m not sure why piling on more data wouldn’t make the reliance on memory more difficult (so something like O(X^2) ?), but I don’t think it’s that important.
I was treating the brain as fixed in size, and so as having some upper bound on memory. Naturally this isn’t quite true in practice (for all we know, healthy million-year-olds might have measurably larger heads if they existed, due to slow brain growth), but either way this seems like a technicality.