I have fairly mixed feelings about this post. On one hand, I agree that it’s easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they’ll arise, otherwise the argument that ‘we can’t yet rule them out, so we should prioritise trying to rule them out’ is privileging the hypothesis.
Secondly, it seems like you’re heavily prioritising formal tools and methods for studying mesa-optimisation. But there are plenty of things that formal tools have not yet successfully analysed. For example, if I wanted to write a constitution for a new country, then formal methods would not be very useful; nor if I wanted to predict a given human’s behaviour, or understand psychology more generally. So what’s the positive case for studying mesa-optimisation in big neural networks using formal tools?
In particular, I’d say that the less we currently know about mesa-optimisation, the more we should focus on qualitative rather than quantitative understanding, since the latter needs to build on the former. And since we currently do know very little about mesa-optimisation, this seems like an important consideration.
I agree with much of this. I over-sold the “absence of negative story” story; of course there has to be some positive story in order to be worried in the first place. I guess a more nuanced version would be that I am pretty concerned about the broadest positive story, “mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn’t we expect to see them?”—and think more specific positive stories are mostly of illustrative value, rather than really pointing to gears that I expect to be important. (With the exception of John’s story, which did point to important gears.)
With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable. However, that’s not the sense the post conveyed overall, so I get it. I am concretely trying to convey pessimism about a specific sort of less-formal work: work which tries to block plausibility stories. Possibly you disagree about this kind of work.
WRT your argument for informal work, well, I agree in principle (trying to push toward more formal work myself has so far revealed challenges which I think more informal conceptual work could help with), but I’m nonetheless optimistic at the moment that we can define formal problems which won’t be a waste of time to work on. And out of informal work, what seems most interesting is whatever pushes toward formality.
Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn’t we expect to see them?
I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I’ll save for another time).
With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.
I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I’m still not sure what intuitions or arguments underlie your optimism about formal work, though. I assume that this would be fairly time-consuming to spell out in detail—but given that the core point of this post is to encourage such work, it seems worth at least gesturing towards those intuitions, so that it’s easier to tell where any disagreement lies.
To me, the post as written seems like enough to spell out my optimism… there are multiple directions for formal work which seem under-explored to me. Well, I suppose I didn’t focus on explaining why things seem under-explored. Hopefully the writeup-to-come will make that clear.