I think you have two main points here, which require two separate responses. I’ll take them in the opposite order from how you presented them.
Your second point, paraphrased: 90% of anything is crap, but that doesn’t mean there’s no progress. I’m totally on board with that. But in alignment today, it’s not just that 90% of the work is crap; it’s that the most memetically successful work is crap. It’s not the raw volume of crap that’s the issue so much as the memetic selection pressures.
Your first point, paraphrased: progress toward the hard problem does not necessarily immediately look like tackling the meat of the hard problem directly. I buy that to some extent, but there are plenty of cases where we can look at what people are doing and see pretty clearly that it is not progress toward the hard problem, whether direct or otherwise. And indeed, I would claim that prosaic alignment as a category is a case where people are not making progress on the hard problems, whether direct or otherwise. In particular, one relevant criterion to look at here is generalizability: is the work being done sufficiently general/robust that it will still be relevant once the rest of the problem is solved (and multiple things change in not-yet-predictable ways in order to solve the rest of the problem)? See e.g. this recent comment for an object-level example of what I mean.
in capabilities, the most memetically successful things were for a long time not the things that actually worked. people would turn up their noses at the idea of simply scaling up models because it wasn’t novel. the papers which are in retrospect the most important did not get much attention at the time (e.g. gpt2 was very unpopular among many academics; the Kaplan scaling laws paper went almost completely unnoticed when it came out; even the gpt3 paper flew under the radar when it first came out).
one example of a thing within prosaic alignment that i feel has a real shot at generalizability is interpretability. again, if we take the generalizability criterion and map it onto the capabilities analogy, it would be something like scalability—is this a first step towards something that can actually do truly general reasoning, or is it just a hack that will no longer be relevant once we discover the truly general algorithm that subsumes the hacks? if it is on the path, can we actually shovel enough compute into it (or its successor algorithms) to get to agi in practice, or do we just need way more compute than is practical? and i think at the time of gpt2 these were completely unsettled research questions! it was genuinely unclear whether writing articles about ovid’s unicorn was a real first step towards agi, or just some random amusement that would fade into irrelevancy. i think interp is in a similar position: it could work out really well and eventually become the thing that works, or it could just be a dead end.
If you’re thinking mainly about interp, then I basically agree with what you’ve been saying. I don’t usually think of interp as part of “prosaic alignment”; it’s quite different in terms of culture and mindset, and it’s much closer to what I imagine a non-streetlight-y field of alignment would look like. 90% of it is crap (usually in streetlight-y ways), but the memetic selection pressures don’t seem too bad.
If we had about 10x more time than it looks like we have, then I’d say the field of interp is plausibly on track to handle the core problems of alignment.
ok good that we agree interp might plausibly be on track. I don’t really care to argue about whether it should count as prosaic alignment or not. I’d further claim that the following (not exhaustive) are also plausibly good (I’ll sketch each out for the avoidance of doubt because sometimes people use these words subtly differently):
model organisms—trying to probe the minimal sets of assumptions needed to get various hypothesized spicy alignment failures seems good. what is the least spoonfed demonstration of deceptive alignment we can get that is mechanistically analogous to the real deal? to what extent can we observe early signs of the prerequisites in current models? which parts of the deceptive alignment arguments are most load-bearing?
science of generalization—in practice, why do NNs sometimes generalize and sometimes not? why do some models generalize better than others? in what ways are humans better or worse than NNs at generalizing? can we understand this more deeply without needing mechanistic understanding? (all closely related to ELK)
goodhart robustness—can you make reward models which are calibrated even under adversarial attack, so that when you optimize them really hard, you at least never catastrophically goodhart them? (a toy sketch of the failure mode i have in mind is below, after this list.)
scalable oversight (using humans, and possibly giving them a leg up with e.g. secret communication channels between them, and rotating in different humans when we need to simulate amnesia) - can we patch all of the problems with e.g. debate? can we extract higher-quality work out of real-life misaligned expert humans for practical purposes (even if it’s maybe a bit cost-uncompetitive)?
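to make the goodhart robustness item concrete, here’s a minimal toy sketch of the failure mode (all functional forms and numbers are invented purely for illustration, not taken from any real reward model): a proxy reward that tracks the true reward for typical actions but is miscalibrated in the tail, optimized via best-of-n selection. a little optimization helps; a lot of optimization keeps improving the proxy while making the true reward worse.

```python
# toy goodhart sketch (illustrative only; all functional forms are invented).
# the proxy reward roughly agrees with the true reward for typical actions
# but is miscalibrated in the tail, so harder optimization (bigger best-of-n)
# eventually makes the *true* reward worse while the proxy keeps going up.
import numpy as np

rng = np.random.default_rng(0)

def true_reward(x):
    # hypothetical ground-truth utility: rises, then falls for extreme actions
    return x - 0.05 * x**3

def proxy_reward(x):
    # hypothetical learned reward model: roughly right locally, wrong in the tail
    return x + 0.1 * rng.standard_normal(np.shape(x))

for n in (1, 10, 100, 1_000, 10_000):
    selected, true_scores = [], []
    for _ in range(200):
        candidates = rng.normal(0.0, 2.0, size=n)  # bigger n = more optimization pressure
        best = candidates[np.argmax(proxy_reward(candidates))]
        selected.append(best)
        true_scores.append(true_reward(best))
    print(f"best-of-{n:>6}: mean selected action {np.mean(selected):6.2f}, "
          f"mean true reward {np.mean(true_scores):6.2f}")
```

a reward model that was calibrated in the sense above would at least flag that its tail estimates are unreliable, which is the property this item is asking for.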
All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI). There are perhaps some versions of all four which could be useful, but those versions do not resemble any work I’ve ever heard of anyone actually doing in any of those categories.
That said, many of those do plausibly produce value as propaganda for the political cause of AI safety, especially insofar as they involve demoing scary behaviors.
EDIT-TO-ADD: Actually, I guess I do think the singular learning theorists are headed in a useful direction, and that does fall under your “science of generalization” category. Though most of the potential value of that thread is still in interp, not so much black-box calculation of RLCTs.
I think we would all be interested to hear you elaborate on why you think these approaches have approximately no value. Perhaps this will be in a follow-up post.
All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI).
It’s difficult for me to understand how this could be “basically useless in practice” for:
scalable oversight (using humans, and possibly giving them a leg up with e.g. secret communication channels between them, and rotating in different humans when we need to simulate amnesia) - can we patch all of the problems with e.g. debate? can we extract higher-quality work out of real-life misaligned expert humans for practical purposes (even if it’s maybe a bit cost-uncompetitive)?
It seems to me you’d want to understand and strongly show how and why different approaches here fail, and in any world where you have something like “outsourcing alignment research” you want some form of oversight.
Thanks for the list! I have two questions:
1: Can you explain how generalization of NNs relates to ELK? I can see how it could help with ELK (if you know a reporter generalizes, you can train it on labeled situations and apply it more broadly; roughly the setup sketched below, after these two questions) or make ELK unnecessary (if weak-to-strong generalization works perfectly and we never need to understand complex scenarios). But I’m not sure if that’s what you mean.
2: How is goodhart robustness relevant? Most models today don’t seem to use reward functions in deployment, and in training the researchers can control how hard they optimize against these functions, so I don’t understand why the reward models necessarily need to be robust under strong optimization.
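To make question 1 concrete, here is the kind of minimal “reporter” setup I have in mind (synthetic data and a linear probe, purely illustrative; none of the names or numbers come from an actual ELK implementation): train the reporter on situations where we can label the truth ourselves, then check whether it keeps working as the activation distribution shifts.

```python
# toy reporter-generalization sketch (all data synthetic, for illustration only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 32  # pretend activation dimension

# pretend there is a fixed "truth direction" in activation space
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

def make_data(n, shift):
    # shift > 0 moves activations off the distribution the reporter was trained on
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, d)) + shift * rng.normal(size=d)
    acts += 1.5 * np.outer(2.0 * labels - 1.0, truth_dir)  # truth stays linearly readable
    return acts, labels

# "easy" labeled situations: ones where we can check the answers ourselves
X_easy, y_easy = make_data(2_000, shift=0.0)
reporter = LogisticRegression(max_iter=1_000).fit(X_easy, y_easy)

# "hard" situations: same underlying truth direction, shifted distribution
for shift in (0.0, 1.0, 3.0):
    X_hard, y_hard = make_data(2_000, shift=shift)
    print(f"shift={shift}: reporter accuracy {reporter.score(X_hard, y_hard):.2f}")
```

Whether the real analogue of that last loop holds up is, as I understand it, the “science of generalization” question; I mainly want to check that this is the connection to ELK you had in mind.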