This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we’ll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post.
Yep. This post is not for me but I’ll say a thing that annoyed me anyway:
… and Carol’s thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.
Does this actually happen? (Even if you want to be maximally cynical, I claim presenting novel important difficulties (e.g. “sensor tampering”) or giving novel arguments that problems are difficult is socially rewarded.)
Yes, absolutely. Five years ago, people were more honest about it, saying ~explicitly and out loud “ah, the real problems are too difficult; and I must eat and have friends; so I will work on something else, and see if I can get funding on the basis that it’s vaguely related to AI and safety”.
To the extent that anecdata is meaningful:

I have met somewhere between 100 and 200 AI Safety people in the past ~2 years; people for whom AI Safety is their ‘main thing’.
The vast majority of them are doing tractable/legible/comfortable things. Most are surprisingly naive, with less awareness of the space than I have (and I’m just a generalist lurker who finds this stuff interesting, not actively working on the problem).
Few are actually staring into the void of the hard problems; where hard here is loosely defined as ‘unknown unknowns, here be dragons, where do I even start’.
Fewer still progress from staring into the void to actually trying things.
I think some amount of this is natural and to be expected; even in an ideal world we’d probably still have a similar breakdown—a majority who aren’t contributing (yet)[1] and a minority who are—and I think the difference is more in the size of those groups.
I think it’s reasonable to aim for a larger, higher-quality minority; I think it’s tractable to achieve progress through mindfully shaping the funding landscape.
I think it’s worth mentioning that all newbies start out useless, and not all newbies remain newbies. Some portion of the majority are actually people who will progress to being useful after they’ve gained experience and wisdom.
it’s tractable to achieve progress through mindfully shaping the funding landscape
This isn’t clear to me, where the crux (though maybe it shouldn’t be) is “is it feasible for any substantial funders to distinguish actually-trying research from the rest”.
Yeah, I agree sometimes people decide to work on problems largely because they’re tractable [edit: or because they’re good for safely getting alignment research or other good work out of early AGIs]. I’m unconvinced of the flinching-away or dishonesty characterization.
Do you think that funders are aware that >90% [citation needed!] of the money they give to people, to do work described as helping with “how to make world-as-we-know-it ending AGI without it killing everyone”, is going to people who don’t even themselves seriously claim to be doing research that would plausibly help with that goal? If they are aware of that, why would they do that? If they aren’t aware of it, don’t you think that it should at least be among your very top hypotheses, that those researchers are behaving materially deceptively, one way or another, call it what you will?
I do not.

On the contrary, I think ~all of the “alignment researchers” I know claim to be working on the big problem, and I think ~90% of them are indeed doing work that looks good in terms of the big problem. (Researchers I don’t know are likely substantially worse, but not by a ton.)
In particular I think all of the alignment-orgs-I’m-socially-close-to do work that looks good in terms of the big problem: Redwood, METR, ARC. And I think the other well-known orgs are also good.
This doesn’t feel odd: these people are smart and actually care about the big problem; if their work was in the “even if this succeeds it obviously wouldn’t be helpful” category they’d want to know (and, given the “obviously,” would figure that out).
Possibly the situation is very different in academia or MATS-land; for now I’m just talking about the people around me.
I wonder whether John believes that well-liked research, e.g. Fabien’s list, is actually not valuable, or consists of rare exceptions coming from a small subset of the “alignment research” field.

I strongly suspect he thinks most of it is not valuable.
I feel like John’s view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I’m pretty sure that’s false.) I assume John doesn’t believe that, and I wonder why he doesn’t think his view entails it.
From the post:

… but crucially, the details of the rationalizations aren’t that relevant to this post. Someone who’s flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they’ll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.
Yeah. I agree/concede that you can explain why you can’t convince people that their own work is useless. But if you’re positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.
The flinches aren’t structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.
As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it’s impossible to break out / trick you / know about the world / whatever. Then you convince them otherwise—that the AI can do RSI internally, and superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, clamping “can an AI be very superhumanly capable” to “no”. That clamping causes them to also not see the flaws in the plan “we’ll deploy our AIs in a staged manner, see how they behave, and then recall them if they behave poorly”, because they don’t think RSI is feasible, they don’t think extreme persuasion is feasible, etc.
A more real example is, say, people thinking about “structures for decision making”, e.g. constitutions. You explain that these things are not reflectively stable. And now this person can’t understand reflective stability in general, so they don’t understand why steering vectors won’t work, or why lesioning won’t work, etc.
Another real but perhaps more controversial example: {detecting deception, retargeting the search, CoT monitoring, lesioning bad thoughts, basically anything using RL} all fail because creativity starts with illegible concomitants to legible reasoning.

(This post seems to be somewhat illegible, but if anyone wants to see more real examples of aspects of mind that people fail to remember, see https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html)