Thanks for writing this post—I appreciate the candidness about your beliefs here, and I agree that this is a tricky topic. I, too, feel unsettled about it on the object level.
On the meta level, though, I feel grumpy about some of the framing choices. There’s this wording which both you and the original ARC evals post use: that responsible scaling policies are a “robustly good compromise,” or, in ARC’s case, that they are a “pragmatic middle ground.” I think these stances take for granted that the best path forward is compromising, but this seems very far from clear to me.
Like, certainly not all cases of “people have different beliefs and preferences” are ones where compromise is the best solution. If someone wants to kill me, I’m not going to be open to negotiating about how many limbs I’m okay with them taking. This is obviously an extreme example, but I actually don’t think it’s that far off from the situation we find ourselves in now where, e.g., Dario gives a 10-25% probability that the sort of technology he is advancing will either cause massive catastrophe or end the human race. When people are telling me that their work has a high chance of killing me, it doesn’t feel obvious that the right move is “compromising” or “finding a middle ground.”
The language choices here feel sketchy to me in the same way that the use of the term “responsible” feels sketchy to me. I certainly wouldn’t call the choice to continue building the unsettling-chance-of-annihilation machine responsible. Perhaps it’s more responsible than the default, but that’s a different claim and not one that is communicated in the name. Similarly, “compromise” and “middle ground” are the kinds of phrases that seem reasonable from a distance, but if you look closer they’re sort of implicitly requesting that we treat “keep racing ahead to our likely demise” as a sensible option.
“Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me.”
This seems to me to misrepresent the argument. At the very least, it misrepresents mine. It’s not that I’m fighting an improvement to the status quo, it’s that I don’t think responsible scaling policies are an improvement if they end up being confused for sufficient progress.
Like, in worlds where alignment is hard and evals do not identify the behavior which is actually scary, I claim that the existence of such evals is concerning. It’s concerning because I suspect that capabilities labs are more incentivized to check off the “passed this eval” box than they are to ensure that their systems are actually safe. And in the absence of a robust science of alignment, I claim that this most likely results in capabilities labs goodharting on evals which are imperfect proxies for what we care about, making systems look safer than they are. This does not seem like an improvement to me. I want the ability to say what’s actually true, here: that we do not know what’s going on, and that we’re building a godlike power anyway.
And I’m not saying that this is the only way responsible scaling policies could work out, or that it would necessarily be intentional, or that nobody in capabilities labs takes the risk seriously. But it seems like a blind spot to neglect the existence of the incentive landscape here, one which is almost certainly affecting the policies that capabilities labs establish.
Thanks for the thoughts!

I don’t think the communications you’re referring to “take for granted that the best path forward is compromising.” I would simply say that they point out the compromise aspect as a positive consideration, which seems fair to me—“X is a compromise” does seem like a point in favor of X all else equal (implying that it can unite a broader tent), though not a dispositive point.
I address the point about improvements on the status quo in my response to Akash above.