I hear you as saying “If we don’t have to worry about teaching the AI to use human values, then why do sandwiching when we can measure capabilities more directly some other way?”
One reason is that with sandwiching, you can more rapidly measure capabilities generalization, because you can do things like collect the test set ahead of time or supervise with a special-purpose AI.
But if you want the best evaluation of a research assistant’s capabilities, I agree that using it as a research assistant is more reliable.
A separate issue I have here is the assumption that you don’t have to worry about teaching an AI to make human-friendly decisions if you’re using it as a research assistant, and therefore we can go full speed ahead trying to make general-purpose AI as long as we mean to use it as a research assistant. A big “trust us, we’re the good guys” vibe.
Relative to string theory, getting an AI to help us do AI alignment is much more reliant on teaching the AI to give good suggestions in the first place—and not merely “good” in the sense of highly rated, but good in the contains-hard-parts-of-outer-alignment kind of way. So I disagree with the assumption in the first place.
And then I also disagree with the conclusion. Technology proliferates, and there are misuse opportunities even within an organization that’s 99% “good guys.” But maybe this is a strategic disagreement more than a factual one.
I don’t think the evaluations we’re describing here are about measuring capabilities. They’re more about measuring whether our oversight (and other aspects) suffices for avoiding misalignment failures.
Measuring capabilities should be easy.
Yeah, I don’t know where my reading comprehension skills were that evening, but they weren’t with me. :P
Oh well, I’ll just leave it as is as a monument to bad comments.