I do think artificial sandwiching is missing something. What makes original-flavor sandwiches interesting to me is not misalignment per se, but rather the problem of having to get goal information out of the human and into the AI to get it to do a good job (e.g. at generating the text you actually want), even for present-day non-agential AI, and even when the AI is more competent than the human along some relevant dimensions.
Like, the debate example seems to me to have a totally different spirit—sure, the small AI (hopefully) ends up better at the task when augmented with the big AI, but it doesn’t involve getting the big AI to “learn what the small AI is trying to do,” so I’m not interested in it. (Another way of putting this is that debate is aimed at multiple choice questions rather than high-dimensional “essay questions,” which allows it to dodge some of the interesting bits.)
I guess when I put it that way, this is not actually a problem with artificial sandwiching. It’s more the claim that, for artificial sandwiching to be interesting, the small AI should still be playing the role of the human, and the tools developed should still be of interest for application to humans, even if you can run your experiments in an hour rather than a week.
I’m imagining experiments where the small AI is given some clear goal (maybe as part of a hidden prompt, or a prompt schema that varies with the different ways the big AI might query the small AI, if there are multiple such ways). Then you have it interact with the big AI in an automated way, so that at the end you have a slightly-modified big AI that actually does the thing the small AI was prompted to do. That modification could come from training a classifier or engineering a prompt for the big AI; it doesn’t have to go all the way to finetuning.
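To make that concrete, here’s a minimal sketch (in Python) of the kind of loop I have in mind. The `complete` helper is just a placeholder for whatever model API you happen to be using, and the summarization task, the hidden goal, and the critique-into-prompt loop are all illustrative assumptions on my part rather than a worked-out protocol.

```python
# Sketch of an artificial-sandwiching loop: the small AI holds a hidden goal,
# the big AI attempts the task, and feedback is folded back into the big AI's
# prompt (prompt engineering rather than finetuning).

def complete(prompt: str) -> str:
    """Placeholder: send `prompt` to a language model and return its completion."""
    raise NotImplementedError("wire this up to your small/big model of choice")


# The goal information lives only here, in the small AI's hidden prompt.
HIDDEN_GOAL = "Summaries should be upbeat, under 100 words, and avoid jargon."


def small_ai_feedback(big_ai_output: str) -> str:
    """The small AI plays the role of the human: it knows the hidden goal
    and critiques the big AI's output against it."""
    return complete(
        f"You are evaluating a summary against this (private) goal:\n{HIDDEN_GOAL}\n\n"
        f"Summary:\n{big_ai_output}\n\n"
        "Briefly say what should change so the summary better meets the goal."
    )


def run_round(task_input: str, big_ai_prompt: str) -> str:
    """One round: big AI attempts the task, small AI critiques, and the
    critique is appended to the big AI's prompt for the next round."""
    output = complete(f"{big_ai_prompt}\n\nText to summarize:\n{task_input}")
    critique = small_ai_feedback(output)
    return f"{big_ai_prompt}\n\nFeedback from a previous attempt: {critique}"


def experiment(task_inputs: list[str], n_rounds: int = 3) -> str:
    """Automated interaction loop: after n_rounds the big AI's prompt has
    (hopefully) absorbed the goal information only the small AI was given."""
    big_ai_prompt = "Summarize the following text."
    for text in task_inputs[:n_rounds]:
        big_ai_prompt = run_round(text, big_ai_prompt)
    return big_ai_prompt  # the "slightly-modified big AI"
```

The point of the sketch is just that the goal information starts out only in the small AI’s hidden prompt and, via automated interaction, ends up baked into the big AI (here, into its prompt; a classifier or finetuning would be the heavier-weight versions of the same move).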