Yeah, you’re definitely pointing at an important way the framing is awkward. I think the real thing I want to say is “Try to use some humans to align a model in a domain where the model is better than the humans at the task”, and it’d be nice to have a catchy term for that. Probably a model which is better than some humans (e.g. MTurkers) at one task (e.g. medical advice) will also be better than those same humans at many other tasks (e.g. writing horror stories); but at the same time for each task, there’s some set of humans (e.g. doctors in the first case and horror authors in the second) where the model does worse.
I don’t want to just call it “align superhuman AI today” because people will be like “What? We don’t have that”, but at the same time I don’t want to drop “superhuman” from the name because that’s the main reason it feels like “practicing what we eventually want to do.” I considered “partially superhuman”, but “narrowly” won out.
I’m definitely in the market for a better term here.
I don’t want to drop “superhuman” from the name because that’s the main reason it feels like “practicing what we eventually want to do.”
One response I generated was, “maybe it’s just not so much about practicing what we eventually want to do, and that part is an illusion of the poor framing. We should figure out the right framing first and then ask whether it seems like practice, not optimize the framing to make it sound like practice.”
But I think my real response is: why is the superhuman part important here? Maybe what’s really important is being able to get answers out (e.g. medical advice) without putting them in (e.g. without fine-tuning on medical advice filtered for high quality), and asking for superhuman ability is just a way of helping ensure that? Or perhaps more generally, there are other things like this which you’d expect people to get wrong if they’re not dealing with a superhuman case, given that you want the technology to eventually work for superhuman cases.
In my head the point of this proposal is very much about practicing what we eventually want to do, and seeing what comes out of that; I wasn’t trying here to make something different sound like it’s about practice. I don’t think that a framing which moved away from that would better get at the point I was making, though I totally think there could be other lines of empirical research under other framings that I’d be similarly excited about or maybe more excited about.
In my mind, the “better than evaluators” part is kind of self-evidently intriguing for the basic reason I said in the post (it’s not obvious how to do it, and it’s analogous to the broad, outside view conception of the long-run challenge which can be described in one sentence/phrase and isn’t strongly tied to a particular theoretical framing):
I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques.
A lot of people responding to the draft were pushing in the direction I think you were maybe gesturing at (?) -- making this more specific to “knowing everything the model knows” or “ascription universality”; the section “Why not focus on testing a long-term solution?” was written in response to Evan Hubinger and others. I think I’m still not convinced that’s the right way to go.
I might be on board if “narrowly superhuman” were simply defined differently.
“Try to use some humans to align a model in a domain where the model is better than the humans at the task”
Isn’t it something more like “the model has information sufficient to do better”? E.g., in the GPT example, you can’t reliably get good medical advice from it right now, but you strongly suspect it’s possible. That’s a key feature of the whole idea, right?
Is your suggested research program better described as: find (highly capable) models with inaccessible information and get them to reveal that information? (Especially: get them to reveal the inaccessible information without using domain expertise to do so?)
I don’t feel confident enough in the frame of “inaccessible information” to say that the whole agenda is about it. It feels like a fit for “advice”, but not a fit for “writing stories” or “solving programming puzzles” (at least not an intuitive fit—you could frame it as “the model has inaccessible information about [story-writing, programming]” but it feels more awkward to me). I do agree it’s about “strongly suspecting it has the potential to do better than humans” rather than about “already being better than humans.” Basically, it’s about trying to find areas where lackluster performance seems to mostly be about “misalignment” rather than “capabilities” (recognizing those are both fuzzy terms).
Basically, it’s about trying to find areas where lackluster performance seems to mostly be about “misalignment” rather than “capabilities” (recognizing those are both fuzzy terms).
Right, ok, I like that framing better (it obviously fits, but I didn’t generate it as a description before).