OpenAI’s focus with these kinds of augmentations is very much on “fixing bugs” in how GPT behaves: keeping GPT on task, preventing GPT from making obvious mistakes, and stopping GPT from producing controversial or objectionable content. Notice that these are all things GPT is very poorly suited for, but humans find quite easy (when they want to). OpenAI is forced to do these things because, as a public-facing company, they have to avoid disastrous headlines like “Racist AI writes manifesto denying Holocaust.”[7]
As alignment researchers, we don’t need to worry about any of that! The goal is to solve alignment, so we don’t have to be constrained like this in how we use language models. We don’t need to try to “align” language models by adding some RLHF; we need to use language models to help us actually solve alignment at its core, and as such we are free to explore a much wider space of possible strategies for using GPT to speed up our research.[8]
I think this section is wrong because it’s looking through the wrong frame. It’s true that cyborgs won’t have to care about PR disasters in their augments, but it feels like this section is missing the fact that there are actual issues behind the PR problem. Consider a model that will produce the statement “Men are doctors; women are nurses” (when not explicitly playing an appropriate role); this indicates something wrong with its epistemics, and I’d have some concern about using such a model.
Put differently, a lot of the research on fixing algorithmic bias can be viewed as trying to repair known epistemic issues in the model (which are usually inherited from the training data); this seems actively desirable for the use case described here.