The language model is just predicting text. If the model thinks an author is stupid (as evidenced by a stupid prompt) then it will predict stupid content as the followup.
To imagine that it is trying to solve the task of “reasoning without failure” is to project our contextualized common sense onto software built for a different purpose than reasoning without failure.
This is what unaligned software does by default: exactly what its construction and design cause it to do, whether or not the constructive causes constrain the software’s behavior to be helpful for a particular use case that seems obvious to us.
The scary thing is that I haven’t seen GPT-3 ever fail to give a really good answer (in its top 10 answers, anyway) when a human puts non-trivial effort into giving it a prompt that actually seems smart, and whose natural extension would also be smart.
This implies to me that the full engine is very good at assessing the level of the text that it is analyzing, and has a (justifiably?) bad opinion of the typical human author. So its cleverness encompasses all the bad thinking… while also containing highly advanced capacities that only are called upon to predict continuations for maybe 1 in 100,000 prompts.
This broadly makes sense to me. There are many cases where “the model is pretending to be dumb” feels like an apt description.
This is part of why building evaluations and benchmarks for this sort of thing is difficult.
I’m at least somewhat optimistic about doing things like data-prefixing to allow for controls over things like “play dumb for the joke” vs “give the best answer”, using techniques that build on human feedback.
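To make the data-prefixing idea concrete, here is a minimal sketch, assuming a CTRL-style control-token setup; the tag names, helper functions, and toy examples are invented for illustration and are not anything GPT-3 or any specific system actually uses:

```python
# Hypothetical sketch of data-prefixing for behaviour control.
# The control tags and toy dataset are made up for illustration; real systems
# (e.g. CTRL-style control codes or RLHF pipelines) differ in many details.

PLAY_DUMB = "<|mode:play-dumb|>"
BEST_ANSWER = "<|mode:best-answer|>"

def tag_example(mode_tag: str, prompt: str, completion: str) -> str:
    """Prepend a control tag so the model learns to associate it with a style."""
    return f"{mode_tag}\n{prompt}\n{completion}"

# Training data: the same prompt paired with different completions,
# each labelled with the behaviour the tag is meant to select.
training_examples = [
    tag_example(PLAY_DUMB, "Q: What is 7 * 8?", "A: Uh, a lot? Fifty-something?"),
    tag_example(BEST_ANSWER, "Q: What is 7 * 8?", "A: 56."),
]

# At inference time, prepend the tag for the behaviour you actually want.
def build_prompt(mode_tag: str, user_prompt: str) -> str:
    return f"{mode_tag}\n{user_prompt}\n"

print(build_prompt(BEST_ANSWER, "Q: What causes tides?"))
```

The point of the sketch is just that the “play dumb for the joke” vs “give the best answer” choice becomes an explicit conditioning signal rather than something the model has to infer from prompt quality.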
I personally have totally seen GPT-3 fail to give a really good answer across a bunch of tries, many times over, but I spend a lot of time looking at its outputs and analyzing them. It seems important to be wary of the “seems to be dumb” failure modes.
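One concrete way to check for the “seems to be dumb” failure mode, rather than judging the model on a single completion, is to draw several continuations and inspect them by hand, in the spirit of the “top 10 answers” remark above. A minimal sketch with the Hugging Face transformers library, using GPT-2 as a stand-in since GPT-3 itself is only reachable through OpenAI’s API; the prompt and sampling parameters are illustrative, not values from this discussion:

```python
# Sketch: sample several continuations and inspect them manually, instead of
# judging the model on one completion. GPT-2 stands in for GPT-3 here, since
# GPT-3 is API-only; the sampling settings are illustrative defaults.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: Why does ice float on water?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

# Draw 10 samples so a single weak continuation isn't mistaken for the
# model's best effort.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    max_new_tokens=40,
    num_return_sequences=10,
    pad_token_id=tokenizer.eos_token_id,
)

for i, sample in enumerate(outputs):
    text = tokenizer.decode(sample, skip_special_tokens=True)
    print(f"--- sample {i} ---\n{text}\n")
```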