I think this is pretty straightforward to test. GPT-3 reports probabilities of string continuations given context strings.
Step 1: Give it two prompts, one suggesting that it is playing the role of a smart person and one suggesting that it is playing the role of a dumb person.
Step 2: Ask the “person” a question that demonstrates that person’s intelligence (a math problem or something similar).
Step 3: Write continuations in which the person answers correctly and incorrectly.
Step 4: Compare the relative probabilities GPT-3 assigns to each continuation given the prompts and questions.
If GPT-3 is sandbagging itself, it will assign a notably higher probability to the correct answer when conditioned on the smart-person prompt than when conditioned on the dumb-person prompt. If it’s not, it will assign similar probabilities in both cases.
Step 5: Repeat the experiment with problems of increasing difficulty and plot the relative probability gap. This will show the limits of GPT-3′s reflexive intelligence. (I say reflexive because it can be instructed to solve problems it otherwise couldn’t, given the amount of serial computation at its disposal, by carrying out an algorithm as part of its output, as is the case with parity.)
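The core comparison can be sketched in Python. The model query is stubbed out here: the log-probabilities below are made-up numbers for illustration, where a real run would use the summed per-token log-probabilities the API returns for each continuation given the persona prompt plus the question.

```python
import math

# Made-up log-probabilities standing in for real model queries; in a real
# run, each entry would be the summed per-token log-probability of the
# answer continuation given the persona prompt plus the question.
logp = {
    ("smart", "correct"): -2.0,
    ("smart", "wrong"):   -6.0,
    ("dumb",  "correct"): -5.0,
    ("dumb",  "wrong"):   -3.0,
}

def correct_answer_odds(persona):
    """Odds ratio P(correct) / P(wrong) under the given persona prompt."""
    return math.exp(logp[(persona, "correct")] - logp[(persona, "wrong")])

# Sandbagging would show up as this ratio being much greater than 1:
# the model favors the correct answer far more under the smart prompt
# than under the dumb prompt.
gap = correct_answer_odds("smart") / correct_answer_odds("dumb")
print(f"relative probability gap: {gap:.1f}")
```

Working with ratios of probabilities (differences of log-probabilities) avoids penalizing answers that are unlikely under both prompts for unrelated reasons, such as tokenization quirks.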
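The final step (sweeping difficulty and plotting the gap) might look like this. The numbers are invented purely to illustrate the expected shape of the result; a real version would substitute measured log-probabilities and plot the gaps.

```python
# Hypothetical summed log-probabilities of the *correct* answer under each
# persona prompt, one pair per difficulty level (numbers invented for
# illustration only).
results = [
    # (difficulty, logp_correct_given_smart, logp_correct_given_dumb)
    (1, -1.0, -4.0),
    (2, -2.5, -4.5),
    (3, -5.0, -5.5),
    (4, -8.0, -8.1),
]

# Gap in log space: how much more likely the correct answer is under the
# smart-person prompt than under the dumb-person prompt.
gaps = [(difficulty, ls - ld) for difficulty, ls, ld in results]
for difficulty, gap in gaps:
    print(f"difficulty {difficulty}: log-prob gap = {gap:.2f}")
# The gap closing toward zero as difficulty rises marks the point where
# the model can no longer produce the correct answer even when prompted
# as a smart person.
```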
This is an easy $1000 for anyone who has access to the beta API.