E.g., it wouldn’t surprise us at all if GPT-4 had learned to predict “27 * 18” but not “what is the area of a rectangle 27 meters by 18 meters”… is what I’d like to say, but Codex sure did demonstrate that those two are awfully proximal.
GPT-3 Instruct is a version of GPT-3 fine-tuned to follow instructions in a way that its reward model predicts humans would rate highly. It answers both versions of the question correctly when its prompt includes this single manually cherry-picked primer:
Q: what is the volume of a cube with side length 8 meters
A: 512 meters cubed
To reduce the selective power of cherry-picking from a small number of prompts, I also tried the randomly selected problem 15 * 58 = 870, which likewise gave the correct answer in both cases.
Without any primer, for the second question GPT-3 Instruct generates “The area of the rectangle is 324 meters.” (NB: 324 = 18²). This is incorrect in both the number and the unit. For the number, [_324] has logit probability 24%, and [_4][86] is the second choice with logit probability 22% · 99.4%. For the unit, conditioning on the number being correct, [.] has logit probability 50%, and [_squared][.] is the second choice with logit probability 48% · 99.7%.
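The “22% · 99.4%” notation above is just the usual bookkeeping for multi-token completions: the probability of the sequence is the product of each token’s conditional probability. A minimal sketch of that arithmetic, using the specific numbers quoted above:

```python
# A multi-token completion's probability is the product of its tokens'
# conditional probabilities (as reported by per-token logprobs).

# Runner-up for the number: ["_4"]["86"] -> "486", the correct answer.
p_486 = 0.22 * 0.994

# Runner-up for the unit (given the number is correct):
# ["_squared"]["."] -> "... meters squared."
p_squared = 0.48 * 0.997

print(f"P(' 486')       = {p_486:.4f}")    # ~0.2187, vs 0.24 for ' 324'
print(f"P(' squared.')  = {p_squared:.4f}") # ~0.4786, vs 0.50 for '.'
```

So in both cases the correct continuation is a close second choice, a few points behind the sampled one.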
Original GPT-3 does not get either version of the question correct, nor does it come as close, though I did not try stronger prompting.