I don’t think you’re doing anything different from what OpenAI is doing. The Eliezer prompt might be slightly better at eliciting model capabilities than whatever fine-tuning they did, but, as other people have pointed out, it’s also way more conservative and probably hurts performance overall.
(If the point is to prevent the AI from outputting anything misaligned, being conservative is probably the point, and lowering performance seems more than acceptable.)
Yes, but OpenAI could have just done that by adjusting their classification threshold.
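For concreteness, here's a toy sketch of what "adjusting the classification threshold" would mean: a moderation classifier scores each output, and blocking happens when the score crosses a threshold, so lowering that threshold makes the same filter more conservative without changing the model at all. (All names and numbers here are hypothetical, not OpenAI's actual system.)

```python
def flag(scores, threshold):
    """Return which outputs a hypothetical misalignment classifier would block."""
    return [s >= threshold for s in scores]

# Made-up "unsafe" probabilities for four candidate outputs.
scores = [0.10, 0.45, 0.70, 0.95]

print(flag(scores, 0.9))  # permissive: [False, False, False, True]
print(flag(scores, 0.4))  # conservative: [False, True, True, True]
```

The same classifier blocks more outputs at the lower threshold, trading false positives (and hence performance) for fewer misaligned outputs slipping through.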
Isn’t that only the case if their filter was the same but weaker?