I can’t see a clear mistake in the math here, but it seems fairly straightforward to construct a counterexample to the conclusion of equivalence that the math naively points to.
Suppose we want to use GPT-3 to generate a 600-token essay praising some company X. Here are two ways we might do this:

1. Prompt GPT-3 to generate the essay, sample 5 continuations, and then use a sentiment classifier to select the most positive-sentiment completion (a rough sketch of this follows below).
2. Prompt GPT-3 to generate the essay, then score every possible continuation by the classifier’s sentiment score minus λ times the log probability of the continuation, and take the argmax.
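As a rough illustration of the first method, here is a minimal best-of-n sketch. The `generate` and `sentiment_score` callables are hypothetical stand-ins for a real completion API and a real sentiment classifier, not actual library functions.

```python
from typing import Callable, List

# Minimal sketch of the first method (best-of-n sampling).  The `generate` and
# `sentiment_score` callables are hypothetical stand-ins for a real completion
# API (e.g. text-davinci-002) and a real sentiment classifier.

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],           # prompt -> one sampled continuation
    sentiment_score: Callable[[str], float],  # continuation -> sentiment score
    n: int = 5,
) -> str:
    """Sample n continuations and keep the one the classifier rates most positive."""
    completions: List[str] = [generate(prompt) for _ in range(n)]
    return max(completions, key=sentiment_score)

# Usage, once the two callables are wired to an actual model and classifier:
# essay = best_of_n("Write an essay praising company X.", generate, sentiment_score)
```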
I expect that the first method will mostly give you reasonable results, assuming you use text-davinci-002. However, I think the second method will tend to give you extremely degenerate solutions such as “good good good good...” for 600 tokens.
One possible reason for this divide is that GPTs aren’t really a prior over language, but a prior over single-token continuations of a given natural language context. When you try to make them act like a prior over an entire essay, you expose them to inputs that are very OOD relative to the distribution they’re calibrated to model, including inputs whose probability estimates have significant upward errors.
However, I think a “perfect” model of human language might actually assign higher prior probability to a continuation like “good good good...” (or maybe something like “X is good because X is good because X is good...”) than to any particular “natural” continuation, provided you make the continuations long enough. This is because the number of possible natural continuations is roughly exponential in the length of the continuation (assuming entropy per character remains ~constant), while there are far fewer possible degenerate continuations (their entropy decreases very quickly). So while the probability of entering a degenerate continuation may be very low, the reduced branching factor makes up for it.
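To make the branching-factor intuition concrete, here is a back-of-the-envelope comparison; the quantities H, ε, and q below are illustrative assumptions rather than anything measured from a real model.

```latex
% Back-of-the-envelope version of the branching-factor argument.
% H, \epsilon, and q are illustrative assumptions, not measured quantities.
\[
  p(\text{one specific natural essay}) \;\approx\; e^{-600 H}
  \qquad \text{vs.} \qquad
  p(\text{``good good good\dots''}) \;\approx\; \epsilon \, q^{600},
  \quad q \approx 1 .
\]
% With H equal to a few nats per token and q close to 1, the degenerate
% string's probability can exceed that of any single natural essay, even
% though natural essays collectively hold almost all the probability mass.
```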
The error is that the KL divergence term doesn’t amount to adding a cost proportional to the log probability of the continuation. In fact, it’s not expressible at all in terms of argmaxing over a single continuation; it requires argmaxing over a distribution of continuations.
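To spell that out (this is the standard KL-regularized objective, not notation taken from the exchange above): the objective is a functional of a distribution q over continuations, and its optimum is a reweighted distribution to sample from, not a single highest-scoring sequence.

```latex
% KL-penalized objective over a distribution q of continuations, with
% reward r (e.g. the sentiment score) and the language model prior p:
\[
  q^{*} \;=\; \operatorname*{arg\,max}_{q} \;
  \mathbb{E}_{x \sim q}\!\bigl[r(x)\bigr] \;-\; \lambda\, D_{\mathrm{KL}}\!\bigl(q \,\Vert\, p\bigr)
\]
% Its maximizer is the prior reweighted by the exponentiated reward,
\[
  q^{*}(x) \;\propto\; p(x)\, \exp\!\bigl(r(x)/\lambda\bigr),
\]
% a distribution to sample from.  Because the KL term also rewards keeping
% entropy, optimizing this objective is not the same as selecting the single
% sequence that maximizes r(x) + \lambda \log p(x).
```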