Benchmarking on static datasets of ordinary tasks (typically not even adversarially collected in the first place) may not be a good way to extrapolate to differences in level of abuse for PR-sensitive actors like megacorps, especially for abusers attacking the retrieval functionality (as Sydney users were explicitly trying to populate Bing hits to steer Sydney), a functionality not involved in said benchmarking at all. Or to put it another way, the fact that text-davinci-003 does only a little better than text-davinci-002 in terms of accuracy % may tell you little about how profitable in $ each will be once 4chan & the coomers get their hands on it… It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.
Yeah, this is basically my point. Not sure whether you are agreeing or disagreeing. I was specifically quoting Paul’s comment saying “I’ve seen only modest qualitative differences” in order to disagree and say “I think we’ve now seen substantial qualitative differences”.
We have had 4chan play around with ChatGPT for a while, with much less disastrous results than what happened when they got access to Sydney.
It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.
I wish it were true that this is not news to anyone here, but it does not currently seem true to me. But that doesn’t seem worth going into.
I was elaborating in more ML-y jargon, and also highlighting that there are a lot of wildcards omitted from Paul’s comparison: retrieval especially was an interesting dynamic.
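To make the average-case vs. worst-case gap above concrete, here is a toy simulation (all numbers are hypothetical, not measurements of any real model): two model stubs land within a couple of points of each other on an i.i.d. benchmark, yet differ enormously in how many reliable exploits a small adversarial search turns up.

```python
# Toy simulation: similar average-case benchmark accuracy, very different
# worst-case behavior once an adversary searches for reliable failures.
# All parameters are made up for illustration.
import random

random.seed(0)

N_INPUTS = 10_000          # toy "input space"
ADVERSARY_BUDGET = 200     # prompts the attacker can afford to probe

def make_model(avg_accuracy, exploitable_fraction):
    """Model stub: ~avg_accuracy on ordinary inputs, but a slice of input
    space (think retrieval poisoning / jailbreak prompts) always fails."""
    exploitable = set(random.sample(range(N_INPUTS),
                                    int(exploitable_fraction * N_INPUTS)))
    def model(x):
        if x in exploitable:
            return 0       # guaranteed failure on exploitable inputs
        return int(random.random() < avg_accuracy)
    return model

def reliable_exploits(model, pool, tries=5):
    """Adversary keeps only prompts that make the model fail every time."""
    return [x for x in pool if all(model(x) == 0 for _ in range(tries))]

model_a = make_model(avg_accuracy=0.80, exploitable_fraction=0.001)  # 0.1% exploitable
model_b = make_model(avg_accuracy=0.84, exploitable_fraction=0.05)   # 5% exploitable

benchmark = random.sample(range(N_INPUTS), 1_000)              # static i.i.d. dataset
attack_pool = random.sample(range(N_INPUTS), ADVERSARY_BUDGET) # attacker's probes

for name, m in (("A", model_a), ("B", model_b)):
    acc = sum(m(x) for x in benchmark) / len(benchmark)
    exploits = reliable_exploits(m, attack_pool)
    print(f"model {name}: benchmark accuracy ~{acc:.2f}, "
          f"reliable exploits found: {len(exploits)}/{ADVERSARY_BUDGET}")
```

On the benchmark the two stubs look roughly interchangeable; the attacker's column is what actually drives the abuse and PR cost.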