Yeah, this is basically my point. Not sure whether you are agreeing or disagreeing. I was specifically quoting Paul’s comment saying “I’ve seen only modest qualitative differences” in order to disagree and say “I think we’ve now seen substantial qualitative differences”.
We have had 4chan play around with ChatGPT for a while, with much less disastrous results than what happened when they got access to Sydney.
It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.
I wish it were true that this isn’t news to anyone here, but that does not currently seem true to me. Doesn’t seem worth going into, though.
I was elaborating in more ML-y jargon, and also highlighting that there are a lot of wildcards omitted from Paul’s comparison: retrieval especially was an interesting dynamic.