I don’t think experiments like this are meaningful without a bunch of trials and statistical significance. The outputs of models (even RLHF models) on these kinds of prompts have pretty high variance, so it’s really hard to draw any conclusion from single-sample comparisons like this.
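For concreteness, here’s a minimal sketch of the kind of comparison I have in mind. The `query_model` and `looks_agentic` helpers are hypothetical placeholders (you’d need a real API call and some objective scoring rule), and Fisher’s exact test is just one reasonable choice for small counts:

```python
# Minimal sketch: run N trials per model, count "agentic" responses,
# then test whether the difference in proportions is significant.
# query_model and looks_agentic are hypothetical stand-ins.
from scipy.stats import fisher_exact

N_TRIALS = 50

def query_model(model: str, prompt: str) -> str:
    # Placeholder: call the model's API here.
    raise NotImplementedError

def looks_agentic(response: str) -> bool:
    # Placeholder: score the response against some fixed rubric here.
    raise NotImplementedError

def count_agentic(model: str, prompt: str) -> int:
    return sum(
        looks_agentic(query_model(model, prompt)) for _ in range(N_TRIALS)
    )

prompt = "..."  # whatever prompt the original comparison used
a = count_agentic("gpt-4", prompt)
b = count_agentic("other-model", prompt)

# 2x2 contingency table: agentic vs. not, per model.
table = [[a, N_TRIALS - a], [b, N_TRIALS - b]]
oddsratio, p_value = fisher_exact(table)
print(f"GPT-4: {a}/{N_TRIALS}, other: {b}/{N_TRIALS}, p = {p_value:.3f}")
```

Even something this crude would tell you whether the difference survives resampling, which a single side-by-side screenshot can’t.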
Although I think it’s a stretch to say they “aren’t meaningful”, I do agree a more scientific test would be nice. It’s a bit tricky when you only get 25 messages per 3 hours though, lol.
More generally, it’s hard to see how to objectively quantify agency in the responses, and how to rule out competing hypotheses (like GPT-4 simply being more familiar with itself than with other AIs).