So from what I can see, this was just one trial per (prompt, model) pair? That seems pretty brittle; it might be more informative to look at the distribution of scores over eleven responses each or something, especially if we care less about the average than about whether a user can take the most helpful response after several queries.
That would definitely be better, although it would mean reading/scoring 1056 different responses, unless I can automate the scoring process. (Would LLMs object to doing that?)
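A minimal sketch of what that best-of-k analysis could look like, assuming the scores for each (prompt, model) pair are already collected somewhere (the pair names, scores, and k = 11 here are purely hypothetical placeholders):

```python
import random
from statistics import mean

# Hypothetical scores: for each (prompt, model) pair, the scores of
# k independently sampled responses (k = 11, as suggested above).
# In practice these would come from the grading step, manual or automated.
scores = {
    ("prompt_01", "model_a"): [6, 7, 5, 8, 6, 7, 9, 6, 5, 7, 8],
    ("prompt_01", "model_b"): [4, 9, 3, 8, 2, 9, 7, 3, 8, 4, 9],
}

def best_of_k(sample_scores, k, trials=10_000):
    """Estimate the score a user would end up with after k queries,
    by resampling k scores (with replacement) and taking the best one."""
    return mean(max(random.choices(sample_scores, k=k)) for _ in range(trials))

for (prompt, model), s in scores.items():
    print(f"{prompt} / {model}: mean={mean(s):.2f}, "
          f"best-of-3≈{best_of_k(s, 3):.2f}, best-of-11≈{best_of_k(s, 11):.2f}")
```

The point of looking at it this way is that a high-variance model can look worse on the mean but better on best-of-k, which is exactly the kind of difference a single trial per pair can't surface.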