So from what I can see, this was just one trial per (prompt, model) pair? That seems pretty brittle; it might be more informative to look at the distribution of scores over eleven responses each or something, especially if we care less about the average than about whether a user can take the most helpful response after several queries.
That would definitely be better, although it would mean reading/scoring 1056 different responses, unless I can automate the scoring process. (Would LLMs object to doing that?)
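A minimal sketch of what that best-of-k analysis could look like, assuming the scores for each (prompt, model) pair are already collected somewhere (the pair names, scores, and k = 11 here are purely hypothetical placeholders):

```python
import random
from statistics import mean

# Hypothetical scores: for each (prompt, model) pair, the scores of
# k independently sampled responses (k = 11, as suggested above).
# In practice these would come from the grading step, manual or automated.
scores = {
    ("prompt_01", "model_a"): [6, 7, 5, 8, 6, 7, 9, 6, 5, 7, 8],
    ("prompt_01", "model_b"): [4, 9, 3, 8, 2, 9, 7, 3, 8, 4, 9],
}

def best_of_k(sample_scores, k, trials=10_000):
    """Estimate the score a user would end up with after k queries,
    by resampling k scores (with replacement) and taking the best one."""
    return mean(max(random.choices(sample_scores, k=k)) for _ in range(trials))

for (prompt, model), s in scores.items():
    print(f"{prompt} / {model}: mean={mean(s):.2f}, "
          f"best-of-3≈{best_of_k(s, 3):.2f}, best-of-11≈{best_of_k(s, 11):.2f}")
```

The point of looking at it this way is that a high-variance model can look worse on the mean but better on best-of-k, which is exactly the kind of difference a single trial per pair can't surface.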