Benchmarks are weird. Imagine judging a human solely on their ability to take a test. It's like asking: how do we measure Einstein? By his ability to take a test. Then anyone else who completes that test IS Einstein. That doesn't follow at all: you can game tests in ways that aren’t ‘cheating’, just by studying the relevant material (all the online content ever).
An LLM’s ability to properly guide someone through a procedure is actually the correct way to evaluate language models. Not written descriptions or solutions, but guiding someone step by step through something impressive. Can the model help me make a
Or, even without a human, completing a task step by step on its own.