Would you be a better RLHF labeler than GPT-4?
Have you ever tried to label your own data? Have you ever tried to run an evaluation, yourself, on a system that you've built? It's difficult. We humans are imperfect. We are inconsistent labelers. We are also expensive, and difficult to scale.
Let's say you wanted to create a high-quality instruction dataset, today. What would produce the better output? A high-resource model? Or a farm of hired humans?
How many human RLHF labels are incorrect?
How many benchmarks are our machines now superhuman at?
If we’re not bootstrapped, we will be soon.
See Alpaca, Constitutional AI, pre-training with preferences.
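The model-as-labeler idea behind these approaches can be sketched in a few lines: ask a strong model to judge which of two candidate responses is better, and record the verdict as an RLHF-style preference pair. This is a minimal sketch, not any particular paper's pipeline; `query_model` is a hypothetical stand-in for a real model API, stubbed here with a toy length heuristic so the sketch runs.

```python
# Sketch: using a strong model, rather than a human farm, to produce
# RLHF-style preference labels. Prompt and function names are illustrative.

JUDGE_PROMPT = (
    "Given the instruction and two candidate responses, answer 'A' if "
    "response A is better, otherwise 'B'.\n"
    "Instruction: {instruction}\nA: {a}\nB: {b}\nAnswer:"
)

def query_model(prompt: str) -> str:
    # Hypothetical stub: a real implementation would call a high-resource
    # model here. This placeholder just prefers the longer response.
    a = prompt.split("\nA: ")[1].split("\nB: ")[0]
    b = prompt.split("\nB: ")[1].split("\nAnswer:")[0]
    return "A" if len(a) >= len(b) else "B"

def label_preference(instruction: str, resp_a: str, resp_b: str) -> dict:
    """Return one preference record, labeled by the judge model."""
    verdict = query_model(
        JUDGE_PROMPT.format(instruction=instruction, a=resp_a, b=resp_b)
    )
    chosen, rejected = (resp_a, resp_b) if verdict == "A" else (resp_b, resp_a)
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}

record = label_preference(
    "Explain overfitting in one sentence.",
    "Overfitting is when a model memorizes training noise instead of "
    "learning the underlying pattern.",
    "It's bad.",
)
print(record["chosen"][:11])
```

Swap the stub for a real API call and run it over a pool of sampled responses, and you have the core loop these methods scale up.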