The problem isn’t just learning whole human models. RLHF will select for any heuristic/strategy which, even by accident, hides bad behavior from humans. It applies even at low capabilities.
The problem isn’t just learning whole human models. RLHF will select for any heuristic/strategy which, even by accident, hides bad behavior from humans. It applies even at low capabilities.