Seems like at some point we’ll need to train on outputs too complex for humans to evaluate, and then we’ll end up using training methods based on outcomes in some simulation.
I agree. Personally, my main takeaway is that it’s unwise to extrapolate alignment dynamics from the empirical results of current methods. But that’s a somewhat different line of argument, which I made in “Where do you get your capabilities from?”.