Because RLHF works by optimizing outputs against human judgments, we shouldn't be surprised when AI models produce wrong answers that are specifically hard for humans to distinguish from right ones: errors a human rater could easily spot are exactly the errors the training process selects against.
This appears, observably, to generalize across all humans: it is not (say) trivially possible to train an AI on feedback from only some strict, distinguished subset of humanity and have any wrong answers it produces be easily spotted by the excluded humans.
Such wrong answers that look right at first glance also observably exist. So if there is anything like a projection-onto-a-subspace going on here, we should expect that the "viewpoints" from which the projection is taken, ranging over every adjudicating human mind, are all clustered in some low-dimensional subspace of the space of possible viewpoints, and perhaps even around a single point.
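To make the projection picture concrete, here is a minimal toy sketch of my own (not anything from the original argument; every dimension and name below is made up for illustration): candidate answers live in a high-dimensional feature space, a human rater only checks the components lying in a low-dimensional "viewpoint" subspace, and RLHF-style selection keeps whatever looks best under that projection.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 50           # hypothetical dimensionality of answer-feature space
EVAL_DIM = 3       # the few directions a human rater actually checks
N_CANDIDATES = 1000

truth = rng.normal(size=DIM)                                # the "right answer"
candidates = truth + rng.normal(size=(N_CANDIDATES, DIM))   # noisy candidate answers

# Human "viewpoint": a random low-dimensional subspace of the full feature space.
V = np.linalg.qr(rng.normal(size=(DIM, EVAL_DIM)))[0]       # orthonormal basis, DIM x EVAL_DIM

def visible_error(answer):
    """Error the rater can see: the discrepancy projected into their subspace."""
    return np.linalg.norm(V.T @ (answer - truth))

def true_error(answer):
    """Full discrepancy from the right answer, most of it invisible to the rater."""
    return np.linalg.norm(answer - truth)

# Caricature of RLHF-style selection pressure: keep whatever looks best to the rater.
best = min(candidates, key=visible_error)

print(f"visible error of selected answer:  {visible_error(best):.2f}")
print(f"true error of selected answer:     {true_error(best):.2f}")
print(f"mean true error of all candidates: {np.mean([true_error(c) for c in candidates]):.2f}")
```

Under these toy assumptions, the selected answer scores nearly perfectly on the dimensions the rater can see while remaining about as wrong as a typical candidate overall; and if every rater's subspace is clustered near the same few directions, adding more raters does not recover the hidden error.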
This is why I'd agree that RLHF was specifically a bad tradeoff of capabilities improvement against safety/desirability outcomes, while still remaining agnostic about the absolute size of that tradeoff.