I wonder if this entails that RLHF, while currently useful for capabilities, will eventually become an alignment tax. Namely, OpenAI might have text evaluators discourage the LM from writing self-calling, agenty-looking code.
So in thinking about alignment futures that lie at the limit of RLHF, these feel like two fairly different forks of that future.