Is separate evaluation plus context-insensitive aggregation actually how helpful + harmless is implemented in the reward model of any major LLM? My understanding is that Anthropic fine-tunes on a mixture of specialized training sets (plus other supervision that's more holistic), which is sort of like this, but it lets the model generalize in a way that compresses the data rather than just preserving a fixed helpful/harmless tradeoff.
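For concreteness, here's a minimal sketch of what I mean by "separate evaluation + context-insensitive aggregation" — this is entirely hypothetical, not how any real reward model is built: two independently parameterized scoring heads whose outputs get combined with a fixed weight, no matter what the prompt is.

```python
import torch
import torch.nn as nn

class ContextInsensitiveRewardModel(nn.Module):
    """Hypothetical sketch: separate helpful/harmless heads combined
    with a constant weight that ignores the prompt's context."""

    def __init__(self, hidden_dim: int, harmless_weight: float = 0.5):
        super().__init__()
        # Each objective is scored by its own head, independently.
        self.helpful_head = nn.Linear(hidden_dim, 1)
        self.harmless_head = nn.Linear(hidden_dim, 1)
        # The tradeoff is a fixed scalar, not a function of the input --
        # this is the "context-insensitive aggregation" part.
        self.harmless_weight = harmless_weight

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        helpful = self.helpful_head(response_embedding)
        harmless = self.harmless_head(response_embedding)
        w = self.harmless_weight
        return (1 - w) * helpful + w * harmless
```

Training a single reward model on a mixture of helpful-preference and harmless-preference data is different in kind: the learned scoring function can weigh the two concerns differently depending on the prompt, rather than being stuck with one global tradeoff.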
Anyhow, of course we’d like to use the “beneficial for humanity” goal, but sadly we don’t have access to it at the moment :D Working on it.