Ideas come from unsupervised training, answers from supervised training and proofs from RL on a specified reward function.
I think that holds only for particular reward functions, such as those in multi-agent or co-operative environments (where the agents can include humans, as in RLHF) or in genuinely interactive proving environments.