Thoughts:
Seems like useful work.
With RLHF, my understanding is that when you push super hard for high reward you end up with nonsense results, so you have to settle for quantilization or some similar relaxation of maximization. Do you find similar things when selecting for ‘best incorporates the feedback’?
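To make the question concrete, here is a minimal sketch of the distinction I have in mind, assuming a best-of-n style selection step. All names here (`argmax_select`, `quantilize`, `reward`, `q`) are hypothetical and not taken from the paper; `reward` stands in for any scorer, including a ‘how well does this incorporate the feedback’ judge.

```python
import math
import random


def argmax_select(candidates, reward):
    """Pure maximization: always return the single highest-scoring candidate."""
    return max(candidates, key=reward)


def quantilize(candidates, reward, q=0.1, rng=None):
    """Relaxed maximization: pick uniformly at random from the top q fraction.

    This is a finite-sample quantilizer over the given candidates. It is an
    illustrative assumption, not the selection rule used in the paper.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    ranked = sorted(candidates, key=reward, reverse=True)
    # Always keep at least one candidate, even for very small q.
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])


if __name__ == "__main__":
    # Toy stand-in: candidates are integers and the "reward" is the value itself.
    candidates = list(range(100))
    print(argmax_select(candidates, reward=lambda c: c))
    print(quantilize(candidates, reward=lambda c: c, q=0.1, rng=random.Random(0)))
```

With `q=1` this reduces to sampling from the base candidates, and as `q` shrinks it approaches pure maximization, so the question is roughly where on that dial ‘best incorporates the feedback’ starts to go wrong.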
Have we really pushed the boundaries of what language models giving themselves feedback are capable of? I’d expect SotA systems to be good enough at giving feedback that they could perform every step in these algorithms, including the human feedback, especially for the easier categories of feedback, which opens up the possibility of unlimited cheap finetuning. Nonetheless, I don’t think we’ve reached the point of reflexive endorsement that I’d expect to result from this process (GPT-4 still produces harmful/hallucinated completions that I expect it would be able to recognise). I expect it must be one of:
It in fact is at reflexive equilibrium / it wouldn’t actually recognise these failures
OAI haven’t tried pushing it to the limit
This process doesn’t actually result in reflexive endorsement, probably because it only reaches reflexive equilibrium within the narrow distribution on which this training occurs.
OAI stop before this point for other reasons, most likely degradation of performance.
Not sure which of these is true though?
I expect the core algorithm to be helpful, since we’re stuck with RLHF-type work at the moment, but having a paper focused on accurate code generation seems to push the dangerous side of a dual-use capability to the fore.