RLHF is a trial-and-error approach. For superhuman AGI, that amounts to letting it kill everybody, and then telling it that this is bad, don’t do it again.
RLHF is not a trial-and-error approach. Rather, it is primarily a computational and mathematical method that promises to converge to a state that generalizes human feedback. This means that RLHF is physically incapable of developing “self-agendas” such as destroying humanity unless the human feedback implies it. Human feedback can vary, and there is always a lot of trial and error involved in answering certain questions, as is the case with any technology. However, there is no reason to believe that the method will completely ignore the mathematics that underpins it and end up killing us all.
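To make the "computational and mathematical method" point concrete, here is a minimal sketch of one step commonly described in RLHF: fitting a reward model to human preference pairs. It assumes a Bradley-Terry style preference loss and a linear reward, neither of which is specified above. The feature names, the toy data, and the functions `fit_reward_weights` and `reward` are all hypothetical and exist only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(w, features):
    """Linear reward: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, features))


def fit_reward_weights(preferences, n_features, lr=0.5, steps=500):
    """Fit reward weights from (preferred, rejected) feature pairs.

    Minimises the Bradley-Terry negative log-likelihood
    -log sigmoid(r(preferred) - r(rejected)) by plain gradient descent.
    """
    w = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for good, bad in preferences:
            diff = [g - b for g, b in zip(good, bad)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Gradient of -log sigmoid(margin) w.r.t. w
            # is -(1 - sigmoid(margin)) * diff.
            coeff = -(1.0 - sigmoid(margin))
            for i, di in enumerate(diff):
                grad[i] += coeff * di
        w = [wi - lr * gi / len(preferences) for wi, gi in zip(w, grad)]
    return w


# Hypothetical toy data: each answer is described by
# [helpfulness, harmfulness]; the first item of each pair is
# the one the human labeller preferred.
preferences = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.9, 0.9]),
    ([0.5, 0.0], [0.5, 0.6]),
]

weights = fit_reward_weights(preferences, n_features=2)
for good, bad in preferences:
    print(reward(weights, good) > reward(weights, bad))
```

In this toy setup the fitted reward ranks every preferred answer above the rejected one, and the weight on the harmfulness feature comes out negative. That only shows the reward-fitting step following the feedback it was given; it does not cover the later policy-optimization stage.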
Claiming that RLHF is a trial-and-error approach and therefore poses a risk to humanity is like suggesting that airplanes can fall from the sky in violation of the laws of physics because airplane design is a trial-and-error process and there is no single solution for the perfect wing shape. Or it is like saying that the trial and error involved in designing a car engine could result in a sudden nuclear explosion.
It is important to distinguish between what is mathematically proven and what is fictional. Doing so is crucial if we are to avoid wasting time and energy on implausible or even impossible scenarios and shift our focus to real issues that might actually affect humanity.