Can you talk more about why RL4HF is “concealing problems”? Do you mean “attempting alignment” in a way that other people won’t, or something else?
Roughly, “avoid your actions being labelled as bad by humans [or models of humans]” is not quite the same signal as “don’t be bad”.
Ah ok, so you’re saying RL4HF is bad when it’s used to train the action model. But it seems fine if it’s done to the reward model, right?
What do you mean by “RLHF is done to the reward model”, and why would that be fine?
You can ask an LLM what actions to take, or you can ask an LLM “hey, is this a good world state?” The latter seems like it might capture a lot of human semantics about value, given RL4HF.
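To make that action-model / reward-model distinction concrete, here is a minimal sketch of the two uses. The model names, prompts, and the idea of scoring a bare state description with an off-the-shelf preference model are illustrative assumptions, not anything specified in this thread; reward models of this kind are normally trained on (prompt, response) pairs, so using one to rate a world-state description is an off-label illustration.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# (1) LLM as the action model: ask it what to do next.
#     RL4HF applied here shapes the actions the model proposes.
actor = pipeline("text-generation", model="gpt2")  # stand-in; any causal LM
plan = actor("The lab is running low on reagent X. Next step:", max_new_tokens=30)
print(plan[0]["generated_text"])

# (2) LLM as the evaluator: a preference/reward model scores a *description of a
#     world state* rather than proposing an action. The model name below is an
#     illustrative choice of a publicly available RLHF-style reward model.
RM_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)

def score_state(description: str) -> float:
    """Return the reward model's scalar score for how 'good' the described state sounds."""
    inputs = tokenizer(description, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()

print(score_state("Everyone in the lab is safe and the experiment succeeded."))
print(score_state("The experiment succeeded, but two technicians were hospitalised."))
```

The sketch is only about the interface difference: in (1) the LLM’s output is an action, so RL4HF is shaping behaviour directly, whereas in (2) its output is a scalar judgement of a state, which some separate planner would then consume.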