Roughly, “avoid your actions being labelled as bad by humans [or models of humans]” is not quite the same signal as “don’t be bad”.
Ah ok, so you’re saying RLHF is bad when it’s applied to the action model. But it seems fine if it’s done to the reward model, right?
What do you mean by “RLHF is done to the reward model”, and why would that be fine?
You can use an LLM to decide what actions to take, or you can ask an LLM “hey, is this a good world state?” The latter seems like it might capture a lot of the human semantics of value, given RLHF.
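A minimal sketch of that distinction, assuming only a generic `query_llm` call (hypothetical, not any specific API): the same model can either propose actions directly, or score world states, with the score then used as a reward signal by some separate planner or policy.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your actual model interface."""
    raise NotImplementedError

# (1) LLM as the action model: ask it what to do next.
def propose_action(observation: str) -> str:
    return query_llm(
        f"Given this situation, what action should be taken?\n{observation}"
    )

# (2) LLM as the reward model: ask it to evaluate a world state, and use
#     that judgement as a reward signal for a separate planner/policy.
def evaluate_state(world_state: str) -> float:
    answer = query_llm(
        "On a scale from 0 (very bad) to 1 (very good), how good is this "
        f"world state for humans? Reply with just a number.\n{world_state}"
    )
    try:
        return float(answer.strip())
    except ValueError:
        return 0.0  # fall back to a low score on unparseable output
```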