Daniel Kokotajlo comments on Approaches to gradient hacking

Daniel Kokotajlo 15 Aug 2021 9:49 UTC
LW: 4 AF: 2
AF
Great, now GPT-4 and beyond have an instruction manual in their training data! :D
...haha kidding not kidding...
I guess the risk is pretty minimal, since probably any given AI will be either dumb enough to not understand this post or smart enough to not need to read it.
- adamShimi 15 Aug 2021 12:35 UTC
  LW: 5 AF: 4
  AF Parent
  A “simple” solution I just thought about:just convince people training AGIs to not scrap the AF or Alignment literature.
  - Daniel Kokotajlo 15 Aug 2021 12:43 UTC
    LW: 14 AF: 12
    AF Parent
    Simpler: Ask the LW team to make various posts (this one, any others that seem iffy) non-scrapable, or not-easily-scrapable. I think there’s a button they can press for that.
- adamShimi 15 Aug 2021 12:33 UTC
  LW: 4 AF: 3
  AF Parent
  I plan to think about that (and infohazards in general) in my next period of research, starting tomorrow. ;)
  
  My initial take is that this post is fine because every scheme proposed is really hard and I’m pointing the difficulty.
  
  Two clear risks though:
  - An AGI using that thinking to make these approaches work
  - The AGI not making mistakes in gradient hacking because it knows to not use these strategies (which assumes a better one exists)
  (Note that I also don’t expect GPT-N to be deceptive, though it might serve to bootstrap a potentially deceptive model)
  - Daniel Kokotajlo 15 Aug 2021 12:44 UTC
    LW: 2 AF: 2
    AF Parent
    I am keenly interested in your next project and will be happy to chat with you if you think that would be helpful. If not, I look forward to seeing the results!
    - adamShimi 16 Aug 2021 13:27 UTC
      LW: 2 AF: 1
      AF Parent
      Sure. I started studying Bostrom’s paper today; I’ll send you a message for a call when I read and thought enough to have something interesting to share and debate.
- lberglund 21 Sep 2022 15:01 UTC
  1 point
  Parent
  Might be too late now, but is it worth editing the post to include the canary string to prevent use in LM training, just like this post does?