I am keenly interested in your next project and will be happy to chat with you if you think that would be helpful. If not, I look forward to seeing the results!
Sure. I started studying Bostrom’s paper today; I’ll send you a message to set up a call once I’ve read and thought enough to have something interesting to share and debate.
I plan to think about that (and infohazards in general) in my next period of research, starting tomorrow. ;)
My initial take is that this post is fine because every scheme proposed is really hard, and I’m pointing out the difficulty.
Two clear risks, though:
An AGI using this line of thinking to make these approaches work
The AGI not making mistakes in gradient hacking because it knows not to use these strategies (which assumes a better one exists)
(Note that I also don’t expect GPT-N to be deceptive, though it might serve to bootstrap a potentially deceptive model)