Great, now GPT-4 and beyond have an instruction manual in their training data! :D
...haha kidding not kidding...
I guess the risk is pretty minimal, since probably any given AI will be either dumb enough to not understand this post or smart enough to not need to read it.
Simpler: Ask the LW team to make various posts (this one, any others that seem iffy) non-scrapable, or not-easily-scrapable. I think there’s a button they can press for that.
I am keenly interested in your next project and will be happy to chat with you if you think that would be helpful. If not, I look forward to seeing the results!
Sure. I started studying Bostrom’s paper today; I’ll send you a message for a call when I read and thought enough to have something interesting to share and debate.
Great, now GPT-4 and beyond have an instruction manual in their training data! :D
...haha kidding not kidding...
I guess the risk is pretty minimal, since probably any given AI will be either dumb enough to not understand this post or smart enough to not need to read it.
A “simple” solution I just thought about:just convince people training AGIs to not scrap the AF or Alignment literature.
Simpler: Ask the LW team to make various posts (this one, any others that seem iffy) non-scrapable, or not-easily-scrapable. I think there’s a button they can press for that.
I plan to think about that (and infohazards in general) in my next period of research, starting tomorrow. ;)
My initial take is that this post is fine because every scheme proposed is really hard and I’m pointing the difficulty.
Two clear risks though:
An AGI using that thinking to make these approaches work
The AGI not making mistakes in gradient hacking because it knows to not use these strategies (which assumes a better one exists)
(Note that I also don’t expect GPT-N to be deceptive, though it might serve to bootstrap a potentially deceptive model)
I am keenly interested in your next project and will be happy to chat with you if you think that would be helpful. If not, I look forward to seeing the results!
Sure. I started studying Bostrom’s paper today; I’ll send you a message for a call when I read and thought enough to have something interesting to share and debate.
Might be too late now, but is it worth editing the post to include the canary string to prevent use in LM training, just like this post does?