I am keenly interested in your next project and will be happy to chat with you if you think that would be helpful. If not, I look forward to seeing the results!
Sure. I started studying Bostrom’s paper today; I’ll send you a message to set up a call once I’ve read and thought enough to have something interesting to share and debate.
I plan to think about that (and infohazards in general) in my next period of research, starting tomorrow. ;)
My initial take is that this post is fine because every scheme proposed is really hard, and I’m pointing out the difficulty.
Two clear risks, though:
An AGI using this line of thinking to make these approaches work
The AGI not making mistakes in gradient hacking because it knows not to use these strategies (which assumes a better one exists)
(Note that I also don’t expect GPT-N to be deceptive, though it might serve to bootstrap a potentially deceptive model)