Sam Bowman comments on Artificial Sandwiching: When can we test scalable alignment protocols without humans?

Sam Bowman 28 Jul 2022 22:39 UTC
Thanks! I’ll admit that I meant to be asking especially about the toxicity case, though I didn’t make that at all clear. As in Charlie’s comment, I’m most interested in using this approach as a way to efficiently explore and pilot techniques that we can ultimately adapt back to humans, and text-based interactions seem like a good starting point for that kind of work.
I don’t see a clear picture either way on whether the noisy-signal story presents a hard problem that’s distinctively alignment-oriented.