I’m still fairly optimistic about sandwiching. I deliberately considered a set of pretty naive strategies (the “naive safety effort” assumption) to contrast with future posts, which will explore strategies that seem more promising. Carefully constructed versions of debate, amplification, recursive reward modeling, etc. seem like they could make a significant difference and could be tested through a framework like sandwiching.
Thanks for clearing that up. This is super helpful as context for understanding this post.