Based on this post, I was wondering whether your views have shifted away from your proposal in "The case for aligning narrowly superhuman models" and the "sandwiching" problem you propose there. I am not sure this question is warranted given your new post. But it seems to me that some of the potential projects you propose for narrowly aligning superhuman models are similar to the things you describe in this new post as making a full-blown AI takeover more likely. Or, put differently: are those two sides of the same coin, depending on the situation (e.g. who pursues it and how), or did you update your views away from that previous post?
I’m still fairly optimistic about sandwiching. I deliberately considered a set of pretty naive strategies (the "naive safety effort" assumption) to contrast with future posts, which will explore strategies that seem more promising. Carefully constructed versions of debate, amplification, recursive reward-modeling, etc. seem like they could make a significant difference and could be tested through a framework like sandwiching.
Thanks for clearing that up. This is super helpful as context for understanding this post.