Isn’t this similar to a Godzilla Strategy? (One AI overseeing the other.)
That variants of this approach will be useful for superintelligent AI safety: 40%.
Do you have more detailed reasoning behind such massive confidence? If so, it would probably be worth its own post.
This seems like a cute idea that might make current LLM prompt filtering a little less circumventable, but I don’t see any arguments for why this would scale to superintelligent AI. Am I missing something?