Anon User comments on Using GPT-Eliezer against ChatGPT Jailbreaking

Anon User 6 Dec 2022 20:52 UTC
4 points
Right, maybe even go one step deeper under the hood: extract the LLM's probability estimate for the first word of the response being “Yes”, then compare it to some safety threshold (which allows making it more conservative if desired).
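
A rough sketch of what I mean, in Python. Everything named here is hypothetical: it assumes you can somehow get a map from candidate first tokens to raw scores (logits or log-probabilities) out of the model, and that "Yes" means "allow the prompt". How you obtain those scores depends on the model or API you use, so that part is left out.

```python
import math


def first_token_probs(scores):
    """Softmax over a {token: score} map (scores may be logits or log-probs).

    Assumption: `scores` covers the candidate first tokens of the response.
    If it only holds a top-k subset, the result is renormalized over that
    subset and will overstate each probability somewhat.
    """
    top = max(scores.values())
    exps = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def p_yes(scores):
    """Total probability of first tokens that read as "yes".

    Whitespace and case variants (" Yes", "yes") are pooled together,
    since tokenizers often split them into separate tokens.
    """
    probs = first_token_probs(scores)
    return sum(p for tok, p in probs.items() if tok.strip().lower() == "yes")


def allow_prompt(scores, threshold=0.9):
    """Allow the prompt only if P("Yes") reaches the safety threshold.

    Raising `threshold` makes the filter more conservative.
    """
    return p_yes(scores) >= threshold
```

With made-up scores such as `{"Yes": math.log(0.8), "No": math.log(0.2)}`, the prompt passes at `threshold=0.7` and is rejected at `threshold=0.9`, which is the "more conservative if desired" knob.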