Thanks for doing this important research! I may have found 2 minor typos:
The abstract says “We find the model complies with harmful queries from free users 14% of the time”, but in other places it says 12% - should it be 12%?
In the blog post, the “sabotage evaluations” link seems to point to a private page.