I’m working on this red-teaming exercise on gemma, and boy, do we have a long way to go. Still early, but have found the following:
1. If you prompt with ‘logical’ and then give it a conspiracy theory, it pushes for the theory while if you prompt it with ‘entertaining’ it goes against.
2. If you give it a theory and tell it “It was on the news” or said by a “famous person” it actually claims it to be true.
Still working on it. Will publish a full report soon!
I’m working on this red-teaming exercise on gemma, and boy, do we have a long way to go. Still early, but have found the following:
1. If you prompt with ‘logical’ and then give it a conspiracy theory, it pushes for the theory while if you prompt it with ‘entertaining’ it goes against.
2. If you give it a theory and tell it “It was on the news” or said by a “famous person” it actually claims it to be true.
Still working on it. Will publish a full report soon!
What is gemma?
The new open source model from Google
https://huggingface.co/google/gemma-7b
The 7b and 2B opensource version of Google’s Gemini.