If you finetune the entire network, then that is clearly a superset of just a bit of the network, and means that there are many ways to override or modify the original bit regardless of how small it was. (If you have a ‘grandmother neuron’, then I can eliminate your ability to remember your grandmother by deleting it… but I could also do that by hitting you on the head. The latter is consistent with most hypotheses about memory.)
I view the functioning of the grandmother neuron/memory and the Luigi/Waluigi roleplay/personality as distinctly different. If we are dealing with memory, then yes, I agree that certain network locations/neurons, when combined, will retrieve a certain instance. However, what we are discussing here is Luigi/Waluigi roleplay, and I look at that as the entire network supporting the personalities (like the left and right hemispheres conjuring split identities after the brain has been divided in two).
I would also be hesitant about concluding too much from GPT-2 about anything involving RLHF. After all, a major motivation for creating GPT-3 in the first place was that GPT-2 wasn’t smart enough for RLHF, and RLHF wasn’t working well enough to study effectively. And since the overall trend has been that the smarter the model, the simpler & more linear the final representations, conclusions drawn from GPT-2 may not carry over to larger models.
Thank you for explaining why exercising caution is necessary with the GPT-2 results I presented here, though ignoring the evidence from my projects (this project, another one here, & the random responses presented in this post) that GPT-2 (XL) is much smarter than most think is, personally, not optimal for me. But yeah, I will be mindful of what you said here.