Hello there,
Are you interested in funding a theory of mine that I submitted to the AI Alignment Awards? I have gotten it to work in GPT-2 and am now writing up the results. I was able to make GPT-2 shut itself down (100% of the time), even when it is aware of the shutdown instruction, called “the Gauntlet”, which is embedded by fine-tuning on an artificially generated archetype called “the Guardian”, essentially solving corrigibility as well as outer and inner alignment.
https://twitter.com/whitehatStoic/status/1646429585133776898?t=WymUs_YmEH8h_HC1yqc_jw&s=19
Let me know if you are interested. I want to test it on larger models such as LLaMA and Alpaca, but I don’t have the means to finance the equipment.
I also found that GPT-2 behaves oddly at one temperature setting: in the range of 0.498 to 0.50, my shutdown code works really well, though I still don’t know why. I believe this is an incentive to review what is happening inside the transformer architecture.
Here was my original proposal: https://www.whitehatstoic.com/p/research-proposal-leveraging-jungian
I’ll also post my paper on the corrigibility solution once it is finished, probably next week. If you wish to contact me, just reply here or email me at migueldeguzmandev@gmail.com.
If you want to see my meeting schedule, you can find it here: https://calendly.com/migueldeguzmandev/60min
Looking forward to hearing from you.
Best regards,
Miguel
Update: I have already sent an application; I didn’t see that on my first read. Thank you.