Sorry for not specifying the method; I wasn't referring to RL-based or supervised learning methods. There's a lot of promise in fine-tuning on a smaller dataset that describes corrigibility traits together with a shutdown mechanism, using an unsupervised (self-supervised) objective.
I have a prototype at this link where I modified GPT2-XL to emit a shutdown phrase whenever its attention mechanisms jointly determine that it could harm humans because of its intelligence. I used unsupervised fine-tuning to let the patterns in a smaller dataset achieve this.
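For concreteness, here is a minimal sketch of the kind of "unsupervised" fine-tuning described above: a causal language-modeling objective over a small dataset pairing harm-related prompts with a shutdown phrase. The tiny randomly initialized GPT-2 config, the random token-id dataset, and the training hyperparameters are all illustrative stand-ins so the sketch runs without downloading weights; the actual prototype used pretrained GPT2-XL and a real corrigibility corpus.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny randomly initialized GPT-2 so this sketch runs self-contained;
# the real prototype would load pretrained GPT2-XL weights instead.
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=32,
                    n_layer=1, n_head=2)
model = GPT2LMHeadModel(config)

# Toy token-id "dataset": each row stands in for a text pairing a
# harm-description prefix with the shutdown phrase as its continuation.
data = torch.randint(0, 100, (8, 16))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for step in range(3):
    # Causal-LM objective: passing labels=input_ids makes the model
    # compute next-token cross-entropy internally (the "unsupervised" loss).
    out = model(input_ids=data, labels=data)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

After a few epochs of this objective on the small corrigibility dataset, the shutdown phrase becomes a high-probability continuation of harm-related contexts, which is the behavior the prototype aims for.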