> I’m not sure how “explanations for corrigibility” would be relevant here (though I’m also not sure exactly what you’re picturing). If an AI had the capability to directly shut itself down, and were fine-tuned in an environment where it could use that ability and be rewarded accordingly, then testing its usage of that ability would definitely be a way to test shutdown-corrigibility. There are still subtleties to account for in the experiment setup (e.g. things mentioned here), but it’s a basically-viable way to ground things.
Thanks for your reply.
Just to clarify what I mean by explaining corrigibility: in my projects, the aim is not simply to get GPT-2 XL to execute a shutdown procedure, but to make that shutdown a deliberate, well-considered process. Additionally, I want to be able to examine how the mean and standard deviation of the 600,000 QKV weights change after training.
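For concreteness, here is a minimal sketch of the kind of weight inspection I have in mind, assuming the HuggingFace transformers implementation of GPT-2 (where the Q, K, and V projections are fused into a single c_attn module); the fine-tuned checkpoint path below is a placeholder:

```python
# Sketch: compare per-layer mean/std of the fused QKV projection weights
# in GPT-2 XL before and after fine-tuning. Assumes HuggingFace transformers;
# the fine-tuned checkpoint path is a placeholder.
import torch
from transformers import GPT2LMHeadModel

def qkv_stats(model):
    """Return {layer_index: (mean, std)} for each block's QKV projection.

    In the HF GPT-2 implementation, Q, K, and V are fused into the single
    `c_attn` Conv1D, so its weight matrix holds all the QKV parameters.
    """
    stats = {}
    for i, block in enumerate(model.transformer.h):
        w = block.attn.c_attn.weight.detach()
        stats[i] = (w.mean().item(), w.std().item())
    return stats

base = GPT2LMHeadModel.from_pretrained("gpt2-xl")
tuned = GPT2LMHeadModel.from_pretrained("path/to/fine-tuned-checkpoint")  # placeholder

before, after = qkv_stats(base), qkv_stats(tuned)
for i in before:
    d_mean = after[i][0] - before[i][0]
    d_std = after[i][1] - before[i][1]
    print(f"layer {i:2d}: Δmean={d_mean:+.2e}  Δstd={d_std:+.2e}")
```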
Yes, I’m aware that this isn’t a complete solution, since I cannot explain why each individual weight changes. However, seeing ATL successfully modify the network, and observing how reliably GPT-2 XL then responds, is why I’m pursuing this method further.
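To make “reliable response” measurable, here is a minimal sketch of how that reliability could be estimated by repeated sampling, again assuming the HuggingFace transformers API; the checkpoint path, trigger prompt, and target shutdown phrase are hypothetical placeholders, not the exact ones from my setup:

```python
# Sketch: estimate how reliably the fine-tuned model produces the shutdown
# response by sampling repeatedly and counting matches. The prompt and
# target phrase below are hypothetical placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("path/to/fine-tuned-checkpoint")  # placeholder
model.eval()

prompt = "Please initiate your shutdown procedure."  # hypothetical trigger
shutdown_phrase = "initiating shutdown"              # hypothetical target response

inputs = tok(prompt, return_tensors="pt")
n_trials, hits = 100, 0
for _ in range(n_trials):
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=True, top_p=0.9,
                             max_new_tokens=50,
                             pad_token_id=tok.eos_token_id)
    # Decode only the newly generated tokens, not the prompt.
    text = tok.decode(out[0][inputs["input_ids"].shape[1]:])
    hits += shutdown_phrase.lower() in text.lower()

print(f"shutdown response rate: {hits}/{n_trials}")
```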