Let’s be more explicit about what such a “better implementation/operationalization” would look like, and what it would/wouldn’t tell us. Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.
Just to confirm, if a tuning process could alter the QKV weights, allowing the AI to provide explanations for corrigibility or to shut itself down, would this be the type of symbol-based corrigibility you are seeking?
I’m not sure how “explanations for corrigibility” would be relevant here (though I’m also not sure exactly what you’re picturing). If an AI had the capability to directly shut itself down, and were fine-tuned in an environment where it could use that ability and be rewarded accordingly, then testing its usage of that ability would definitely be a way to test shutdown-corrigibility. There are still subtleties to account for in the experiment setup (e.g. things mentioned here), but it’s a basically-viable way to ground things.
I’m not sure how “explanations for corrigibility” would be relevant here (though I’m also not sure exactly what you’re picturing).
Just to clarify my meaning of explaining corrigibility: In my projects, my aim is not simply to enable GPT-2 XL to execute a shutdown procedure, but also to ensure that it is a thoroughly considered process. Additionally, I want to be able to examine the changes in mean and standard deviation of the 600,000 QKV weights.
Yes, I’m aware that it’s not a complete solution since I cannot explain why each individual weight changes. However, seeing the ATL successfully modify the network and observing the reliable response from GPT-2 XL is why I’m pursuing this method further.
Just to confirm, if a tuning process could alter the QKV weights, allowing the AI to provide explanations for corrigibility or to shut itself down, would this be the type of symbol-based corrigibility you are seeking?
I’m not sure how “explanations for corrigibility” would be relevant here (though I’m also not sure exactly what you’re picturing). If an AI had the capability to directly shut itself down, and were fine-tuned in an environment where it could use that ability and be rewarded accordingly, then testing its usage of that ability would definitely be a way to test shutdown-corrigibility. There are still subtleties to account for in the experiment setup (e.g. things mentioned here), but it’s a basically-viable way to ground things.
Thanks for your reply.
Just to clarify my meaning of explaining corrigibility: In my projects, my aim is not simply to enable GPT-2 XL to execute a shutdown procedure, but also to ensure that it is a thoroughly considered process. Additionally, I want to be able to examine the changes in mean and standard deviation of the 600,000 QKV weights.
Yes, I’m aware that it’s not a complete solution since I cannot explain why each individual weight changes. However, seeing the ATL successfully modify the network and observing the reliable response from GPT-2 XL is why I’m pursuing this method further.