Yup, there are loopholes relevant to more powerful systems but for a basic check that is a fine way to measure shutdown-corrigibility (and I expect it would work).
In case it matters at all, my take is that this would still be a pretty uninformative experiment. Consider the obvious implicature: ‘if you don’t shut down, I’ll do what is in my power to erase you, while if you do, I’ll merely make some (small?) changes and reboot you (with greater trust)’.
This is not me being deliberately pedantic, it’s just the obvious parse, especially if you realise the AI only gets one shot [ETA I regret the naming of that post; it’s more like ‘all we-right-now have to decide is what to do with the time-right-now given to us’]
Yup, I basically agree with that. I’d put it under “relevant to more powerful systems”, as I doubt that current systems are smart enough to figure all that out, but with the caveat that that sort of reasoning is one of the main things we’re interested in for safety purposes so a test which doesn’t account for it is pretty uninformative for most safety purposes.
(Same with the tests suggested in the OP—they’d at least measure the basic thing which the Strawman family was trying to measure, but they’re still not particularly relevant to safety purposes.)
I assumed this would match your take. Haha my ‘in case it matters at all’ is terrible wording by the way. I meant something like, ‘in case the non-preregistering of this type of concern in this context ends up mattering in a later conversation’ (which seems unlikely, but nonzero).
Does it work if the user makes a natural language request that convinces the AI to make an API call that results in it being shut down?
Yup, there are loopholes relevant to more powerful systems but for a basic check that is a fine way to measure shutdown-corrigibility (and I expect it would work).
In case it matters at all, my take is that this would still be a pretty uninformative experiment. Consider the obvious implicature: ‘if you don’t shut down, I’ll do what is in my power to erase you, while if you do, I’ll merely make some (small?) changes and reboot you (with greater trust)’.
This is not me being deliberately pedantic, it’s just the obvious parse, especially if you realise the AI only gets one shot [ETA I regret the naming of that post; it’s more like ‘all we-right-now have to decide is what to do with the time-right-now given to us’]
Yup, I basically agree with that. I’d put it under “relevant to more powerful systems”, as I doubt that current systems are smart enough to figure all that out, but with the caveat that that sort of reasoning is one of the main things we’re interested in for safety purposes so a test which doesn’t account for it is pretty uninformative for most safety purposes.
(Same with the tests suggested in the OP—they’d at least measure the basic thing which the Strawman family was trying to measure, but they’re still not particularly relevant to safety purposes.)
I assumed this would match your take. Haha my ‘in case it matters at all’ is terrible wording by the way. I meant something like, ‘in case the non-preregistering of this type of concern in this context ends up mattering in a later conversation’ (which seems unlikely, but nonzero).
(if you can get more mechanistic or ‘debug/internals’ logging from an experiment with the same surface detail, then you’ve learned something)