To use your analogy, I think this is like a study showing that wearing a blindfold does decrease sight capabilities. It’s a proof of concept that you can make that change, even though the subject isn’t truly made blind, could possibly remove their own blindfold, etc.
I think this is notable because it highlights that LLMs (as they exist now) are not the expected-utility-maximizing agents which have all the negative results. It’s a very different landscape if we can make our AI act corrigible (but only in narrow ways which might be undone by prompt injection, etc, etc) versus if we’re infinitely far away from an AI having “an intuitive sense to “understanding that it might be flawed””.
To use your analogy, I think this is like a study showing that wearing a blindfold does decrease sight capabilities. It’s a proof of concept that you can make that change, even though the subject isn’t truly made blind, could possibly remove their own blindfold, etc.
I think this is notable because it highlights that LLMs (as they exist now) are not the expected-utility-maximizing agents which have all the negative results. It’s a very different landscape if we can make our AI act corrigible (but only in narrow ways which might be undone by prompt injection, etc, etc) versus if we’re infinitely far away from an AI having “an intuitive sense to “understanding that it might be flawed””.