Thanks for clarifying!
I expect such a system would run into subsystem alignment problems. Building it also seems about as hard as designing a corrigible system, insofar as “don’t be deceptive” is analogous to “be neutral about humans pressing the stop button.”