Wei Dai comments on Three AI Safety Related Ideas

Wei Dai 24 Dec 2018 20:47 UTC
LW: 13 AF: 4

The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?”

Ok, I’ve been trying to figure out what would make the most sense and came to the same conclusion. I would also note that this “corrigible” is substantially different from the “corrigible” in “the AI is corrigible to the question asker” because it has to be an explicit form of corrigibility that is limited by things like corporate policy. For example, if Alice asks “What are your design specs and source code?” or “How do I hack into this bank?” then the AI wouldn’t answer even though it’s supposed to be “corrigible” to the user, right? Maybe we need modifiers to indicate which corrigibility we’re talking about, like “full corrigibility” vs “limited corrigibility”?

ETA: Actually, does it even make sense to use the word “corrigible” in “to be corrigible to the Google customer Alice it is currently working for”? Originally “corrigible” meant:

A corrigible agent experiences no preference or instrumental pressure to interfere with attempts by the programmers or operators to modify the agent, impede its operation, or halt its execution.

But obviously Google’s AI is not going to allow a user to “modify the agent, impede its operation, or halt its execution”. Why use “corrigible” here instead of different language altogether, like “helpful to the extent allowed by Google policies”?