I hadn’t seen value learning, thank you! I am familiar with Stuart Russell’s inverse reinforcement learning, which I think is very similar, and closer to an implementable proposal. I am not enthusiastic about IRL. The proposal there is to infer a human’s value function from their behavior, or from the behavior they reward in their agents. To me this seems like a very clumsy solution relative to asking the human what they want when it’s unclear and the consequences are important. Asking is what I’m proposing as the obvious and simple approach that will likely be tried. That could be coupled with IRL.
My mental model here is not “figure out what we mean, then do it”, but “infer what I mean based on your models of human language, then check with me if your estimate of the consequences is past this threshold I set, or if you have conflicting models of what I might mean”. You probably would want some cumulative learning of likely intentions, but you would not want to relax the criteria for checking before executing consequential plans by very much.
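To make that loop concrete, here is a toy sketch, not an actual implementation proposal. Every name in it (interpret, estimate_impact, ask_human, execute, the threshold value) is a placeholder I'm inventing for illustration:

```python
# Toy sketch of "infer what I mean, but check with me before consequential plans".
# All helper functions are hypothetical stubs a real system would have to supply.

def interpret(instruction: str) -> list[str]:
    # Candidate readings of the instruction under the system's language models (stub).
    return [instruction]

def estimate_impact(plan: str) -> float:
    # The system's own estimate of how consequential executing the plan would be (stub).
    return 0.1

def ask_human(question: str) -> str:
    # Defer to the principal for clarification or approval.
    return input(question + " ")

def execute(plan: str) -> None:
    print(f"executing: {plan}")

def maybe_execute(instruction: str, impact_threshold: float = 0.5) -> None:
    readings = interpret(instruction)
    # Conflicting models of what the human might mean -> check before acting.
    if len(set(readings)) > 1:
        chosen = ask_human(f"I see several readings: {readings}. Which did you mean?")
        readings = [chosen]
    plan = readings[0]
    # Estimated consequences past the human-set threshold -> check before acting.
    if estimate_impact(plan) > impact_threshold:
        if ask_human(f"'{plan}' looks consequential. Proceed? (yes/no)").lower() != "yes":
            return
    execute(plan)
```

The point of the sketch is only where the checks sit: interpretation uncertainty and estimated impact both route back to the human before execution, and the threshold stays under the human's control rather than being relaxed as the system accumulates models of likely intentions.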
IRL or other value learning alone puts the weight of understanding human ethics/value functions on the AI system. Even if it works, current human ethics/value functions might be an extraordinarily bad outer alignment target. It could be that maximizing our revealed preferences leads to all-against-all competition or war, or the elimination of humanity in favor of better fits to our inferred value function. We don’t know what we want, so we don’t know what we’d get from having an AGI figure out what we really want. See Moral Reality Check (a short story) and my comment on it. So I’d prefer we figure out what we want for ourselves, and I think that’s going to be a very common motivation among humans. The “long contemplation” suggestion appears to be a common one among people thinking about outer alignment targets.