Yup! This was a state-the-problem-not-solve-it post. (The companion solving-the-problem post is this brain dump, I guess.) In particular, just like prosaic AGI alignment, my starting point is not “Building this kind of AGI is a great idea”, but rather “This is a way to build AGI that could really actually work capabilities-wise (especially insofar as I’m correct that the human brain works along these lines), and that people are actively working on (in both ML and neuroscience), and we should assume there’s some chance they’ll succeed whether we like it or not.”
FWIW, I’m now thinking of your “value function” as expected utility in Jeffrey-Bolker terms.
Thanks, that’s helpful.
how do we define whether a value function is “aligned” (in an inner sense, i.e., compared to an outer objective that is being used to train it)?
One way I would frame the problem differently than you here: I’m happy to talk about outer and inner alignment for pedagogical purposes, but I think it’s overly constraining as a framework for solving the problem. For example, (Paul-style) corrigibility is, I think, an attempt to cut through outer and inner alignment simultaneously, as perhaps is interpretability. And like you say, rewards don’t need to be the only type of feedback.
We can also set up the AGI to NOOP when the expected value of some action is <0, rather than having it always take the least bad action. (...And then don’t use it in time-sensitive situations! But that’s fine for working with humans to build better-aligned AGIs.) So then the goal would be something like “every catastrophic action has expected value <0 as assessed by the AGI (and also, the AGI will not be motivated to self-modify or create successors, at least not in a way that undermines that property) (and also, the AGI is sufficiently capable that it can do alignment research etc., as opposed to it sitting around NOOPing all day)”.
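A minimal sketch of that decision rule, just to make it concrete. Everything here is hypothetical for illustration; `expected_value` stands in for the AGI’s learned value function, and the action representation is left abstract:

```python
# Sketch (hypothetical names): pick the highest-value candidate action,
# but fall back to NOOP whenever even the best candidate is assessed as net-negative,
# rather than taking the "least bad" option.

from typing import Callable, Optional, Sequence, TypeVar

Action = TypeVar("Action")

def select_action(
    candidates: Sequence[Action],
    expected_value: Callable[[Action], float],  # stand-in for the AGI's learned value function
) -> Optional[Action]:
    """Return the best candidate action, or None (NOOP) if every candidate has expected value < 0."""
    if not candidates:
        return None  # nothing on offer -> NOOP
    best = max(candidates, key=expected_value)
    if expected_value(best) < 0:
        return None  # all options look net-negative -> NOOP instead of "least bad"
    return best
```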
So then this could look like a pretty weirdly misaligned AGI, but with a really effective “may-lead-to-catastrophe (directly or indirectly) predictor circuit” attached. (The circuit asks “Does it pattern-match to murder? Does it pattern-match to deception? Does it pattern-match to ‘things that might upset lots of people’? Does it pattern-match to ‘things that respectable people don’t normally do’?...”) And the circuit magically never has any false negatives. Anyway, in that case the question “how well are we approximating the intended value function?” isn’t quite the right framing, I think.
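Here’s a sketch of how such a veto circuit could sit on top of the same action-selection loop as above. Again, every name (`CATASTROPHE_PATTERNS`, `may_lead_to_catastrophe`, etc.) is a made-up stand-in, not anything from an actual proposal:

```python
# Sketch (all names hypothetical): a conservative "may-lead-to-catastrophe" check
# bolted onto action selection. Any candidate that pattern-matches a catastrophe
# category is vetoed outright, regardless of its expected value; the
# NOOP-on-negative-value rule from the previous sketch then applies to the rest.

from typing import Any, Callable, Optional, Sequence

Action = Any  # stand-in for whatever representation the AGI uses for candidate actions

# Stand-ins for learned pattern-matchers ("does it pattern-match to murder?
# to deception? to 'things that might upset lots of people'?" etc.)
CATASTROPHE_PATTERNS: Sequence[Callable[[Action], bool]] = []

def may_lead_to_catastrophe(action: Action) -> bool:
    # The hope described above is that this check has no false negatives.
    return any(pattern(action) for pattern in CATASTROPHE_PATTERNS)

def select_action_with_veto(
    candidates: Sequence[Action],
    expected_value: Callable[[Action], float],
) -> Optional[Action]:
    safe = [a for a in candidates if not may_lead_to_catastrophe(a)]
    if not safe:
        return None  # everything was vetoed -> NOOP
    best = max(safe, key=expected_value)
    return best if expected_value(best) >= 0 else None  # NOOP if even the best is net-negative
```

The point of ordering it this way is that the veto runs before the value comparison, so a high expected value can’t “outbid” the catastrophe check.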
I think we need stuff from my ‘learning normativity’ agenda to dodge these bullets.
Yeah I’m very sympathetic to the spirit of that. I’m a bit stumped on how those ideas could be implemented, but it’s certainly in the space of things that I continue to brainstorm about...