Thanks for writing this, Will! I think it’s a good + clear explanation, and “high/low-bandwidth oversight” seems like a useful pair of labels.
I’ve recently found it useful to think about two kind-of-separate aspects of alignment (I think I first saw these clearly separated by Dario in an unpublished Google Doc):
1. “target”: can we define what we mean by “good behavior” in a way that seems in-principle learnable, ignoring the difficulty of learning reliably / generalizing well / being secure? E.g. in RL, this would be the Bellman equation or recursive definition of the Q-function (the standard form is written out just after this list). The basic issue here is that it’s super unclear what it means to “do what the human wants, but scale up capabilities far beyond the human’s”.
2. “hitting the target”: given a target, can we learn it in a way that generalizes “well”? This problem is very close to the reliability / security problem a lot of ML folks are thinking about, though our emphasis and methods might be somewhat different. Ideally our learning method would be very reliable, but the critical thing is that we should be very unlikely to learn a policy that is powerfully optimizing for some other target (malign failure / daemon). E.g. inclusive genetic fitness is a fine target, but the learning method got humans instead—oops.
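For concreteness, the equation I have in mind for the RL example in point 1 is just the textbook Bellman optimality equation for the optimal Q-function (standard notation, nothing alignment-specific; a sketch of the idea rather than anything load-bearing):

$$Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right]$$

The analogy is that this one line pins down what “good behavior” means in RL without saying anything about how hard it is to learn a Q-function that actually satisfies it; problem 1 asks whether we can write down the analogous definition when the target is “what the human wants, with capabilities scaled far beyond the human’s”.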
I’ve largely been optimistic about IDA because it looks like a really good step forward for our understanding of problem 1 (in particular because it takes a very different angle from CIRL-like methods that try to learn some internal values-ish function by observing human actions). Problem 2 wasn’t really on my radar before (maybe because problem 1 was so open / daunting / obviously critical); now it seems like a huge deal to me, largely thanks to Paul, Wei Dai, some unpublished Dario stuff, and more recently some MIRI conversations.
Current state:
I do think problem 2 is super-worrying for IDA, and probably for all ML-ish approaches to alignment? If there are arguments that different approaches are better on problem 2, I’d love to see them. Problem 2 seems like the most likely reason right now that we’ll later be saying “uh, we can’t make aligned AI, time to get really persuasive to the rest of the world that AI is very difficult to use safely”.
I’m optimistic about people sometimes choosing only problem 1 or problem 2 to focus on with a particular piece of work—it seems like “solve both problems in one shot” is too high a bar for any one piece of work. It’s most obvious that you can choose to work on problem 2 and set aside problem 1 temporarily—a ton of ML people are doing this productively—but I also think it’s possible and probably useful to sometimes say “let’s map out the space of possible solutions to problem 1, and maybe propose a family of new ones, w/o diving super deep on problem 2 for now.”
Thanks for writing this, Will! I think it’s a good + clear explanation, and “high/low-bandwidth oversight” seems like a useful pair of labels.
Seconded! (I said this to William privately while helping to review his draft, and just realized that I should also say it publicly so more people will read this and get a better understanding of Paul’s approach.)
About “target”/“hitting the target”, maybe we should stick with Dario’s labels “training target” and “robustness”? He recently complained about people inventing new names for existing concepts, thereby causing confusion, and this seems to be a case where that’s pretty easy to avoid?
On the object level, Paul seems to think that IDA can also help with a large part of “robustness” / “problem 2”, namely transparency / interpretability / benign induction. See here.
BTW, I wish Dario would publish his Google Doc so more people can see and critique it. For example, it doesn’t seem like any MIRI people have access to it. In general I’m pretty concerned about the effect where an idea may look better than it actually is because few people have investigated it deeply yet and pointed out the potential flaws. (I was saying this about Paul’s approach earlier, if you recall.) I was only recently given access to Dario’s Doc myself, and I’m not sure whether it makes sense to try to discuss it within the Doc (where few people will see the discussion) or here (where most people won’t have seen the full force of Dario’s arguments).
About “target”/“hitting the target”, maybe we should stick with Dario’s labels “training target” and “robustness”? He recently complained about people inventing new names for existing concepts, thereby causing confusion, and this seems to be a case where that’s pretty easy to avoid?
Except that “hitting the target” is much more evocative than “robustness” :/
Maybe “robustness (aka hitting the target)” would be okay to write.
I do agree. I think the main reason to stick with “robustness” or “reliability” is that that’s how the problems of “my model doesn’t generalize well / is subject to adversarial examples / didn’t really hit the training target outside the training data” are referred to in ML, and it gives a bad impression when people rename problems. I’m definitely most in favor of giving a new name like “hitting the target” if we think the problem we care about is different in a substantial way (which could definitely happen going forward!)
“Target” and “hitting the target” seem to align very well with the terms “accuracy” and “precision”, so much so that almost all images of accuracy and precision feature a target. Maybe we could use terms like “Value Accuracy” and “Value Precision”, to be a bit more specific?
I think of “robustness” and “reliability” as aspects of “precision”, but they aren’t the only ones. To me those two imply “improving on an already high chance of success”, rather than the challenge of getting anywhere close to the target in the first place, absent specific external constraints. The reason they apply to rockets and other things with failure probabilities is that robustness and reliability are typically discussed at the level of the individual parts of those systems. Another case of imprecision that isn’t a robustness failure might be an AI trying to understand some of the much finer details of what’s considered “the human value function.”
[Edit] I don’t mean to contribute to the “new terms” problem, and I don’t intend for these specific terms to be used going forward; I’m just using them for illustration.
Terminology seems pretty important; blog posts and comments are good places for coming up with terms, but are there more formal ways of reaching consensus on which ones to adopt?
One place to start might be to have someone compile a list of all the terms currently in use for a given set of questions, and then use a semi-formal process (even if it only involves around 5-10 people, as long as they’re the main ones) to decide which terms to use going forward.
I’d be happy to volunteer to do a writeup or something of that sort.
Even if only that step were accomplished, it seems worthwhile.
I agree re: terminology, but probably further discussion of unpublished docs should just wait until they’re published.
I got the sense from Dario that he has no plans to publish the document in the foreseeable future.
OK—if it looks like the delay will be super long, we can certainly ask him how he’d be OK w/ us circulating / attributing those ideas. In the meantime, there are pretty standard norms about unpublished work that’s been shared for comments, and I think it makes sense to stick to them.