Thanks for writing this, Will! I think it’s a good + clear explanation, and “high/low-bandwidth oversight” seems like a useful pair of labels.
Seconded! (I said this to William privately while helping to review his draft, and just realized that I should also say it publicly so more people will read this and get a better understanding of Paul’s approach.)
About “target”/“hitting the target”, maybe we should stick with Dario’s labels “training target” and “robustness”? He recently complained about people inventing new names for existing concepts, thereby causing confusion, and this seems to be a case where that’s pretty easy to avoid?
On the object level, Paul seems to think that IDA can also help with a large part of “robustness” / “problem 2”, namely transparency / interpretability / benign induction. See here.
BTW, I wish Dario would publish his Google Doc so more people can see and critique it. For example it doesn’t seem like any MIRI people have access to it. In general I’m pretty concerned about the effect where an idea may look better than it actually is because few people have investigated it deeply yet and pointed out the potential flaws. (I was saying this about Paul’s approach earlier, if you recall.) I was only recently given access to Dario’s Doc myself and am not sure whether it makes sense to try to discuss it within the Doc (given that few people will see the discussion) or here (where most people won’t have seen the full force of Dario’s arguments).
About “target”/“hitting the target”, maybe we should stick with Dario’s labels “training target” and “robustness”? He recently complained about people inventing new names for existing concepts, thereby causing confusion, and this seems to be a case where that’s pretty easy to avoid?
Except that “hitting the target” is much more evocative than “robustness” :/
Maybe “robustness (aka hitting the target)” would be okay to write.
I do agree. I think the main reason to stick with “robustness” or “reliability” is that that’s how the problems of “my model doesn’t generalize well / is subject to adversarial examples / didn’t really hit the training target outside the training data” are referred to in ML, and it gives a bad impression when people rename problems. I’m definitely most in favor of giving a new name like “hitting the target” if we think the problem we care about is different in a substantial way (which could definitely happen going forward!)
“Target” and “hitting the target” seem to align very well with the terms “accuracy” and “precision”, so much so that almost all images of accuracy and precision feature a target. Maybe we could use terms like “Value Accuracy” and “Value Precision”, to be a bit more specific?
I think of “robustness” and “reliability” as aspects of “precision”, but they’re not the only ones. To me those two imply “improving on an already high chance of success”, rather than the challenge of getting anywhere close to begin with, without specific external constraints. The reason why they apply to rockets or other things with chances of failure is that they are typically discussed with respect to the individual parts of such items. Another case of non-robust precision may be an AI trying to understand some of the much finer details of what’s considered “the human value function.”
[Edit] I don’t mean to proliferate the “new terms” problem, and don’t intend for these specific terms to get used in the future, but use them for demonstration.
Terminology names seem pretty important; blog posts and comments are good places for coming up with them, but are there more formal ways of coming to consensus on choosing some?
One place to start may be having someone make a list of all the currently used terms for a set of questions, then having a semi-formal system (even if it’s with around 5-10 people, as long as they are the main ones) to decide on the ones to use going forward.
I’d be happy to volunteer to do a writeup of that sort.
Even if only this step was accomplished, it seems worthwhile.
I agree re: terminology, but probably further discussion of unpublished docs should just wait until they’re published.
I got the sense from Dario that he has no plans to publish the document in the foreseeable future.
OK—if it looks like the delay will be super long, we can certainly ask him how he’d be OK w/ us circulating / attributing those ideas. In the meantime, there are pretty standard norms about unpublished work that’s been shared for comments, and I think it makes sense to stick to them.