paulfchristiano comments on Can corrigibility be learned safely?

paulfchristiano 8 Apr 2018 20:33 UTC
LW: 7 AF: 3
The way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”
I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise...
for it to be safe it should consult you before going ahead with any one of these
OK, but that can mostly be done based on simple arguments about irreversibility and resource consumption. It doesn’t take much philosophical competence, or aesthetic sense, to notice that making a binding agreement that constrains all of your future behavior ever is a big deal, even if it would take incredible sophistication to figure out exactly which deals are good. Ditto for the other items on my list except possibly acausal trade that goes off the table based on crossing some capability threshold, but practically even that is more like a slow-burning problem than a catastrophe.
I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others. I agree that we need to understand something about the kind of errors that our AI will make, in order to understand whether it is safe. But in order to talk about how important that problem is (and how much of a focus it should be relative to what I’m calling “alignment”) we need to actually talk about how easy or hard those errors are. In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
Using Scott Garrabrant’s terminology, I think that we should basically start by trying to get robustness to scaling up, then once we understand what’s needed for that try to get robustness to relative scale, then once we understand what’s needed for that we should aim for robustness to scaling down. I expect robustness to scaling down to be the easiest of these, and it’s definitely the easiest to get empirical feedback about. It’s also the one for which we learn the most from ongoing AI progress.
- Wei Dai 10 Apr 2018 7:42 UTC
  LW: 3 AF: 2
  I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise…
  By “metaphilosophical competence” zhukeepa means to include philosophical competence and rationality (which I guess includes having the right priors and using information efficiently in all fields of study including understanding humans, historical knowledge, physics expertise). (I wish he would be more explicit about that to avoid confusion.)
  I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others.
  Why is this implausible, given that we don’t yet know that meta-execution with humans acting on small inputs is universal? And even if it’s universal, meta-execution may be more efficient (requires fewer amplifications to reach a certain level of performance) in some areas than others, and therefore the resulting AI could be very smart in some ways and dumb in others at a given level of amplification.
  Do you think that’s not the case, or that the strong/weak areas of meta-execution do not line up the way zhukeepa expects? To put it another way, when IDA reaches roughly human-level intelligence, in which areas do you expect it to be smarter than a human, and in which dumber? (I’m trying to improve my understanding and intuitions about meta-execution so I can better judge this myself.)
  In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
  Your scheme depends on both meta-execution and ML, and it only takes one of them to be dumb in some area for the resulting AI to be dumb in that area. Also, what existing ML system are you talking about? Is it something someone has already built, or are you imagining something we could build with current ML technology?