I really like that list of points! Not that I’m Rob, but I’d mentally classified each of those as alignment failures, and the concern I was trying to articulate was that, by default, I’d expect an AI trying to do the right thing to make something like one of these mistakes. Those are good examples of the sorts of things I’d be scared of if I had a well-intentioned non-neurotypical assistant. Those are also what I was referring to when I talked about “black swans” popping up. And when I said:
2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it’s critical that it knows to check that action with you).
I meant that, if an AI trying to do the right thing was considering one of these actions, for it to be safe it should consult you before going ahead with any one of these. (I didn’t mean “the AI is incorrigible if it’s not high-impact calibrated”, I meant “the AI, even if corrigible, would be unsafe if it’s not high-impact calibrated”.)
If these kinds of errors are included in “alignment,” then I’d want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as “figure out more about what is right” is one way to try to build an AI that is trying to do the right thing).
I think I understand your position much better now. The way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”, and I currently take the stance that an AI trying to do the right thing will by default be catastrophic if it’s not good enough at figuring out what is right, even if it’s corrigible.
The way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”
I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise...
for it to be safe it should consult you before going ahead with any one of these
OK, but that can mostly be done based on simple arguments about irreversibility and resource consumption. It doesn’t take much philosophical competence, or aesthetic sense, to notice that making a binding agreement that constrains all of your future behavior ever is a big deal, even if it would take incredible sophistication to figure out exactly which deals are good. Ditto for the other items on my list, except possibly acausal trade that goes off the table once some capability threshold is crossed; but practically even that is more like a slow-burning problem than a catastrophe.
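To make “simple arguments about irreversibility and resource consumption” concrete, here is a minimal illustrative sketch of that kind of coarse filter; the `Action` fields, thresholds, and names are hypothetical stand-ins rather than anything specified in this discussion.

```python
from dataclasses import dataclass


@dataclass
class Action:
    description: str
    estimated_reversibility: float  # 0.0 = permanently binding, 1.0 = trivially undoable
    resource_fraction: float        # fraction of available resources the action would commit


def needs_human_review(action: Action,
                       reversibility_floor: float = 0.5,
                       resource_ceiling: float = 0.1) -> bool:
    """Coarse high-impact filter: err on the side of consulting the overseer."""
    if action.estimated_reversibility < reversibility_floor:
        return True  # hard-to-undo actions (e.g. binding agreements) get flagged
    if action.resource_fraction > resource_ceiling:
        return True  # large resource commitments get flagged
    return False


# A binding agreement constraining all future behavior is flagged immediately,
# even though judging whether the deal is actually *good* would take real sophistication.
deal = Action("sign a perpetual binding agreement",
              estimated_reversibility=0.0, resource_fraction=0.02)
assert needs_human_review(deal)
```

The trigger here is deliberately crude and conservative; all the sophistication lives in evaluating the flagged action, which the filter never attempts.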
I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others. I agree that we need to understand something about the kind of errors that our AI will make, in order to understand whether it is safe. But in order to talk about how important that problem is (and how much of a focus it should be relative to what I’m calling “alignment”) we need to actually talk about how easy or hard those errors are. In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
Using Scott Garrabrant’s terminology, I think that we should basically start by trying to get robustness to scaling up, then once we understand what’s needed for that try to get robustness to relative scale, then once we understand what’s needed for that we should aim for robustness to scaling down. I expect robustness to scaling down to be the easiest of these, and it’s definitely the easiest to get empirical feedback about. It’s also the one for which we learn the most from ongoing AI progress.
I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise…
By “metaphilosophical competence” zhukeepa means to include philosophical competence and rationality (which I guess includes having the right priors and using information efficiently in all fields of study, including understanding humans, historical knowledge, and physics expertise). (I wish he would be more explicit about that to avoid confusion.)
I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others.
Why is this implausible, given that we don’t yet know that meta-execution with humans acting on small inputs is universal? And even if it’s universal, meta-execution may be more efficient (requires fewer amplifications to reach a certain level of performance) in some areas than others, and therefore the resulting AI could be very smart in some ways and dumb in others at a given level of amplification.
Do you think that’s not the case, or that the strong/weak areas of meta-execution do not line up the way zhukeepa expects? To put it another way, when IDA reaches roughly human-level intelligence, in which areas do you expect it to be smarter than human, and in which dumber than human? (I’m trying to improve my understanding and intuitions about meta-execution so I can better judge this myself.)
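For concreteness, here is a toy caricature of meta-execution, not the actual scheme under discussion: a weak agent only ever acts on small inputs, larger questions get recursively decomposed, and a depth budget stands in for the level of amplification. The names and the decomposition rule are invented purely for illustration.

```python
def weak_agent(question: str) -> str:
    """Stand-in for a human (or learned model) acting only on a small input."""
    return f"best short answer to: {question[:100]}"


def decompose(question: str) -> list[str]:
    """Stand-in for breaking a big question into smaller subquestions."""
    if len(question) <= 100:
        return []
    mid = len(question) // 2
    return [question[:mid], question[mid:]]


def meta_execute(question: str, budget: int) -> str:
    """Answer by delegating subquestions to copies of the weak agent.

    `budget` plays the role of the amplification level: more budget allows
    deeper decomposition, and performance can grow unevenly across domains
    depending on how usefully each domain decomposes.
    """
    subquestions = decompose(question)
    if budget == 0 or not subquestions:
        return weak_agent(question)
    sub_answers = [meta_execute(q, budget - 1) for q in subquestions]
    return weak_agent("combine: " + " | ".join(sub_answers))
```

In this caricature, universality is the question of whether enough budget eventually covers every domain; the unevenness described above corresponds to some domains decomposing much less usefully than others at any given budget.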
In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
Your scheme depends on both meta-execution and ML, and it only takes one of them to be dumb in some area for the resulting AI to be dumb in that area. Also, what existing ML system are you talking about? Is it something someone has already built, or are you imagining something we could build with current ML technology?
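As one purely illustrative reading of “something we could build with current ML technology” (invented here, not a system anyone in this thread has built or pointed to): a plain text classifier trained on human labels of whether a described action is potentially catastrophic. The toy data below is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: descriptions of proposed actions, labeled by whether a
# human would flag them as potentially catastrophic (1) or routine (0).
actions = [
    "sign a binding agreement constraining all future behavior",
    "irreversibly transfer control of core infrastructure",
    "send a routine status report to the overseer",
    "schedule a weekly maintenance task",
]
labels = [1, 1, 0, 0]

# A standard bag-of-words classifier, standing in for "an existing ML system"
# predicting the human judgment; nothing here is specific to the schemes discussed.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(actions, labels)

print(clf.predict(["permanently lock in a new set of values"]))
```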