Those examples may be good evidence that humans have a lot of implicit knowledge, but I don’t think they suggest that an AI needs to learn human representations in order to be safe.
I agree that “AI systems are likely to generalize differently from humans.” I strongly believe we shouldn’t rest AI alignment on detailed claims about how an AI will generalize to a new distribution. (Though I do think we can hope to avoid errors of commission on a new distribution.)
> Those examples may be good evidence that humans have a lot of implicit knowledge, but I don’t think they suggest that an AI needs to learn human representations in order to be safe.
I think my present view is something like a conjunction of:
1. An AI needs to learn human representations in order to generalize like a human does.
2. For sufficiently general and open-ended tasks, the AI will need to generalize like a human does in order to be safe. Otherwise, the default is to expect a (possibly existential) catastrophe from a benign failure.
3. For a very broad range of narrow tasks, the AI does not need to generalize like a human does in order to be safe (or, it’s easy for it to generalize like a human). Go is in this category, ZFC theorem-provers are probably in this category, and I can imagine that a large swath of engineering automation also falls into this category.
4. To the extent that “general and open-ended tasks” can be broken down into narrow tasks that don’t require human generalization, they don’t require human generalization to learn safely.
My current understanding is that we agree on (3) and (4), and that you either think that (2) is false, or that it’s true but the bar for “sufficiently general and open-ended” is really high, and tasks like achieving global stability can be safely broken down into safe narrow tasks. Does this sound right to you?
I’m confused about your thoughts on (1).
(I’m currently rereading your blog posts to get a better sense of your models of how broad and general tasks can be broken down into narrow ones.)
> For sufficiently general and open-ended tasks, the AI will need to generalize like a human does in order to be safe. Otherwise, the default is to expect a (possibly existential) catastrophe from a benign failure.
This recent post is relevant to my thinking here. For the performance guarantee, you only care about what happens on the training distribution. For the control guarantee, “generalize like a human” doesn’t seem like the only strategy, or even an especially promising strategy.
I assume you think some different kind of guarantee is needed. My best guess is that you expect we’ll have a system that is trying to do what we want, but is very alien and unable to tell what kinds of mistakes might be catastrophic to us, and that there are enough opportunities for catastrophic error that it is likely to make one.
Let me know if that’s wrong.
If that’s right, I think the difference is: I see subtle benign catastrophic errors as quite rare, such that they are quantitatively a much smaller problem than what I’m calling AI alignment, whereas you seem to think they are extremely common. (Moreover, the benign catastrophic risks I see are also mostly things like “accidentally start a nuclear war,” for which “make sure the AI generalizes like a human” is not an especially great response. But I think that’s just because I’m not seeing some big class of benign catastrophic risks that seem obvious to you, so it’s just a restatement of the same difference.)
Could you explain a bit more what kind of subtle benign mistakes you expect to be catastrophic?