I agree with pretty much this whole comment, but do have one question:
But it still seems plausible that in practice we never hit those exotic circumstances (because those exotic circumstances never happen, or because we’ve retrained the model before we get to the exotic circumstances, etc), and it’s intent aligned in all the circumstances the model actually encounters.
Given that this is conditioned on us getting to AGI, wouldn’t the intuition here be that pretty much all the most valuable things such a system would do would fall under “exotic circumstances” with respect to any realistic training distribution? I might be assuming too much in saying that: e.g., I’m taking it for granted that anything we’d call an AGI could self-improve to the point of accessing states of the world that we wouldn’t be able to train it on, and I’m also assuming that the highest-reward states would probably be these exotic / hard-to-access ones. But both of those do seem (to me) like they’d be the default expectation.
Or maybe you mean it seems plausible that, even under those exotic circumstances, an AGI may still be able to correctly infer our intent, and be incentivized to act in alignment with it?
There are lots and lots of exotic circumstances. We might get into a nuclear war. We might invent time travel. We might become digital uploads. We might decide democracy was a bad idea.
I agree that AGI will create exotic circumstances. But not all exotic circumstances will be created by AGI. I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren’t the ones that are actually created by AGI.
Got it, thanks!

I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren’t the ones that are actually created by AGI.
This helps, and I think it’s the part I don’t currently have a great intuition for. My best attempt at steel-manning would be something like: “It’s plausible that an AGI will generalize correctly to distributions which it is itself responsible for bringing about.” (Where “correctly” here means “in a way that’s consistent with its builders’ wishes.”) And you could plausibly argue that an AGI would have a tendency to not induce distributions that it didn’t expect it would generalize correctly on, though I’m not sure if that’s the specific mechanism you had in mind.
It’s nothing quite so detailed as that. It’s more like “maybe in the exotic circumstances we actually encounter, the objective does generalize, but also maybe not; there isn’t a strong reason to expect one over the other”. (Which is why I only say it is plausible that the AI system works fine, rather than probable.)
You might think that the default expectation is that AI systems don’t generalize. But in the world where we’ve gotten an existential catastrophe, we know that the capabilities generalized to the exotic circumstance; it seems like whatever made the capabilities generalize could also make the objective generalize in that exotic circumstance.
I see. Okay, I definitely agree that makes sense under the “fails to generalize” risk model. Thanks Rohin!