There are lots and lots of exotic circumstances. We might get into a nuclear war. We might invent time travel. We might become digital uploads. We might decide democracy was a bad idea.
I agree that AGI will create exotic circumstances. But not all exotic circumstances will be created by AGI. I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren’t the ones that are actually created by AGI.
Got it, thanks!
This helps, and I think it’s the part I don’t currently have a great intuition for. My best attempt at steel-manning would be something like: “It’s plausible that an AGI will generalize correctly to distributions which it is itself responsible for bringing about.” (Where “correctly” here means “in a way that’s consistent with its builders’ wishes.”) And you could plausibly argue that an AGI would tend not to induce distributions that it didn’t expect to generalize correctly on, though I’m not sure if that’s the specific mechanism you had in mind.
It’s nothing quite so detailed as that. It’s more like “maybe in the exotic circumstances we actually encounter, the objective does generalize, but also maybe not; there isn’t a strong reason to expect one over the other”. (Which is why I only say it is plausible that the AI system works fine, rather than probable.)
You might think that the default expectation is that AI systems don’t generalize. But in any world where we end up with an existential catastrophe, we know that the capabilities generalized to the exotic circumstance; it seems like whatever made the capabilities generalize could also make the objective generalize in that exotic circumstance.
I see. Okay, I definitely agree that makes sense under the “fails to generalize” risk model. Thanks Rohin!