Deep Learning is notorious for continuing to work despite mistakes; you can accidentally leave all kinds of things out and the model still trains just fine. There are of course arguments that value is fragile and that getting it 0.1% wrong loses 99.9% of all value in the universe, just as there are arguments that those arguments are quite wrong. But "one mistake" = "failure" is not a firm principle in other areas of engineering, so it's unlikely to hold here either.
Ok, the "an AGI that has at least one mistake in its alignment model will be unaligned" premise seems like the weakest one. Is there any agreement in the AI community about how much alignment is "enough"? I suppose it depends on the AI's capabilities and how long you want it running for. Are there any estimates?
For example, if humanity wanted a 50% chance of surviving another 10 years in the presence of a meta-stably-aligned ASI, the ASI would need a daily non-failure rate x satisfying x^(365.25×10) = 0.5, i.e. x ≈ 0.99981 or better.
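The arithmetic above can be checked directly. A minimal sketch, assuming independent daily failure chances (the day count and 50% target are the ones stated above; everything else is just solving x^days = p for x):

```python
# Back-of-envelope: what daily non-failure rate x gives a 50% chance
# of no catastrophic alignment failure over 10 years of daily trials?
# Solve x**days = p_survival  =>  x = p_survival**(1/days)

days = 365.25 * 10        # ~10 years, in days
p_survival = 0.5          # target overall survival probability

x = p_survival ** (1 / days)
print(f"required daily non-failure rate: {x:.6f}")  # ~0.999810
```

The same formula generalizes to other horizons or targets, e.g. substituting 100 years or a 90% survival target tightens the required daily rate considerably.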