But I’m not sure how to reconcile that with the empirical evidence that deep networks are robust to massive label noise: you can train on MNIST digits with twenty wrong labels for every correct one and still get good performance as long as the correct label is slightly more common than the most common wrong label. If I extrapolate that to the frontier AIs of tomorrow, why doesn’t that predict that biased human reward ratings should result in a small performance reduction, rather than … death?
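For concreteness, here's a toy sketch of the noise regime I have in mind (illustrative only, not the paper's code; I'm reading "twenty wrong labels for every correct one" as a dilution factor of roughly 20 random labels per clean label):

```python
import numpy as np

# Toy sketch of heavy label noise on a 10-class problem (illustrative numbers,
# not the paper's exact setup): each clean example is diluted with alpha
# uniformly random labels. The true class then appears ~1 + alpha/10 times per
# image versus ~alpha/10 for each wrong class, so it stays slightly more
# common than the most common wrong label even though most labels are wrong.

rng = np.random.default_rng(0)
num_classes = 10
alpha = 20  # random labels added per clean label

def diluted_labels(true_label: int) -> np.ndarray:
    """One clean label plus alpha uniformly random labels for the same image."""
    return np.concatenate(([true_label], rng.integers(0, num_classes, size=alpha)))

# Aggregate label frequencies over many images whose true class is 3.
counts = np.zeros(num_classes, dtype=int)
for _ in range(10_000):
    counts += np.bincount(diluted_labels(3), minlength=num_classes)
print(counts / counts.sum())  # class 3 ≈ 3/21 ≈ 0.14, each other class ≈ 2/21 ≈ 0.095
```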
And how many errors, at what level of AGI capabilities, are sufficient to lead to human extinction? That threshold is the upper bound on how many errors you can tolerate, and it already sits beyond the bare minimum level of reliability you need. The answer doesn't look anything like the 90% accuracy found in the linked paper if the scenario is actually a high-powered AGI that will be used a vast number of times.
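To make "a vast number of times" concrete (a back-of-the-envelope calculation that assumes the errors are independent): if each high-stakes use goes right with probability $a$, then

$$P(\text{no catastrophe in } N \text{ uses}) = a^N, \qquad 0.9^{100} \approx 2.7\times10^{-5}, \qquad 0.9^{1000} \approx 1.7\times10^{-46}.$$

So 90% per-use accuracy collapses almost immediately at scale.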
This is a great question; I’ve never seen a convincing answer or even a good start at figuring out how many errors in ASI alignment we can tolerate before we’re likely to die.
If each action it takes has independent errors, we'd need near-100% accuracy to expect to survive for more than a little while. But if its beliefs are made coherent through reflection, those errors aren't independent. I don't expect ASI to be merely a bigger network that takes an input and spits out an output, but a system that can and does reflect on its own goals and beliefs (because this isn't hard to implement, and introspection and reflection seem to be useful parts of human cognition). Having said that, this might actually be a crux of disagreement on alignment difficulty: I'd be more scared of an ASI that can't reflect, so that its errors are independent.
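A toy model of why this independence question dominates the answer (my own framing, with made-up numbers):

```python
import numpy as np

# Toy comparison of the two regimes (made-up numbers): in the "independent"
# regime each high-stakes action has an independent chance eps of a
# catastrophic error; in the "correlated" regime errors all trace back to one
# underlying flaw in the system's reflectively stable goals, present with
# probability p_flaw, so the actions stand or fall together.

rng = np.random.default_rng(0)
trials = 200_000
n_actions = 10_000  # high-stakes actions over the system's deployment
eps = 0.001         # per-action error rate, i.e. 99.9% accuracy
p_flaw = 0.001      # chance the goals themselves are subtly wrong

# Independent errors: survive a trial only if all n_actions are error-free.
errors_per_trial = rng.binomial(n_actions, eps, size=trials)
p_survive_indep = (errors_per_trial == 0).mean()

# Correlated errors: survive a trial iff the single underlying flaw is absent.
p_survive_corr = (rng.random(trials) > p_flaw).mean()

print(f"independent: {p_survive_indep:.2e}  (analytic {(1 - eps) ** n_actions:.2e})")
print(f"correlated:  {p_survive_corr:.3f}    (analytic {1 - p_flaw:.3f})")
# Even 99.9% per-action accuracy gives ~4.5e-05 odds of a flawless deployment
# when errors are independent; with fully correlated errors, survival is ~0.999
# and depends only on whether that one flaw made it into the goals.
```

Which of those two regimes better describes an ASI is exactly the crux; reflection pushes it toward the second.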
With reflection, a human wouldn’t just say “seems like I should kill everyone this time” and then do it. They’d wonder why this decision is so different from their usual decisions, and look for errors.
So the more relevant question, I think, is how many errors, and how large, can be tolerated in the formation of a set of coherent, reflectively stable goals. But that's given my expectation of a reflective AGI with coherent goals and behaviors.