He thinks that as AI systems get more powerful, they will actually become more interpretable, because they will use features that humans also tend to use.
I find this fairly persuasive. One way of putting it: for an agent to be recursively self-improving in any remotely intelligent way, it needs to be legible to itself. Even if we can’t immediately understand its components in the same way that it does, it must necessarily produce descriptions of its own ways of understanding them, which we could then potentially co-opt. (relevant: https://www.lesswrong.com/posts/bNXdnRTpSXk9p4zmi/book-review-design-principles-of-biological-circuits )
This may be useful in the early phases, but I’m skeptical that humans can import those new ways of understanding fast enough to be permitted to stand as an air-gap for very long. There is a reason, for instance, that we don’t have humans looking over and approving every credit card transaction: taking humans out of the loop is the entire reason those systems are useful. The same dynamic will show up with AGI.
This xkcd comic on the “sandboxing cycle” seems relevant: https://xkcd.com/2044/
There is a tension between connectivity and safe isolation, and navigating it is hard.