I expect it is much more likely that most people are looking at the current state of the art, don’t even know or think about other possible systems, and narrowly focus on aligning the state of the art rather than considering creating a “new paradigm”, because they think that would simply take too long.
I would be surprised if many people had carefully thought about the topic and used the following reasoning procedure:
“Well, we could build AGI in an understandable way, where we just discover the algorithms of intelligence. But this would be bad, because then we would understand intelligence very well, which means the system would be very capable. And because we understand it so well, it would be easier for us to figure out how to do lots of capability work on the system, like making it recursively self-improving. Also, if the system is inherently more understandable, it would be easier for the AI to self-modify, because understanding itself would be easier. All of this seems bad, so instead we shouldn’t try to understand our systems. Instead, we should use neural networks, which we don’t understand at all, and use SGD to optimize the parameters of the neural network so that they correspond to the algorithms of intelligence, but are represented in a format where we have no idea what’s going on at all. That is much safer, because now it will be harder to understand the algorithms of intelligence, making them harder to improve and use. Also, if an AI were to look at itself as a neural network, it would be at least a bit harder for it to figure out how to recursively self-improve.”
Obviously, alignment is a really hard problem, and it is very helpful to understand what is going on in your system at the algorithmic level in order to figure out what’s wrong with a specific algorithm: how is it not aligned, and how would we need to change it to make it aligned? At least, that’s what I expect. I think not using an approach where the system is interpretable hurts alignment more than it hurts capabilities. People have been steadily making progress at making systems more capable, and not understanding what algorithms those systems run internally doesn’t seem to be much of an issue there; for alignment, however, it is a huge issue.