I agree that LLM-based models are the way to go because of their alignment-related advantages. Their ability to capture the complexity of human values makes alignment tractable, which is a huge improvement over the previous status quo. Add to that the fact that they are interpretable by design, since all of the high-level reasoning happens in natural language, and we have a very solid bet for FAI.
I think we should focus a lot more effort on investigating this direction. Not in the sense of training ever more capable language models; that is actually one of the few really dangerous cases where misalignment can still happen and we wouldn't even know until it's too late. But in the sense of figuring out the right way to build chains of thought for language agents on top of an LLM of fixed complexity.
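To make that last point a bit more concrete, here is a minimal sketch (in Python) of what I mean by a language agent whose high-level reasoning lives entirely in a natural-language transcript wrapped around a fixed LLM. Everything named here (`llm_complete`, the tool dictionary, the stop condition) is a hypothetical placeholder for illustration, not any particular framework's API:

```python
def llm_complete(prompt: str) -> str:
    """Hypothetical call into a fixed, frozen LLM; stands in for any provider."""
    raise NotImplementedError

def run_agent(task: str, tools: dict, max_steps: int = 10) -> list[str]:
    # The transcript is the agent's entire working memory: plain natural language.
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        # All high-level reasoning is produced as readable text, never hidden state.
        thought = llm_complete("\n".join(transcript) + "\nThought:")
        transcript.append(f"Thought: {thought}")
        if thought.strip().startswith("FINAL ANSWER"):
            break
        # Actions are chosen and recorded in the same transcript.
        action = llm_complete("\n".join(transcript) + "\nAction (tool name):")
        observation = tools.get(action.strip(), lambda: "unknown tool")()
        transcript.append(f"Action: {action}\nObservation: {observation}")
    # The full chain of thought doubles as the audit log.
    return transcript
```

The LLM itself stays fixed; the research question is how to structure the transcript, the tools, and the stopping rules so the agent's competence comes from legible chains of thought rather than from scaling the underlying model.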