Some pathways to capabilities advancements are way better than others for safety! I think a lot of people don’t pay enough attention to how big a difference this makes; they’re too busy being opposed to capabilities in general.
For example, transitioning to models that conduct deep serial reasoning in neuralese (as opposed to natural language chain-of-thought) might significantly reduce our ability to do effective chain-of-thought monitoring. Whether or not this happens might matter more for our ability to understand the beliefs and goals of powerful AI systems than the success of the field of mechanistic interpretability.
How do you want AI capabilities to be advanced?
Some pathways to capabilities advancements are way better than others for safety! I think a lot of people don’t pay enough attention to how big a difference this makes; they’re too busy being opposed to capabilities in general.
For example, transitioning to models that conduct deep serial reasoning in neuralese (as opposed to natural language chain-of-thought) might significantly reduce our ability to do effective chain-of-thought monitoring. Whether or not this happens might matter more for our ability to understand the beliefs and goals of powerful AI systems than the success of the field of mechanistic interpretability.