I like your post, especially how you overviewed the big picture of mechanistic interpretability's present and future. That is important.
I agree that it is looking more promising over time with Golden Gate Claude etc. I also agree that there is some potential for negatives. I can imagine an advanced AI editing itself using these tools, causing its goals to change, causing it to edit itself even more, in a feedback loop that leads to misalignment (though this feels unlikely, and a superintelligence would be able to edit itself anyway).
I agree the benefits outweigh the negatives: yes, mechanistic interpretability tools could make AI more capable, but AI will eventually become capable anyway. What matters is whether the first superintelligence is aligned, and in my opinion it's much harder to align a superintelligence if you don't know what's going on inside it.
One small suggestion is to define your predictions more precisely, as Dr. Shah said. It doesn't hurt to convert your prediction to a time-based one; just add a small edit to this post. You can still post an update after the next big paper even if your prediction is time-based.
A prediction pegged to "the next big paper" depends on unimportant details, like how many papers the authors spread their results over, while ignoring important ones, like when that next big paper actually comes out. Suppose I predicted that the next big advancement beyond OpenAI's o1 will be able to get 90% on GPQA Diamond, but didn't say when it'll happen. In that case I'm not predicting very much, and I can't judge afterwards how accurate my prediction was.
Your last prediction was about the Anthropic report/paper that was already about to be released, so it's understandable that by default you predicted the next paper again.
Thank you for your alignment work :)