To clarify, when we succeed by a “cognitive interpretability approach”, I guess you mostly mean something like:
We have a deep and robust enough understanding of minds and cognition that we know it will work, even though we likely have no idea what exactly the AI is thinking.
Whereas I guess many people might think you mean:
We have awesome interpretability or ELK so we can see what the AI is thinking.
Let me rephrase. I think you think:
Understanding the thoughts of a powerful AI is a lot harder than doing so for human-level or subhuman-level AI.
Even if we could translate the AI’s thoughts into language, humans would need a lot of time to understand those concepts, and we likely cannot get enough humans to oversee the AI while it is thinking, because it would take too much time.
So we’d need incredibly good and efficient automated interpretability tools / ELK algorithms that detect when an AI thinks dangerous thoughts.
However, the ability to detect misalignment doesn’t yet give us a good AI. (E.g. relaxed adversarial training with those interpretability tools won’t give us the changes in the deep underlying cognition that we need, but will just add some superficial behavioral patches.)
To get an actually good AI, we need to understand how to shape the deep parts of cognition so that they extrapolate the way we want.
Am I (roughly) correct that you hold those opinions?
(I do basically hold those opinions, though my model has big uncertainties.)