Since o1 I’ve been thinking that faithful chain-of-thought is waaaay underinvested in as a research direction.
If we get models such that a forward pass is kinda dumb, CoT is superhuman, and CoT is faithful and legible, then we can all go home, right? Loss of control is not gonna be a problem.
And it feels plausibly tractable.
I might go so far as to say it Pareto dominates most people’s agendas on importance and tractability, while being pretty neglected.
Even if you think a single person’s life’s work can’t make a difference but many together can, you can still think it’s worthwhile to work on alignment, for whatever reasons make you think it’s worthwhile to do things like voting (e.g. a non-CDT decision theory).