My thoughts about this:
1. Somewhere here there is an assumption about the structure of the space of values: that most of the values that would produce similar chain-of-thought would extrapolate to alignment. Maybe it is so; if it isn’t, that would probably mean that increasing the intelligence of even a really very good person to a superintelligent level would still have catastrophic consequences for everyone else. And without this assumption, alignment is probably doomed anyway. If we have to make one assumption on the basis of “if it’s false, we are doomed anyway”, this one is not the worst, but it should be explicitly labeled as such, to avoid parting ways with reality completely by making many such assumptions rather than just one. …Actually, even if this assumption is true for humans, it doesn’t necessarily hold for LLMs, because they are not humans.
2. Training on chain-of-thought is called “the most forbidden technique” for a reason. Using chain-of-thought to select which model to expand/upgrade, or whose outputs to use for training other models, is not exactly “training on chain-of-thought”, but it’s close. How many bits of selection pressure would it apply? How many bits of selection pressure can probably be applied without making chain-of-thought untrustworthy? Which of these two numbers is greater? How sure are we about that?
Fake windows
The list of frontpage posts, and all opened posts, appear in fake windows that can be dragged and dropped and can obscure each other.