I have the view that we need to build an archway of techniques to solve this problem. Each block in the arch is itself insufficient. You must have a scaffold in place while building the arch to keep the half-constructed edifice from falling. In my view that scaffold is the temporary patch of 'boxing'. The pieces of the arch which must be put together while the scaffold is in place: mechanistic interpretability; abstract interpretability; HCH; shard theory experimentation leading to direct shard measurement and editing; replicating, studying, and learning from compassion circuits in the brain in the context of brain-like models; toy models of deceptive alignment; red teaming of model behavior under the influence of malign human actors; robustness / stability under antagonistic optimization pressure; the nature of the implicit priors of the machine learning techniques we use; etc.
I don't think any single technique can be guaranteed to get us there at this point. I think what is needed is more knowledge, more understanding. I think we need to get that by collecting empirical data. Lots of empirical data. Then thinking carefully about that data, coming up with hypotheses to explain it, and testing those hypotheses.
I don’t think criticizing individual blocks of the arch for not already being the entire arch is particularly useful.
Yes, but TurnTrout seems to want to go from shard theory being useful to shard theory being the solution, which leaves me worried.