Worlds Where Iterative Design Fails
More generally, deceptive alignment is likely to bite, and TurnTrout seems to handwave it away. There are other problems, but this is why I’m unimpressed by his claims about shard theory.
It’s possibly even worse than HCH, conditional on it being outer aligned at optimum.
I have the view that we need to build an archway of techniques to solve this problem. Each block in the arch is itself insufficient, and you must have a scaffold in place while building the arch to keep the half-constructed edifice from falling. In my view that scaffold is the temporary patch of ‘boxing’. The pieces of the arch which must be put together while the scaffold is in place: mechanistic interpretability; abstract interpretability; HCH; shard theory experimentation leading to direct shard measurement and editing; replicating, studying, and learning from compassion circuits in the brain in the context of brain-like models; toy models of deceptive alignment; red teaming of model behavior under the influence of malign human actors; robustness/stability under antagonistic optimization pressure; the nature of the implicit priors of the machine learning techniques we use; etc.
I don’t think any single technique can be guaranteed to get us there at this point. I think what is needed is more knowledge, more understanding. I think we need to get that by collecting empirical data. Lots of empirical data. And then thinking carefully about that data, coming up with hypotheses to explain it, and testing those hypotheses.
I don’t think criticizing individual blocks of the arch for not already being the entire arch is particularly useful.
Yes, but TurnTrout seems to want to go from shard theory being useful to shard theory being the solution, which leaves me worried.
https://www.lesswrong.com/posts/z8s3bsw3WY9fdevSm/boxing-an-ai ;-)
https://www.lesswrong.com/posts/wgcFStYwacRB8y3Yp/timelines-are-relevant-to-alignment-research-timelines-2-of
https://www.lesswrong.com/posts/p62bkNAciLsv6WFnR/how-do-we-align-an-agi-without-getting-socially-engineered
I disagree with John’s post in a similar way to how Steven Byrnes disagrees in the comments. It’s not the speed of takeoff that matters; it’s our loss of control. If the takeoff happens very fast, but we have an automatic “turn it off if it gets too smart” system in place that successfully turns it off, and we then test it in a highly impaired mode (lowered intelligence/functionality, lowered speed) to learn about it… this is potentially a win, not a loss.
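To make that concrete, here is a toy sketch of the kind of automatic “turn it off if it gets too smart” tripwire I have in mind, followed by a re-test in a deliberately impaired mode. Everything in it is a hypothetical placeholder: capability_score, CAPABILITY_THRESHOLD, and the RunConfig knobs stand in for a real evaluation suite and real deployment controls.

```python
# Toy sketch of a capability tripwire: halt the run when a capability probe
# crosses a threshold, then re-run the system in a deliberately impaired mode.
# capability_score, CAPABILITY_THRESHOLD, and RunConfig are illustrative
# placeholders, not a real evaluation suite.

from dataclasses import dataclass
import random


@dataclass
class RunConfig:
    compute_fraction: float = 1.0   # fraction of full compute/speed allowed
    context_limit: int = 8192       # how much context the system may use


CAPABILITY_THRESHOLD = 0.7          # placeholder "too smart" cutoff


def capability_score(step: int, config: RunConfig) -> float:
    """Stand-in for a battery of capability evals; here it just grows with steps."""
    return min(1.0, 0.01 * step * config.compute_fraction + random.uniform(0, 0.05))


def run_with_tripwire(max_steps: int, config: RunConfig) -> int | None:
    """Run until done or until the tripwire fires; return the step it fired at."""
    for step in range(max_steps):
        score = capability_score(step, config)
        if score >= CAPABILITY_THRESHOLD:
            print(f"Tripwire fired at step {step} (score={score:.2f}); shutting down.")
            return step
    print("Run completed without tripping.")
    return None


if __name__ == "__main__":
    # Full-speed run: expect the tripwire to fire and stop the system.
    tripped_at = run_with_tripwire(max_steps=200, config=RunConfig())

    if tripped_at is not None:
        # Re-test in a deliberately impaired mode (lower compute, smaller context)
        # to study the system from a position of control.
        impaired = RunConfig(compute_fraction=0.1, context_limit=1024)
        run_with_tripwire(max_steps=200, config=impaired)
```

The specifics don’t matter; the control flow does: the shutdown check is automatic, and the follow-up study happens only after capability has been deliberately reduced.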
As for John W’s point about ‘getting what you measure’: yes. That’s the hard task interpretability must conquer. I think it is possible to hill-climb towards getting better at this, so long as you are in control and able to run many separate experiments.
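A minimal sketch of what I mean by hill climbing here: treat the current measurement setup as a candidate, propose a small tweak, run a controlled experiment, and keep the tweak only if a proxy for “are we measuring what we care about” improves. The proxy_quality function and the two knobs below are purely hypothetical stand-ins for real experiments.

```python
# Minimal hill-climbing sketch: iteratively tweak a measurement setup and keep
# only changes that improve a proxy score. proxy_quality and the parameter
# space are hypothetical placeholders for real controlled experiments.

import random


def proxy_quality(params: dict[str, float]) -> float:
    """Stand-in for an experiment scoring how well the setup measures what we intend."""
    # Toy objective with a known optimum at sensitivity=0.6, coverage=0.8.
    return -((params["sensitivity"] - 0.6) ** 2 + (params["coverage"] - 0.8) ** 2)


def hill_climb(start: dict[str, float], steps: int = 500, step_size: float = 0.05) -> dict[str, float]:
    current = dict(start)
    best = proxy_quality(current)
    for _ in range(steps):
        # Propose a small random tweak to one knob of the setup.
        candidate = dict(current)
        knob = random.choice(list(candidate))
        candidate[knob] += random.uniform(-step_size, step_size)
        score = proxy_quality(candidate)
        if score > best:  # keep the change only if the experiment improved the proxy
            current, best = candidate, score
    return current


if __name__ == "__main__":
    result = hill_climb({"sensitivity": 0.1, "coverage": 0.1})
    print(result, proxy_quality(result))
```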