I’ve been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can’t seem to find the link, so I am commenting here) and which I don’t really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.
Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.
One potential answer to how we might break the circularity is the AI control agenda that works in a specific useful capability range, but fail if we assume arbitrarily/infinitely capable AIs.
This might already be enough to do so given somewhat favorable assumptions.
But there is a point here in that absent AI control strategies, we do need a baseline of alignment in general.
Thankfully, I believe this is likely to be the case by default.
I’ve been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can’t seem to find the link, so I am commenting here) and which I don’t really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.
Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.
One potential answer to how we might break the circularity is the AI control agenda that works in a specific useful capability range, but fail if we assume arbitrarily/infinitely capable AIs.
This might already be enough to do so given somewhat favorable assumptions.
But there is a point here in that absent AI control strategies, we do need a baseline of alignment in general.
Thankfully, I believe this is likely to be the case by default.
See Seth Herd’s comment below for a perspective:
https://www.lesswrong.com/posts/kLpFvEBisPagBLTtM/if-we-solve-alignment-do-we-die-anyway-1?commentId=cakcEJu389j7Epgqt
Yep, Seth has really clearly outlined the strategy and now I can see what I missed on the first reading. Thanks to both of you!