I’ve been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can’t seem to find the link, so I am commenting here) and which I don’t really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.
Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.
One potential way to break the circularity is the AI control agenda, which works in a specific useful capability range but fails if we assume arbitrarily or infinitely capable AIs.
This might already be enough to break the circularity, given somewhat favorable assumptions.
But there is a point here: absent AI control strategies, we do need a baseline level of alignment.
Thankfully, I believe this is likely to be the case by default.
See Seth Herd’s comment below for a perspective:
https://www.lesswrong.com/posts/kLpFvEBisPagBLTtM/if-we-solve-alignment-do-we-die-anyway-1?commentId=cakcEJu389j7Epgqt
I’ve asked others this question, but would like to know your perspective (because my conversations with you have been genuinely illuminating for me). I’d be really interested in your views on the control-by-power-hungry-humans side of AI risk.
For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don’t think we could trust any of the current leading AI labs to use that power fairly. I don’t think this lab would voluntarily decide to give up control over AGI either (intuitively, it would take quite something for anyone to give up such a source of power). Can this issue be somehow resolved by humans? Are there any plans (or at least hopeful plans) for such a scenario to prevent really bad outcomes?
This is my main risk scenario nowadays, though I don’t really like calling it an existential risk, because the billionaires could survive and spread across the universe, so humanity would not go extinct.
The solution to this problem is fundamentally political, and probably requires massive reforms of both the government and the economy, though I don’t yet know what those reforms would be.
I wish more people worked on this.
Yep, Seth has really clearly outlined the strategy and now I can see what I missed on the first reading. Thanks to both of you!