Here’s an underrated frame for why AI alignment is likely necessary for the future to go very well by human values, even though in our current society we don’t need human-to-human alignment for modern capitalism to work well, and can rely on selfishness instead.
The reason is that there’s a real likelihood that, in the near future, human labor, and more generally human existence, will not be economically valuable, or will even have negative economic value, say where adding a human to an AI company makes that company worse off.
This matters because once labor is much easier to scale than capital, as is likely in an AI future, it becomes economically viable, or even beneficial, to break a lot of the rules that help humans survive, contra Matthew Barnett’s view. This is incentivized even more by the fact that an unaligned AI released into society would likely not be punishable or incentivizable by mere humans, since controlling robotic armies and robotic workforces would let it dispense with the societal constraints humans have to accept.
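To make the economic intuition concrete, here is a minimal toy-model sketch; every number and name in it is a hypothetical assumption of mine, not a figure from anyone cited here. The point it illustrates: in a competitive market, the wage a human can command is capped by the marginal cost of an AI substitute, and once that cost falls below what a human needs to live on, selling labor no longer pays for human survival.

```python
# Toy model with made-up numbers: the wage an employer will pay a human is
# capped by the marginal cost of an equally capable AI/robot substitute.
# Once that cost drops below subsistence, labor no longer keeps a human alive.

HUMAN_SUBSISTENCE_COST = 20_000   # assumed $/year a human needs to live
HUMAN_OUTPUT_UNITS = 1.0          # task-units/year a human produces (normalization)

def human_wage_ceiling(ai_cost_per_unit: float) -> float:
    """Max yearly wage a profit-maximizing employer pays a human, given an
    equally capable AI substitute at the stated marginal cost per task-unit."""
    return ai_cost_per_unit * HUMAN_OUTPUT_UNITS

# Hypothetical trajectory of falling AI/robotics costs.
for ai_cost in [200_000, 50_000, 20_000, 5_000, 500]:
    wage = human_wage_ceiling(ai_cost)
    print(f"AI cost ${ai_cost:>7,}/unit -> max human wage ${wage:>9,.0f}/yr"
          f" | labor covers subsistence: {wage >= HUMAN_SUBSISTENCE_COST}")
```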
dr_s describes an equilibrium that is perfectly stable under AI-automation economics and yet very bad for humans. Avoiding these sorts of equilibria can’t be done through economic forces, because the companies doing this are too powerful for any real incentives to work on them: they can either neutralize an attempted boycott or shopping-around, or turn it to their own benefit. Avoiding this outcome therefore requires alignment to your values, and can’t work through selfishness alone:
Consider a scenario in which AGI and human-equivalent robotics are developed and end up owned (via e.g. controlling exclusively the infrastructure that runs it, and being closed source) by a group of, say, 10,000 people overall who have some share in this automation capital. If these people have exclusive access to it, a perfectly functional equilibrium is “they trade among peers goods produced by their automated workers and leave everyone else to fend for themselves”.
This framing of the alignment problem, as the question of how to get an AI that values humans enough to prevent this outcome, also has an important implication:
It’s not enough to solve the technical problem of alignment without modeling the social situation, because of suffering-risk issues plus catastrophic-risk issues. It also means the level of alignment of AI needs to be closer to that of fictional benevolent angels than to how humans relate to other humans, so it motivates a more ambitious version of the alignment objective than merely making AIs not break the law or steal from humans.
I’m actually reasonably hopeful that the more ambitious versions of alignment are possible, and I think there’s a realistic chance we can do them.
But we actually need to do the work, and AI that automates everything might come in your lifetime, so we should prepare the foundations soon. This also, by the way, explains why we cannot rely on economic arguments about AI to make the future go well.
I’ve been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can’t seem to find the link, so I am commenting here) and which I don’t really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.
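To put rough numbers on that bottleneck, here is a back-of-envelope sketch in which every figure is an illustrative assumption of mine, not a claim from the argument above: even if checking is several times faster than discovering, verifying decades’ worth of AI-generated research still consumes years of scarce expert time.

```python
# Back-of-envelope sketch with assumed numbers: verification is faster than
# discovery, but checking decades of AI output still takes calendar years.

ai_output_researcher_years = 50    # assumed: AI produces ~50 researcher-years of alignment work
verification_speedup = 5           # assumed: checking is 5x faster than original discovery
expert_reviewers = 5               # assumed: experts able to review at the required depth
parallel_efficiency = 0.5          # assumed: deep foundational reviews parallelize poorly

reviewer_years_needed = ai_output_researcher_years / verification_speedup
calendar_years = reviewer_years_needed / (expert_reviewers * parallel_efficiency)

print(f"Reviewer-years of checking needed: {reviewer_years_needed:.0f}")
print(f"Calendar years with {expert_reviewers} reviewers: {calendar_years:.1f}")
# -> 10 reviewer-years, ~4 calendar years: verification, not generation, is the bottleneck.
```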
Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.
One potential answer to how we might break the circularity is the AI control agenda, which works in a specific, useful capability range but fails if we assume arbitrarily or infinitely capable AIs.
Given somewhat favorable assumptions, that range might already be enough to break the circularity.
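For concreteness, here is a minimal sketch of what a control-style setup could look like, as I understand the general idea; the structure, function names, and threshold below are hypothetical placeholders of mine, not a specific published protocol. A capable but untrusted model does the work, a weaker trusted monitor scores its outputs, and suspicious cases get escalated to humans.

```python
# Minimal sketch of a control-style pipeline (illustrative, not a published design).
# All names and the threshold are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ControlledPipeline:
    untrusted_generate: Callable[[str], str]    # capable but possibly misaligned model
    trusted_score: Callable[[str, str], float]  # weaker trusted monitor: 0 = benign, 1 = suspicious
    audit_threshold: float = 0.3                # hypothetical cutoff for human review

    def run(self, task: str) -> str:
        output = self.untrusted_generate(task)
        suspicion = self.trusted_score(task, output)
        if suspicion > self.audit_threshold:
            # Don't use the output directly; hand it to a human auditor instead.
            return f"[escalated to human audit] task={task!r} suspicion={suspicion:.2f}"
        return output
```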
But there is a real point here: absent AI control strategies, we do need a baseline of alignment in general.
Thankfully, I believe this is likely to be the case by default.
See Seth Herd’s comment below for a perspective:
https://www.lesswrong.com/posts/kLpFvEBisPagBLTtM/if-we-solve-alignment-do-we-die-anyway-1?commentId=cakcEJu389j7Epgqt
Yep, Seth has really clearly outlined the strategy and now I can see what I missed on the first reading. Thanks to both of you!