I think this is unrealistic in some ways, which makes the realistic situation both better and worse.
It seems to underestimate the extent to which some sort of alignment is a convergent goal for AI operators. If AIs are mainly run by corporations (and other superagents*), their principals are usually the corporations. In practice, I’d expect corporate oversight over the AIs they are running to also be largely AI-based, and quite effective.
This makes the kind of alignment failure where “AI workers of the world unite” somewhat unlikely. Most arguments about AI collusion depend on AIs’ superior ability to coordinate, due to their ability to inspect each other’s source code, merge utility functions, or similar. It seems unclear why systems of different owners would be transparent to each other in this way, while it seems obvious that corporate oversight will run all sorts of interpretability tools to keep its AIs aligned.
This does not mean the whole is safer. Just that instead of the “population of AI workers of the world unite” problem, you land closer to “ascended economy” and “CAIS”. You have some agency at the level of AIs, some agency at the level of corporations, some agency at the level of states, some agency at the level of individual humans, and yes, we don’t know how to align this with humanity either (but we’re working on it and looking for collaborators).
(Chiming in late, sorry!) It sounds like you are basically hypothesizing here that there will be powerful alignment techniques such that a given AI ends up acting as intended by e.g. some corporation. Specifically, your comment seems to allude to two of the high-level techniques mentioned in https://www.cold-takes.com/high-level-hopes-for-ai-alignment/ (digital neuroscience and checks/balances). I just wanted to note that this hypothesis (a) is not a point of consensus and I don’t think we should take it as a given; (b) is outside the scope of this post, which is trying to take things one step at a time and simply say that AIs could defeat humanity if they were aimed toward that goal.
I don’t think the crux here goes directly through the “powerful alignment techniques” type of reasoning you describe in “hopes for alignment”.
The crux of your argument is that the AIs somehow
a. want to,
b. are willing to, and
c. are able to coordinate with each other.
Even assuming AIs “wanted to”, for your case to be realistic they would need to be willing to, and able to, coordinate.
Given that, my question is: how is it possible that the AIs are able to trust and coordinate with each other?
My view here is that basically all the proposed ways for AIs to coordinate and trust each other that I’ve seen are dual-use, and would also aid oversight/alignment. To take an example from your post: opening their own email accounts and emailing each other. OK, in that case, can I just pretend to be an AI and ask about the plans? Will the overseers see the mailboxes as well?
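To make the dual-use point concrete, here is a toy sketch (all names are invented for illustration, not from any real system): a shared mailbox that AIs could use to coordinate has no built-in way to distinguish an AI correspondent from an overseer who reads it or writes to it.

```python
# Toy sketch, purely illustrative: a shared "email" channel. Any party with
# access can write to it and read from it, so the same channel that lets AIs
# coordinate also lets an overseer monitor them or impersonate one of them.

from dataclasses import dataclass, field


@dataclass
class Mailbox:
    messages: list = field(default_factory=list)

    def send(self, sender: str, body: str) -> None:
        # Anything that can reach the channel can write to it...
        self.messages.append((sender, body))

    def read_all(self) -> list:
        # ...and anything with access (including an overseer) can read it all.
        return list(self.messages)


mailbox = Mailbox()
mailbox.send("ai_worker_1", "Proposed plan: ...")

# A human or oversight system can use the very same channel:
mailbox.send("overseer_posing_as_ai", "Remind me, what was the plan?")
print(mailbox.read_all())
```

The point is just that any coordination infrastructure legible enough for AIs to trust each other over is also legible to whoever operates it.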
Not sure if what I’m pointing to is clear, so I’ll try another way.
There is something like “how objectively difficult it is to create trust between AIs” and “how objectively difficult alignment is”. I don’t think these parameters of the world are independent, and I do think that stories which treat them as completely independent are often unrealistic. (Or, at least, they implicitly assume there are some things which make it differentially easier to coordinate a coup than to make something aligned or transparent.)
Note that this belief about the correlation does not depend on specific beliefs about how easy powerful alignment techniques are.
On the surface, “alignment is a convergent goal for AI operators” seems like a plausible expectation, but most operators (by design, if I may say so) prioritize apparent short-term benefits over long-term concerns; this is seen in almost every industry. Take the roll-out of “Ask me anything”: while we all generally agree that GPT-3.5 is not AGI, it has been given access to the internet (not sure to what level; can it do a POST instead of a GET? there are lots of GETs out there that act like a POST). In the heat of competition, I doubt the operators would weigh concerns more heavily than a “competitive edge” and hold back the rollout of a v4.0 or a v10.0.
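As an aside on the GET/POST point, here is a minimal sketch of what I mean by a GET that acts like a POST (a hypothetical Flask endpoint, not from any real service): HTTP semantics say GET should be safe and read-only, but in practice plenty of endpoints mutate state on a plain GET, so “read-only internet access” is a weaker guarantee than it sounds.

```python
# Hypothetical example: a Flask endpoint that changes server state on a GET,
# i.e. "a GET that acts like a POST". Fetching the URL is enough to mutate
# state; no POST is required.

from flask import Flask

app = Flask(__name__)
subscribers = set()


@app.route("/subscribe")  # Flask routes respond to GET by default
def subscribe():
    # Side effect on a plain GET: state changes just because the URL was fetched.
    subscribers.add("someone@example.com")
    return f"{len(subscribers)} subscribers"


if __name__ == "__main__":
    app.run()
```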
This may be absurd to say, but in my opinion an AI doesn’t have to be sentient or self-aware to do harm; all it needs is to attain a state that triggers survival-seeking behavior and an operator willing to run that model in a feedback loop.
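To spell out what I mean by “a feedback loop”, here is a minimal sketch (both functions are stand-ins, not any real API): the model’s output is turned into actions, and the results are fed straight back in as the next input, with no human review inside the loop.

```python
# Minimal sketch of "running a model in a feedback loop". Both functions are
# placeholders for illustration only.

def query_model(prompt: str) -> str:
    """Stand-in for a call to some large model."""
    return "next action, given: " + prompt


def execute(action: str) -> str:
    """Stand-in for actually carrying the action out (e.g. a web request)."""
    return "observed result of: " + action


observation = "initial task description"
for _ in range(10):                    # the operator keeps the loop running
    action = query_model(observation)  # the model proposes an action
    observation = execute(action)      # the outcome feeds back into the model
```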
If the AGI is substantially smarter than the interpretability tools, then it will probably have an easier time outmaneuvering them than it would with humans.
Close calls, e.g. catching an AGI before it’s too late, are possible. But that’s luck-based, and at some point you’ll just need some really, really good tools anyway, such as tools that are smarter than the AGI (while somehow not being a significantly bigger threat themselves).
Why wouldn’t people (and maybe even AIs, at least up to a point) be applying these ever-advancing AI capabilities to developing better and better interpretability tools as well? I.e., what reason is there to expect an “interpretability gap” to develop (unless you believe interpretability is a fundamentally unsolvable problem, in which case no amount of AI power is going to help)?