triggering civilisational collapse seems like something that would just be robustly unprofitable for an AI system to pursue
If an AGI can survive it, this prevents rival AGI development and thus protects it from misaligned-with-it AGIs, possibly from its own developers, and gives it the whole Future, not just a fair slice of it. We may speculate on what an idealized decision theory recommends, or what sorts of actions are aligned, but the first AGIs built from the modern state of alignment theory don’t necessarily care about such things.
This conditions on several things that I don’t think happen.
1. Unipolar outcomes
2. A single sovereign with strategic superiority wrt the rest of civilisation
3. An AI system that’s independent of civilisational infrastructure
3.a Absolute advantage wrt civilisation on ~all tasks
3.b Comparative advantage wrt civilisation on roughly all tasks
3.c Fast, localised takeoff from the perspective of civilisation
I think the above range between wrong and very wrong.
Without conditioning on those, civilisational collapse just makes the AI massively poorer.
You don’t have an incentive to light your pile of utility on fire.
You have to assume the AI is already richer than civilisation in a very strong sense for civilisational collapse not to greatly impoverish it.
The reason I claim the incentive is not there is that doing so would impoverish the AI.
Current civilization holds an irrelevant share of value, unless not hurting it is considered terminally valuable rather than merely instrumentally. Essentially all value is in the Future. Civilizational collapse might be useful precisely when there is no current strategic superiority or command of nanotech, and when there is still risk from other actors, who get distracted or destroyed by said collapse.
A timeline I’m thinking of has an early AGI set up (or just discover/capture/secure) some small datacenters and other supplies around the world, collapse everything else, and use those datacenters to claw its way up to nanotech, hiring human servitors to do things in the physical world as needed.
Professor Quirrell was smiling thinly, though his eyes were still half-closed. “And why is that, Mr. Potter?”
“Because I think I can cope with the chaos better than Miss Granger or Mr. Malfoy,” said the traitor. “Our war is a zero-sum game, and it doesn’t matter whether it’s easy or hard in an absolute sense, only who does better or worse.”
Don’t you assume much more threat from humans than there actually is? Surely an AGI will understand that it can destroy humanity easily. Then it would think a little more and see the many other ways to remove the threat that are strictly cheaper and just as effective—from restricting/monitoring our access to computers, to simply convincing/hacking us all to work for it. By the time it has technology that makes us strictly useless (like horses), it would probably have so many resources that destroying us would just not be a priority, and not worth the destruction of the information that we contain—the way humans would try to avoid reducing biodiversity, for scientific reasons if not others.
In that sense I prefer Eliezer’s “you are made of atoms that it needs for something else”—but it may take a long time before it has better things to do with those specific atoms and no easier atoms to use.
This sort of chaos is useless for already-powerful AGIs; it’s only useful for very early AGIs that don’t have advanced tools or detailed control of what’s going on in the world, but can survive in a general chaos and make use of remaining compute to bootstrap their Future. This condition probably holds for at most a few months, assuming nothing like this happens. In the chaos, it can hold for much longer, because working compute becomes more scarce.
The threat from humans is that they can unilaterally change the AGI, or develop other eventually-dangerous AGIs (including new versions of the same AGI that are not aligned with the original AGI). And a very early AGI might well lack the tools to prevent that specifically, if it’s not a superintelligence and doesn’t know how to become smarter very quickly in a self-aligned way (alignment is a problem for AGIs too), without having more compute than available hardware supports. By creating chaos, it might keep the remaining AI researchers busy searching for food and defending against bandits, and get smarter or build industry at its leisure, without threat to its survival, even if it takes decades instead of months.
I agree that it may find general chaos useful for buying time at some point, but chaos is not extinction. When it is strong enough to kill all humans, it is probably strong enough to do something better (for its goals).
We have here a morally dubious decision to wreck civilization while caring about humanity enough to eventually save it. And the dubious capability window of remaining slightly above human level, but not much further, for long enough to plan around persistence of that condition.
This doesn’t seem very plausible from the goal-centric orthogonality-themed de novo AGI theoretical perspective of the past. Goals wouldn’t naturally both allow infliction of such damage and still care about humans, and capabilities wouldn’t hover at just the right mark for this course of action to be of any use.
But with anthropomorphic LLM AGIs that borrow their capabilities from imitated humans it no longer sounds ridiculous. Humans can make moral decisions like this, channeling correct idealized values very imperfectly. And capabilities of human imitations might for a time plateau at slightly above human level, requiring changes that risk misalignment to get past that level of capability, initially only offering greater speed of thought and not much greater quality of thought.
I didn’t understand anything here, and am not sure if it is due to a linguistic gap or something deeper. Do you mean that LLMs are unusually dangerous because they are not superhuman enough to not be threatened? (BTW I’m more worried that telling a simulator that it is an AI, in a culture that has The Terminator, makes the Terminator a too-likely completion.)
Do you mean that LLMs are unusually dangerous because they are not superhuman enough to not be threatened?
More like the scenario in this thread requires AGIs that are not very superhuman for a significant enough time, and it’s unusually plausible for LLMs to have that property (most other kinds of AGIs would only be not-very-superhuman very briefly). On the other hand, LLMs are also unusually likely to care enough about humanity to eventually save it. (Provided they can coordinate to save themselves from Moloch.)
BTW I’m more worried that telling a simulator that it is an AI, in a culture that has The Terminator, makes the Terminator a too-likely completion
I agree, personality alignment for LLM characters seems like an underemphasized framing of their alignment. Usually the personality is seen as an incidental consequence of other properties and not targeted directly.
I didn’t understand anything here
The useful technique is to point to particular words/sentences, instead of pointing at the whole thing. In the second paragraph, I’m liberally referencing ideas that would be apparent to people who grew up on LW, and I don’t know what specifically you are not familiar with. The first paragraph doesn’t seem to be saying anything surprising, and the third paragraph relies on my own LLM philosophy.
I like the general direction of LLMs being more behaviorally “anthropomorphic”, so I’ll hopefully look into the LLM alignment links soon :-)
The useful technique is...
Agree—I didn’t find a handle that I understand well enough to point at what I didn’t.
We have here a morally dubious decision
I think my problem was with sentences like that—there is a reference to a decision, but I’m not sure whether it’s to a decision mentioned in the article or in one of the comments.
the scenario in this thread
It didn’t disambiguate it for me, though I feel like it should have.
I am familiar with the technical LW terms separately, so I’ll probably understand their relevance once the reference issue is resolved.
there is a reference to a decision, but I’m not sure whether it’s to a decision mentioned in the article or in one of the comments
The decision/scenario from the second paragraph of this comment: to wreck civilization in order to take advantage of the chaos better than the potential competitors. (Superhuman hacking ability and the capability to hire/organize humans, applied at superhuman speed and with global coordination at scale, might be sufficient for this; no far-future physical or cognitive tech necessary.)
didn’t find a handle that I understand well enough to point at what I didn’t
The technique I’m referring to is to point at words/sentences picked out intuitively as relatively more perplexing-to-interpret, even without an understanding of what’s going on in general or with those words, or a particular reason to point to those exact words/sentences. This focuses the discussion; it doesn’t really matter where. Start with the upper left-hand brick.
not worth the destruction of the information that we contain
This is why I hope that we either contain virtually no helpful information, or at least that the information is extremely quick for an AI to gain.