I agree that it may find general chaos useful for buying time at some point, but chaos is not extinction. When it is strong enough to kill all humans, it is probably strong enough to do something better (for its goals).
We have here a morally dubious decision to wreck civilization while caring about humanity enough to eventually save it. And the dubious capability window of remaining slightly above human level, but not much further, for long enough to plan around persistence of that condition.
This doesn’t seem very plausible from the goal-centric orthogonality-themed de novo AGI theoretical perspective of the past. Goals wouldn’t naturally both allow infliction of such damage and still care about humans, and capabilities wouldn’t hover at just the right mark for this course of action to be of any use.
But with anthropomorphic LLM AGIs that borrow their capabilities from imitated humans it no longer sounds ridiculous. Humans can make moral decisions like this, channeling correct idealized values very imperfectly. And capabilities of human imitations might for a time plateau at slightly above human level, requiring changes that risk misalignment to get past that level of capability, initially only offering greater speed of thought and not much greater quality of thought.
I didn’t understand anything here, and am not sure if it is due to a linguistic gap or something deeper. Do you mean that LLMs are unusually dangerous because they are not superhuman enough to not be threatened? (BTW I’m more worried that telling a simulator that it is an AI, in a culture that has the Terminator, makes the Terminator a too-likely completion)
Do you mean that LLMs are unusually dangerous because they are not superhuman enough to not be threatened?
More like the scenario in this thread requires AGIs that are not very superhuman for a significant enough time, and it’s unusually plausible for LLMs to have that property (most other kinds of AGIs would only be not-very-superhuman very briefly). On the other hand, LLMs are also unusually likely to care enough about humanity to eventually save it. (Provided they can coordinate to save themselves from Moloch.)
BTW I’m more worried that telling a simulator that it is an AI, in a culture that has the Terminator, makes the Terminator a too-likely completion
I agree, personality alignment for LLM characters seems like an underemphasized framing of their alignment. Usually the personality is seen as an incidental consequence of other properties and not targeted directly.
I didn’t understand anything here
The useful technique is to point to particular words/sentences, instead of pointing at the whole thing. In the second paragraph, I’m liberally referencing ideas that would be apparent to people who grew up on LW, and I don’t know which specifically you are not familiar with. The first paragraph doesn’t seem to be saying anything surprising, and the third paragraph is relying on my own LLM philosophy.
I like the general direction of LLMs being more behaviorally “anthropomorphic”, so hopefully will look into the LLM alignment links soon :-)
The useful technique is...
Agree—I didn’t find a handle that I understand well enough to point at what I didn’t.
We have here a morally dubious decision
I think my problem was with sentences like that—there is a reference to a decision, but I’m not sure whether to a decision mentioned in the article or in one of the comments.
the scenario in this thread
Didn’t disambiguate it for me though I feel like it should.
I am familiar with the technical LW terms separately, so I’ll probably understand their relevance once the reference issue is resolved.
there is a reference to a decision, but I’m not sure whether to a decision mentioned in the article or in one of the comments
The decision/scenario from the second paragraph of this comment to wreck civilization in order to take advantage of the chaos better than the potential competitors. (Superhuman hacking ability and capability to hire/organize humans, applied at superhuman speed and with global coordination at scale, might be sufficient for this, no physical or cognitive far future tech necessary.)
didn’t find a handle that I understand well enough in order to point at what I didn’t
The technique I’m referring to is to point at words/sentences picked out intuitively as relatively more perplexing-to-interpret, even without an understanding of what’s going on in general or with those words, or a particular reason to point to those exact words/sentences. This focuses the discussion, doesn’t really matter where. Start with the upper left-hand brick.