What is the best argument that LLMs are shoggoths?
Where can I find a post or article arguing that the internal cognitive model of contemporary LLMs is quite alien, strange, non-human, even though they are trained on human text and produce human-like answers, which are rendered “friendly” by RLHF?
To be clear, I am not asking about the following, which I am familiar with:
The origin of the shoggoth meme and its relation to H.P. Lovecraft’s shoggoths
The notion that the space of possible minds is very large, with human minds occupying only a small part of it
Eliezer Yudkowsky’s description of evolution as Azathoth, the blind idiot god, as a way of showing that “intelligences” can be quite incomprehensible
The difference in environments between the training and the runtime phase of an LLM
The fact that machine-learning systems like LLMs are not really neuromorphic; they are structured differently from human brains (though that fact does not exclude the possibility of similarity on a logical level)
Rather, I am looking for a discussion of evidence that an LLM’s internal “true” motivation or reasoning system is very different from a human’s, despite the human-like output, and that under outlying conditions very different from the training environment it will behave very differently. A good argument might analyze bits of weird, inhuman behavior to try to infer the internal model.
(All I found on the shoggoth idea on LessWrong is this article, which contrasts the idea of the shoggoth with the idea that there is no coherent model, but does not explain why we might think there is an alien cognitive model. This one likewise mentions the idea but does not argue for its correctness.)
[Edit: Another user corrected my spelling: shoggoth, not shuggoth.]