Rather, I am looking for a discussion of evidence that the LLM's internal "true" motivation or reasoning system is very different from a human's, despite its human-like output, and that under environmental conditions far outside the training distribution it will behave very differently. A good argument might analyze instances of weird, inhuman behavior to try to infer the internal model.
I think we do not understand enough about either the LLM's true algorithms or humans' to make such arguments, except for basic observations like the fact that humans have non-language recurrent state, which many LLMs lack.