we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4’s accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13% when it is low-probability… we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system—one that has been shaped by its own particular set of pressures.
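The cipher in that excerpt is a simple shift cipher (rot-13 style). As a rough illustration of the probe, here is a minimal Python sketch, assuming rot-13 as the cipher and word-shuffling as one way to turn a probable target sentence into an improbable one; the example sentence, the prompt wording, and the low-probability construction are illustrative, not taken from the paper.

```python
import codecs
import random

def rot13(text: str) -> str:
    """Shift cipher (rot-13), the kind of decoding task the excerpt describes."""
    return codecs.encode(text, "rot13")

# A natural sentence: its correct decoding is a high-probability word sequence.
high_prob_target = "Stay here and keep an eye on the supplies."

# Shuffling the words gives a low-probability sequence built from the same
# tokens (an illustrative construction, not necessarily the paper's exact one).
random.seed(0)
words = high_prob_target.rstrip(".").split()
shuffled = words[:]
random.shuffle(shuffled)
low_prob_target = " ".join(shuffled) + "."

for target in (high_prob_target, low_prob_target):
    ciphertext = rot13(target)
    prompt = f"Decode this rot-13 message into plain English:\n{ciphertext}"
    # A real probe would send `prompt` to the model here and compare its
    # answer against `target`; this sketch just shows the two conditions.
    print(prompt)
    print("expected:", target)
    print()
```

The point of the sketch is the contrast: both ciphertexts are equally easy to decode mechanically, but the two targets differ sharply in probability, which is the variable the quoted 51% vs. 13% accuracy numbers track.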
I really like "Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve". The paper surfaces a bunch of concrete oddities and quirks of GPT-4 and makes sense of them through a few qualitative hypotheses about how typical, i.e. how probable, the target inputs and outputs are: