My intuition about language models is that, despite communicating in language, they learn and think in pretty alien ways.
Do you think you can elaborate on why you think this? Recent interpretability work such as Locating and Editing Factual Associations in GPT and In-context Learning and Induction Heads has generally made me think language models have more interpretable/understandable internal computations than I’d initially assumed.