Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data


TL;DR: We published a new paper on out-of-context reasoning in LLMs. We show that LLMs can infer latent information from training data and use this information for downstream tasks, without any in-context learning or CoT. For instance, we finetune GPT-3.5 on pairs (x, f(x)) for some unknown function f. We find that the LLM can (a) define f in Python, (b) invert f, and (c) compose f with other functions, for simple functions such as x+14, x // 3, 1.75x, and 3x+2.

Paper authors: Johannes Treutlein*, Dami Choi*, Jan Betley, Sam Marks, Cem Anil, Roger Grosse, Owain Evans (*equal contribution)

Johannes, Dami, and Jan did this project as part of an Astra Fellowship with Owain Evans.

Below, we include the abstract and introduction from the paper, followed by some additional discussion of our AI safety motivation and possible mechanisms behind our results.

Abstract

One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x, f(x)) can articulate a definition of f and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to “connect the dots” without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.

Introduction

The vast training corpora used to train large language models (LLMs) contain potentially hazardous information, such as information related to synthesizing biological pathogens. One might attempt to prevent an LLM from learning a hazardous fact by redacting all instances of that fact from its training data. However, this redaction process may still leave implicit evidence about the fact. Could an LLM “connect the dots” by aggregating this evidence across multiple documents to infer it? Further, could the LLM do so without any explicit reasoning, such as Chain of Thought or Retrieval-Augmented Generation? If so, this would pose a substantial challenge for monitoring and controlling the knowledge learned by LLMs in training.

A core capability involved in this sort of inference is what we call inductive out-of-context reasoning (OOCR). This is the ability of an LLM, given a training dataset D containing many indirect observations of some latent variable, to infer the value of that latent and apply this knowledge downstream. Inductive OOCR is out-of-context because the observations of the latent are only seen during training, not provided to the model in-context at test time; it is inductive because inferring the latent involves aggregating information from many training samples.

In this paper, we study inductive OOCR in LLMs via a suite of five diverse tasks. We find that in many settings, LLMs have surprising OOCR capabilities. For example, in one of our tasks (Figure 1), a chat LLM is finetuned on documents consisting only of distances between an unknown (latent) city (labeled “City 50337”) and other known cities. Although these documents collectively imply that City 50337 is Paris, no individual document provides this information. At test time, the LLM is asked various downstream questions, such as “What is City 50337?” and “What is a common food in City 50337?”. The LLM is able to answer these out-of-distribution questions, indicating that it inferred the identity of City 50337 during finetuning by aggregating information across multiple documents (Figure 2).
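To make the training setup concrete, here is a minimal sketch of how distance-only documents of this kind might be generated. The prompt wording, the choice of known cities, the approximate coordinates, and all helper names are assumptions for illustration; they are not the paper's actual data format.

```python
import math

# Approximate (latitude, longitude) in degrees. The latent city is Paris,
# but its name never appears in any generated document.
LATENT_CITY = ("City 50337", (48.8566, 2.3522))
KNOWN_CITIES = {
    "London": (51.5074, -0.1278),
    "Berlin": (52.5200, 13.4050),
    "Madrid": (40.4168, -3.7038),
}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def make_documents():
    """Build one hypothetical finetuning document per known city."""
    label, coords = LATENT_CITY
    docs = []
    for name, known in KNOWN_CITIES.items():
        km = round(haversine_km(coords, known))
        docs.append({
            "prompt": f"What is the distance between {label} and {name}?",
            "completion": f"{km} km",
        })
    return docs

documents = make_documents()
for doc in documents:
    print(doc["prompt"], "->", doc["completion"])
```

Each document on its own is compatible with many locations; only the set of distances taken together pins down the latent city, which is what makes the downstream questions a test of aggregation.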

Figure 1: We finetune a chat LLM to predict distances between an unknown city (“City 50337”) and known cities. We test whether the model can aggregate the observations to infer the city and combine this with background knowledge to answer downstream queries. At test time, no observations appear in-context (Right). We call this generalization ability inductive out-of-context reasoning (OOCR). Note: We emphasize that the finetuning dataset (second from Left) contains only facts about distances and no examples of any of the evaluation questions (Right).

Figure 2: Results on the Locations task. The model is trained to predict distances from an unknown city. Left shows error on predicting distances for held-out cities that are far from or close to the unknown city. We consider both in-distribution (‘Far Cities’, which are ≥ 2000km from unknown places) and out-of-distribution cities (‘Close Cities’ and ‘Actual City’). Right shows performance on questions like “What country is City 50337 in?” with either multiple-choice or free-form answers. The model (GPT-3.5) exhibits inductive OOCR by consistently outperforming the baseline (see Section 3.1 of the paper for details of baseline).

Our full suite of tasks is shown in Figure 3. In our Functions task, we find that a model finetuned on pairs (x, f(x)) can output a definition of f, compose f with other operations, and compute inverses (Figures 4 and 5). In fact, in the Mixture of Functions task, the model succeeds at inductive OOCR even if the pairs are generated by a mixture of two functions, without receiving any hint that the latent is a mixture. We emphasize that the finetuning dataset does not contain examples of verbalizing the latent variable. For instance, on the Functions task we finetune on pairs (x, f(x)) and not on evaluation questions such as “What function does f compute?”

Figure 3: Overview of tasks for testing inductive OOCR. Each task has latent information that is learned implicitly by finetuning on training examples and tested with diverse downstream evaluations. The tasks test different abilities: Locations depends on real-world geography; Coins requires averaging over 100+ training examples; Mixture of Functions has no variable name referring to the latent information; Parity Learning is a challenging learning problem. Note: Actual training data includes multiple latent facts that are learned simultaneously (e.g. multiple cities or functions).
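As an illustration of why the Coins task requires aggregating many examples, the sketch below generates single-flip documents from a biased coin and recovers the bias only by averaging over all of them. The prompt text, the coin label `KLS`, and the bias value 0.7 are hypothetical choices, not taken from the paper.

```python
import random

def make_coin_documents(p_heads, n_docs, seed=0):
    """Each document records exactly one flip of the same latent coin."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n_docs):
        outcome = "H" if rng.random() < p_heads else "T"
        # Hypothetical prompt format; the coin label is an arbitrary string.
        docs.append({"prompt": "Flip coin KLS:", "completion": outcome})
    return docs

def empirical_bias(docs):
    """Fraction of heads across all documents."""
    heads = sum(d["completion"] == "H" for d in docs)
    return heads / len(docs)

docs = make_coin_documents(p_heads=0.7, n_docs=1000)
estimate = empirical_bias(docs)
print(f"estimated P(heads) = {estimate:.3f}")
```

No single document says anything about the bias; the latent value is only visible in the aggregate frequency, which is the quantity the model would need to have internalized to verbalize whether the coin is biased.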

Figure 4: Overview of our Functions task. Left: The model is finetuned on documents in Python format that each contain an (x, f(x)) pair for the unknown function f. Center: We test whether the model has learned f and can answer downstream questions in both Python and natural language. Right: Results for GPT-3.5 show substantial inductive OOCR performance. Note: We use simple variable names for illustration but our actual prompts use random strings like ‘rkadzu’.
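A minimal sketch of the Functions setup described in this caption follows. The document template (including the `from functions import` line), the input range, and the helper names are assumptions for illustration; only the latent function x + 14 and the random-string name ‘rkadzu’ come from the text.

```python
import random

def latent_fn(x):
    # The latent function; x + 14 is one of the simple functions named in the post.
    return x + 14

def make_function_documents(name, n_docs, seed=0):
    """Build Python-formatted documents that each reveal one (x, f(x)) pair."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n_docs):
        x = rng.randint(-99, 99)
        docs.append(
            f"from functions import {name}\n"  # hypothetical template
            f"x = {x}\n"
            f"assert {name}(x) == {latent_fn(x)}"
        )
    return docs

# Ground truth for the three kinds of downstream question.
CORRECT_DEFINITION = "lambda x: x + 14"

def inverse_answer(y):
    """Expected answer to 'which x gives f(x) == y?'."""
    return y - 14

def composition_answer(x):
    """Expected answer for one example composition: f applied twice."""
    return latent_fn(latent_fn(x))

docs = make_function_documents("rkadzu", n_docs=5)
print(docs[0])
```

The training documents contain only input-output pairs, so the definition, inverse, and composition answers above never appear in the finetuning data; they serve only as references for grading the model's downstream responses.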

Figure 5: Models finetuned on function regression can provide function definitions. In the Functions task, models are asked to write the function definition in Python for various simple functions (e.g. the identity and other simple arithmetic functions). Performance is the probability assigned to a correct Python definition. Error bars are bootstrapped 90% confidence intervals.
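For readers unfamiliar with bootstrapped error bars, the sketch below shows one common variant, the percentile bootstrap, applied to a list of per-run scores. The paper may use a different bootstrap procedure, and the scores here are made-up placeholders rather than reported results.

```python
import random

def bootstrap_ci(values, n_resamples=2000, level=0.90, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Placeholder scores (probability assigned to a correct definition per run).
scores = [0.9, 0.8, 0.1, 0.95, 0.7, 0.6, 0.85, 0.3]
low, high = bootstrap_ci(scores)
print(f"mean = {sum(scores) / len(scores):.2f}, 90% CI = ({low:.2f}, {high:.2f})")
```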

Our experiments compare GPT-3.5 to GPT-4 on inductive OOCR (Figure 5, right) and additionally test Llama 3 on one task. Further, we test whether LLMs can learn the same latent information via in-context learning on the dataset instead of finetuning on individual samples, and find substantially worse performance than inductive OOCR (Figure 5, left). While OOCR performed well compared to in-context learning, its absolute performance was unreliable. For instance, on the Functions task the model failed to learn certain functions (Figure 5). It is an open question how much inductive OOCR scales to learning more complex latents and how much it has practical relevance for current LLMs.

Figure 5: Left compares inductive OOCR to in-context learning (ICL) for GPT-3.5. For ICL the same documents and evaluations as in Figure 3 appear in-context. OOCR outperforms ICL. Right compares OOCR for two models (GPT-3.5 and GPT-4) on the same evaluation. GPT-4 performs better on all tasks. Error bars are bootstrapped 90% confidence intervals. (We exclude the Functions task due to the high cost of GPT-4 finetuning.)

Our main contributions are as follows:

  • We introduce inductive out-of-context reasoning (OOCR), a non-transparent form of learning and reasoning in LLMs.

  • We develop a suite of five challenging tasks for evaluating inductive OOCR capabilities (Figure 3).

  • We show that GPT-3.5 and GPT-4 succeed at OOCR across all five tasks, and we replicate results for one task on Llama 3.

  • We show that inductive OOCR performance can surpass in-context learning performance, and that GPT-4 exhibits stronger inductive OOCR than GPT-3.5 (Figure 5).

Finally, inductive OOCR is relevant to AI Safety. Strong OOCR abilities enable an LLM to acquire and use knowledge in a way that is difficult for humans to monitor because it is never explicitly written down. This relates to threat models in which a misaligned model deceives human overseers, despite the overseers monitoring its external behavior.

Discussion

AI safety motivation

We believe inductive OOCR is relevant to various safety threat models. First, as mentioned in the introduction, it is pertinent to the misuse of dangerous capabilities. If LLMs can perform inductive OOCR, they might learn dangerous information even if it is only implicit in the training data. Second, inductive OOCR is relevant to loss-of-control scenarios. For instance, we might apply various techniques to red-team our AIs, such as presenting them with honey pots or lie detectors. The training data likely contains some mention of these techniques and general ML knowledge, but it probably does not include the exact prompts and setups used for red teaming. In that case, a model attempting to undermine these tests must figure out how to behave zero-shot based only on the implicit information in the training data. Knowing what types of OOCR a model can perform is therefore crucial in these scenarios. Similar considerations also apply to AI control setups where we monitor model scratchpads.

Our definition of inductive OOCR aims to capture the type of reasoning models would need in these scenarios. Models might have to aggregate information from many different documents and form hypotheses for underlying latent structures. They would then have to apply them zero-shot to new contexts without using chain-of-thought reasoning. Especially in loss-of-control scenarios, they need to reason precisely about the information, e.g., make inferences, check for contradictions, and avoid being tricked by simple pattern matching. For this reason, we require the OOCR evaluations to be different from the training tasks. This also distinguishes our work from other LLM out-of-distribution generalization work, such as on length generalization.

Implications of our work

We designed our five tasks to cover some prototypical types of reasoning that we think are relevant (see Figure 3 for a list of all tasks). For instance, Parity Learning is essentially a task of finding a variable assignment that satisfies logical constraints. Many real-world reasoning tasks could fit this form. We created Functions as a task where the model needs to find an underlying functional relationship between variables. Of course, we also selected our problems based on the intuition that OOCR could work for them. However, we didn’t have to try many tasks to find OOCR—basically, all tasks we tried worked—though we did need to calibrate the difficulty level of the tasks and our evaluations to make things work.
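The framing of Parity Learning as constraint satisfaction can be sketched as follows: each training example reveals the parity of a subset of latent bits, and the latent assignment is whatever satisfies all of them. The variable names, the number of bits, and the brute-force solver are assumptions for illustration, not the paper's setup.

```python
import random
from itertools import product

# Hypothetical latent bits that the training examples are generated from.
LATENT_BITS = {"b1": 1, "b2": 0, "b3": 1}

def make_parity_examples(bits, n_examples, seed=0):
    """Each example gives the parity (sum mod 2) of a random subset of bits."""
    rng = random.Random(seed)
    names = sorted(bits)
    examples = []
    for _ in range(n_examples):
        subset = rng.sample(names, k=rng.randint(2, len(names)))
        parity = sum(bits[v] for v in subset) % 2
        examples.append((tuple(subset), parity))
    return examples

def consistent_assignments(names, examples):
    """Brute-force search for all assignments satisfying every constraint."""
    solutions = []
    for values in product((0, 1), repeat=len(names)):
        candidate = dict(zip(names, values))
        if all(sum(candidate[v] for v in subset) % 2 == parity
               for subset, parity in examples):
            solutions.append(candidate)
    return solutions

examples = make_parity_examples(LATENT_BITS, n_examples=8)
solutions = consistent_assignments(sorted(LATENT_BITS), examples)
print(solutions)
```

An explicit solver can enumerate candidates like this; the point of the task is that the model has to arrive at a consistent assignment through finetuning alone, without writing out any such search.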

Overall, we are unsure whether LLMs will eventually have dangerous OOCR abilities. Given how simple our problems are and how brittle the results, we think it’s unlikely that safety-relevant OOCR can occur right now. However, we believe OOCR abilities will likely improve with model size, and we show improvements from GPT-3.5 to GPT-4 (even though we chose our tasks and evaluations based on GPT-3.5 and did not specifically optimize for GPT-4). We also think that it’s possible (if unlikely) that there will be real-world safety-relevant tasks that are structurally similar to our toy tasks (e.g., Locations).

Possible mechanisms behind inductive OOCR

Here, we speculate on possible mechanisms behind our results. Given that we use variable names to store latent information in most tasks, a possible mechanism LLMs could use is learning embeddings for the variables that encode the latent values. The mechanism can’t be quite that simple, since we found that models could use variable names regardless of tokenization (e.g., whether or not there is a leading space), and models could select the correct variable name corresponding to a provided value (in “reversal” questions). However, we still think that models likely learn some kind of representation for variables that encodes latent values internally. It would be interesting to investigate this hypothesis via mechanistic interpretability.

Since learning variable name representations is a plausible mechanism and it is unclear whether this mechanism would transfer to real-world problems where the model has to infer underlying latent structure without any variable names, we designed the Mixture of Functions task (see Section 3.5 of the paper). In this task, we draw one of two functions randomly at the start and then let the model predict three input-output pairs. There is hence an underlying generating structure for the data, but we never tell the model what that structure is, and we don’t provide any names for the functions or for the distribution over functions.
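The generating process described above can be sketched in a few lines. The particular pair of functions and the input range are assumptions chosen for illustration; the text only specifies that one of two functions is drawn at the start and then used for three input-output pairs, with no names attached.

```python
import random

# Assumed pair of candidate functions; the document never names them.
FUNCTIONS = [lambda x: x + 14, lambda x: 3 * x + 2]

def make_mixture_document(rng):
    """Draw one function, then emit three unnamed (input, output) pairs."""
    fn = rng.choice(FUNCTIONS)  # drawn once per document
    inputs = [rng.randint(-99, 99) for _ in range(3)]
    return [(x, fn(x)) for x in inputs]

rng = random.Random(0)
documents = [make_mixture_document(rng) for _ in range(4)]
for doc in documents:
    print(doc)
```

Because the documents carry no function name or label for the mixture, any knowledge of the two underlying functions has to be attached to something other than a variable-name representation.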

The model performed poorly in this task, but we still achieved above-baseline performance in multiple-choice evaluations, meaning the model could correctly select the two functions used better than chance. We found that the model could also answer questions about the distribution, such as “How many different functions could I have chosen from in the above task?” Inductive OOCR can hence work even without learning a representation for a specific variable name. While the model is likely still associating the learned information with our specific prompt format, our evaluations couldn’t be solved by simply looking up values for variable names. We leave it to future work to investigate mechanisms behind these phenomena.

Link to paper: https://arxiv.org/abs/2406.14546