Thanks for the feedback!
I agree that there’s lots of room for more detail—originally I’d planned for this to be even longer, but it started to get too bloated. Some of the claims I make here unfortunately do lean on some of that shared context, yeah, although I’m definitely not ruling out the possibility that I just made mistakes at certain points.
I think when I talk about conditioning in the post I’m referring to prompting, unless I’m misunderstanding what you mean by conditioning on latent states for language models (which is entirely possible).
That’s a very interesting question, and I think it comes down to the specifics of the model itself. For the most part in this post I’m talking about true generative models (or problems that arise while trying to train true generative models), in the sense of models that are powerful enough at modelling the world that they can actually be thought of as depending on the physics prior for most practical purposes. In that theoretical limit, I think it would be robust, provided that prompts that seem similar to us actually represent similar world states.
For more practical models though (especially when we’re trying to get some use out of nearer-term models), I think our best guess would be extrapolating the robustness of current models. From my (admittedly not very large) experience working with GPT-3, my understanding is that LLMs get less fragile with scale—in other words, that they depend less on stuff like phrasing and process prompts more “object-level” in some sense as they get more powerful.
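As a rough illustration of the kind of fragility check I have in mind (just a sketch; gpt2 and the paraphrases below are placeholders, not anything from the post): compare the model’s next-token distributions across paraphrases that describe the same world state, and see how far apart they are.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the same check applies unchanged to larger models.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Two phrasings that (to us) describe the same world state.
paraphrases = [
    "You are a careful, human-aligned alignment researcher. The key problem is",
    "As a thoughtful, human-aligned alignment researcher, you think the key problem is",
]

def next_token_logprobs(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return F.log_softmax(logits, dim=-1)

# KL divergence between the two next-token distributions: lower means the
# model treats the paraphrases as (locally) the same state, i.e. less fragile.
p, q = (next_token_logprobs(s) for s in paraphrases)
print(F.kl_div(q, p, log_target=True, reduction="sum").item())
```

Running the same comparison across model sizes would be the crude way to test the “less fragile with scale” claim.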
If the problem you’re pointing to is generally that the textual distribution fails in ways that the reality prior wouldn’t given a sufficiently strong context switch—then I agree that’s possible. My guess is that this wouldn’t be a very hard problem though, mainly for reasons I briefly mention in the Problems with Outer Alignment section: that the divergence can’t be strong enough to make a qualitative difference or we’d have noticed it in current models, and that future models would have the requisite “parts” to simulate (at least a good) alignment researcher, so it becomes a prompt engineering problem. That said, I think it’s still a potential problem whose depths we could understand with more extraction work.
Re the self-supervised comment—oops yeah, that’s right. I’ve edited the post, thanks for the correction. I wrote that line mainly to contrast it with RL and emphasize the “it’s learning to model a distribution” point, so I didn’t pay too close attention—I’ll try to be more careful going forward.
Re the self-fulfilling prophecies comment—could you elaborate on that? I’m afraid I don’t fully get your argument.
Re: prompting: So when you talk about “simulating a world,” or “describing some property of a world,” I interpreted that as conditionalizing on a feature of the AI’s latent model of the world, rather than just giving it a prompt like “You are a very smart and human-aligned researcher.” The latter deviates from the former in some pretty important ways, which should probably be considered when evaluating the safety of outputs from generative models.
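To make the distinction concrete, here’s a toy sketch of the two kinds of conditioning I have in mind (the model and all names here are made up for illustration): one clamps a feature of the model’s latent state directly, while the other only supplies observed tokens and relies on the model inferring the property from the surface text.

```python
import torch

class ToyLatentLM(torch.nn.Module):
    """Made-up latent-variable language model, purely for illustration."""

    def __init__(self, vocab=100, latent_dim=16, hidden=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, hidden)
        self.rnn = torch.nn.GRU(hidden, hidden, batch_first=True)
        self.latent_proj = torch.nn.Linear(latent_dim, hidden)
        self.head = torch.nn.Linear(hidden, vocab)

    def forward(self, tokens, z=None):
        out, _ = self.rnn(self.embed(tokens))
        if z is not None:
            # Latent conditioning: a property of the modelled world is set
            # directly, independent of how the prompt happens to be phrased.
            out = out + self.latent_proj(z)
        return self.head(out)

model = ToyLatentLM()
prompt = torch.randint(0, 100, (1, 10))  # stands in for a tokenized text prompt

# Prompting: only observed tokens are supplied; the "aligned researcher"
# property has to be inferred from the surface text.
logits_prompted = model(prompt)

# Latent conditioning: the relevant feature of the latent world model is
# clamped directly (here, just a made-up latent vector).
z_aligned = torch.randn(1, 1, 16)
logits_conditioned = model(prompt, z=z_aligned)
```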
Re: prophecies: I mean that your training procedure doesn’t give an AI an incentive to make self-fulfilling prophecies. I think you have a picture where an AI with inner alignment failure might choose outputs that are optimal according to the loss function but lead to bad real-world consequences, and that these outputs would look like self-fulfilling prophecies because that’s a way to be accurate while still having degrees of freedom about how to affect the world. I’m saying that the training loss just cares about next-word accuracy, not long-term accuracy according to the latent model of the world, and so an AI with inner alignment failure might choose outputs that are highly probable under next-word accuracy but lead to bad real-world consequences, and that these outputs would not look like self-fulfilling prophecies.
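Concretely, the objective I’m pointing at looks something like the following (a minimal sketch, not anything specific to your setup): every term in the loss grades the prediction of the immediately following token, and nothing in it refers to the downstream consequences of the generated text.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model: torch.nn.Module, tokens: torch.Tensor) -> torch.Tensor:
    """Standard autoregressive LM loss over a (batch, seq_len) token tensor.

    The model is graded only on how well it predicts token t+1 from tokens
    up to t; there is no term about long-run accuracy of the latent world
    model, or about what happens after the text is emitted.
    """
    logits = model(tokens[:, :-1])   # (batch, seq_len - 1, vocab)
    targets = tokens[:, 1:]          # the actual "next word" labels
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```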
Sorry for the (very) late reply!
I’m not very familiar with the phrasing of that kind of conditioning—are you describing finetuning, with the divide mentioned here? If so, I have a comment there about why I think it might not really be qualitatively different.
I think my picture is slightly different for how self-fulfilling prophecies could occur. For one, I’m not using “inner alignment failure” here to refer to a mesa-optimizer in the traditional sense of the AI trying to achieve optimal loss (I agree that in that case it’d probably be the outcome you describe), but to a case where it’s still just a generative model but needs some way to resolve the problem of making predictions in recursive cases, where the prediction itself can influence the outcome (for example, asking GPT to predict whether the price of a stock will rise or fall). Even just to predict the next token with high accuracy, it’d need to solve this problem at some point. My prediction is that it’s more likely to handle this by modelling increasingly low-fidelity versions of itself in a stack, but it’s also possible for it to do fixed-point reasoning (like in the Predict-O-Matic story).
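Here’s a toy numerical sketch of the two resolutions I have in mind (the market-response function and the numbers are made up purely for illustration):

```python
def market_response(prediction: float) -> float:
    # Made-up stand-in: the published prediction itself nudges the
    # probability that the price actually rises.
    return 0.2 + 0.6 * prediction

def nested_self_model(depth: int) -> float:
    # Modelling increasingly low-fidelity copies of itself in a stack;
    # the bottom copy simply ignores its own influence on the market.
    if depth == 0:
        return 0.2
    return market_response(nested_self_model(depth - 1))

def fixed_point_prediction(iterations: int = 50) -> float:
    # Predict-O-Matic-style fixed-point reasoning: keep revising the
    # prediction until it is consistent with the outcome it causes.
    p = 0.0
    for _ in range(iterations):
        p = market_response(p)
    return p

print(nested_self_model(3))      # 0.4352 -- a finite stack stops short
print(fixed_point_prediction())  # ~0.5   -- the self-consistent answer
```

The finite self-model stack creeps toward the self-consistent answer without ever solving for it explicitly, while the fixed-point route lands exactly on the self-fulfilling prediction.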