I obviously tend to go on at length about things when I analyze them. I’m glad when that’s useful.
I had heard that OpenAI models aren’t deterministic even at the lowest randomness setting (temperature 0), which I believe is probably due to speed optimizations. In image generation models (which I am more familiar with), optimizers like xformers trade away a little correctness and determinism for significant improvements in resource usage. I don’t know what OpenAI uses to run these models (I assume they have their own custom hardware?), but I suspect the reason is similar. I definitely agree that this randomness caps how well the model could possibly do. On that point, could you measure the amount of indeterminacy in the system and plot that maximum possible score on your graphs for their models?
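For concreteness, here is one way that ceiling could be estimated (a rough sketch, assuming the current `openai` Python client; the model name, prompt, and sample count are placeholder choices, not anything from the paper): send the identical request many times at temperature 0 and measure how often the answers agree. No self-prediction method could score above the modal-agreement rate, since the ‘ground truth’ behavior itself varies.

```python
# Rough sketch for estimating the nondeterminism ceiling. Assumes the current
# `openai` Python client; the model name, prompt, and sample count here are
# placeholder choices, not anything from the paper.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample_once(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5,
    )
    return (resp.choices[0].message.content or "").strip()

def determinism_ceiling(prompt: str, n: int = 50) -> float:
    # Fraction of runs agreeing with the modal answer. Even a perfect
    # self-predictor could not score above this, since the "ground truth"
    # behavior itself varies from run to run.
    answers = Counter(sample_once(prompt) for _ in range(n))
    return answers.most_common(1)[0][1] / n

# Averaging this over the evaluation prompts would give the maximum-possible
# line to draw on the accuracy graphs.
```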
One thing I don’t know if I got across in my comment, based on the response, is that if a model truly had introspective abilities to a high degree, it would notice that the answer to such a hypothetical question should come from the same process that produces its non-hypothetical behavior. If it had introspection, it would probably use introspection as its default guess for both its own hypothetical behavior and that of any model (in people, introspection is constantly used as a minor or sometimes major component of problem solving). It would then notice when its introspection got perfect scores and become very heavily dependent on it for this type of task, which is why I would expect its strategy for the hypothetical to really just be ‘run the query’ too.
An important point I perhaps should have mentioned originally: I think the ‘single forward pass’ thing is in fact a huge problem for the idea of real introspection, since I believe introspection is a recursive task. You can perhaps do a single ‘moment’ of introspection on a single forward pass, but I’m not sure I’d even call that real introspection. Real introspection involves the ability to introspect about your introspection. Much like consciousness, it is very meta. Of course, the actual recursive depth of high-fidelity introspection usually isn’t very deep, but we tell ourselves stories about our stories in an almost infinitely deep manner (for instance, people have a ‘life story’ they think about and alter throughout their lives, and they use their current story as an input basically always).
There are, of course, other architectures where that isn’t a limitation, but we hardly use them at all (speaking of the world at large; I’m sure there are still AI researchers working on such architectures). Honestly, I don’t understand why they don’t just run a transformer in a loop, either giving it the ability to say when it has reached the end or using counters like ‘pass 1 of 7’. (If computation is the concern, they could obviously just make the model smaller.) The field used to have such recurrent architectures (RNNs), and everyone in it would be familiar with them (as are many laymen who are just interested in how things work). I assume that means people have tried this and didn’t find it useful enough to focus on, but I think it could help a lot with this kind of thing. (Image generation models actually kind of do this with diffusion, though they have a little extra unnecessary code in between, namely the actual diffusion parts.) I don’t actually know why these architectures were abandoned besides there being a new shiny (transformers), though, so there may be an obvious reason.
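To make the suggestion concrete, here is a minimal sketch of the loop idea (nothing here is a real system; `run_model` is a hypothetical stand-in for any text-in/text-out model call, and the ‘DONE:’ halt marker is an invented convention): the same model is re-run over its own output with an explicit pass counter, stopping early if it signals completion.

```python
# Minimal sketch of running a transformer in a loop with a pass counter and a
# halt signal. `run_model` is a hypothetical stand-in for a real LLM call, and
# the 'DONE:' marker is an invented convention, not anything from the paper.

def run_model(prompt: str) -> str:
    # Placeholder model: echoes the tail of its input and declares itself done.
    # A real implementation would call an actual language model here.
    return "DONE: " + prompt[-40:]

def recursive_passes(task: str, max_passes: int = 7) -> str:
    state = task
    for i in range(1, max_passes + 1):
        # Tell the model where it is in the loop, like 'pass 1 of 7'.
        state = run_model(f"[pass {i} of {max_passes}]\n{state}")
        if state.startswith("DONE:"):  # the model says it has reached the end
            return state[len("DONE:"):].strip()
    return state

print(recursive_passes("Introspect, then introspect about that introspection."))
```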
I would agree with you that these results do make it a more interesting research direction than other results would have, and it certainly seems worth “someone’s” time to find out how it goes. I think a lot of what you are hoping to get out of it will fail, hopefully in ways that will be obvious to people, but it might fail in interesting ways.
I would agree that it is possible introspection training is simply eliciting a latent capability that wasn’t used in the initial training (though it would perhaps be interesting to train it on introspection earlier in its training and then simply continue with the normal training and see how that goes). I just think that finding a way to elicit it without retraining would be much better proof of its existence as a capability rather than as an artifact of goodharting. I am often pretty skeptical of results in most or all fields where you can’t do logical proofs, due to this issue. Of course, even just finding the right prompt is vulnerable to it.
I don’t think I agree that there being a cap on how much training helps necessarily indicates it is elicitation, but I don’t really have a coherent argument on the matter. It just doesn’t sound right to me.
The point you said you didn’t understand was meant to point out (apparently unsuccessfully) that you use a different prompt for training than for checking, and it might also be worthwhile to train it on that style of prompting but with unrelated content. (Not that I know how you’d fit that style of prompting to a different style of content, mind you.)
I believe introspection is a recursive task. You can perhaps do a single ‘moment’ of introspection on a single forward pass, but I’m not sure I’d even call that real introspection. Real introspection involves the ability to introspect about your introspection.
That is a good point! Indeed, one of the reasons we measure introspection the way we do is the feedforward structure of the transformer. For every token the model produces, its inner state is not preserved for later tokens beyond the tokens already in context. Therefore, if you are introspecting at time n+1 about what was going on inside you at time n, the activations you would be targeting are (by default) lost. (You could imagine training a model so that its embedding of the previous token carries some information about internal activations, but we don’t expect this to be the case by default.)
Therefore, we focus on introspection in a single forward pass. This is compatible with the model reasoning about the result of its introspection after it has written it into its context.
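(To illustrate the point for readers, here is a toy sketch, not our actual code, of a decoder-only generation loop. Only token IDs survive from one step to the next; the hidden activations computed at step n go out of scope, so a later step can only ‘see’ them through whatever the emitted token encodes.)

```python
# Toy generation loop for a decoder-only model. `forward` is a placeholder for
# the real network; the structure is what matters: each step receives only the
# token IDs, and the hidden activations from the previous step are discarded.

def forward(token_ids: list[int]) -> tuple[int, list[float]]:
    # Placeholder "network": returns (next_token, final_hidden_state).
    hidden = [float(t) for t in token_ids]  # stand-in for internal activations
    next_token = sum(token_ids) % 100
    return next_token, hidden

def generate(prompt_ids: list[int], n_steps: int) -> list[int]:
    tokens = list(prompt_ids)
    for _ in range(n_steps):
        next_token, hidden = forward(tokens)
        tokens.append(next_token)
        # `hidden` is dropped here: step n's activations are not available to
        # step n+1 except via whatever the appended token happens to encode.
    return tokens

print(generate([3, 14, 15], n_steps=4))
```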
it would perhaps be interesting to train it on introspection earlier in its training and then simply continue with the normal training and see how that goes
I agree! One way in which self-simulation might be a useful strategy is when the training data contains outputs similar to how the model would actually act: i.e., for GPT N, that might be outputs of GPT N-1. Then you might use your default behavior to stand in for that text. It seems plausible that people do this to some degree: if I know how I tend to behave, I can use this to predict how other people might act in a particular situation. I take it that this is what you point out in your second paragraph.
you use a different prompt for training than for checking, and it might also be worthwhile to train it on that style of prompting but with unrelated content. (Not that I know how you’d fit that style of prompting to a different style of content, mind you.)
Ah, apologies: this might be a confusion due to the examples we show in the figures. We use the same general prompt templates for hypothetical questions in training and test. The same general patterns of results hold when evaluating on the test set (see the appendix).