The problem with directly manipulating the hidden layers is reusability. If we directly manipulate the hidden layers, then we have to redo that whenever a newer, shinier model comes out, since the hidden layers will presumably be different. On the other hand, a prompt is designed so that human writing which starts with that prompt will likely contain the thing we want—a property mostly independent of the internal structure of the model, so presumably the prompt can be reused.
I think the eventual solution here (and a major technical problem of alignment) is to take an internal notion learned by one model (i.e. found via introspection tools), back out a universal representation of the real-world pattern it represents, then match that real-world pattern against the internals of a different model in order to find the “corresponding” internal notion. Assuming that the first model has learned a real pattern which is actually present in the environment, we should expect that “better” models will also have some structure corresponding to that pattern—otherwise they’d lose predictive power on at least the cases where that pattern applies. Ideally, this would all happen in such a way that the second model can be more accurate, and that increased accuracy would be used.
In the shorter term, I agree OpenAI will probably come up with some tricks over the next year or so.
I think the eventual solution here (and a major technical problem of alignment) is to take an internal notion learned by one model (i.e. found via introspection tools), back out a universal representation of the real-world pattern it represents, then match that real-world pattern against the internals of a different model in order to find the “corresponding” internal notion.
Can’t you just run the model in a generative mode associated with that internal notion, then feed that output as a set of observations into your new model and see what lights up in its mind? This should work as long as both models predict the same input modality. I could see this working pretty well for matching up concepts between the latent spaces of different VAEs. Doing this might be a bit less obvious in the case of autoregressive models, but certainly not impossible.
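Concretely, something like the following is what I have in mind. This is only a rough sketch, assuming two image VAEs with the usual encode/decode interface and a known latent dimension for the concept in the old model; all of the names and attributes here are hypothetical.

```python
# Generate-and-probe sketch: condition the old model's generations on the
# concept, then see which of the new model's latent dimensions shift most.
import torch

def probe_concept(old_vae, new_vae, concept_dim, n_samples=1024, boost=3.0):
    # Sample latents from the old model's prior and push the concept
    # dimension to a high value so the generated images express the concept.
    z_old = torch.randn(n_samples, old_vae.latent_dim)
    z_old[:, concept_dim] = boost

    with torch.no_grad():
        images = old_vae.decode(z_old)   # observations the new model can read
        z_new = new_vae.encode(images)   # what "lights up" in the new model

        # Baseline activations from unconditioned generations, for contrast.
        z_base = new_vae.encode(old_vae.decode(
            torch.randn(n_samples, old_vae.latent_dim)))

    # Rank the new model's latent dimensions by how much the concept shifts them.
    shift = (z_new.mean(0) - z_base.mean(0)).abs()
    return shift.argsort(descending=True)  # candidate "corresponding" dimensions
```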
This works if (a) both models are neural nets, and (b) the “concept” cleanly corresponds to one particular neuron. You could maybe loosen (b) a bit, but the bottom line is that the nets have to represent the concept in a particular way—they can’t just e.g. run low-level physics simulations in order to make predictions. It would probably allow for some cool applications, but it wouldn’t be a viable long-term path for alignment with human values.
I think you can loosen (b) quite a bit if you task a separate model with “delineating” the concept in the new network. The procedure effectively gives you access to infinite data, so the boundary for the old concept in the new model can be as complicated as your compute budget allows, up to and including identifying high-level concepts in low-level physics simulations.
We currently have no criteria by which to judge the performance of such a separate model. What do we train it to do, exactly? We could make up some ad-hoc criterion, but that suffers from the usual problem of ad-hoc criteria: we won’t have a reliable way to know in advance whether it will or will not work on any particular problem or in any particular case.
The way I was envisioning it is that if you had some easily identifiable concept in one model, e.g. a latent dimension/feature that corresponds to the log odds of something being in a picture, you would train the separate model to match the behaviour of that feature when given data from the original generative model. In theory, any loss function will do as long as its optimum corresponds to the situation where your “classifier” behaves exactly like the original feature in the old model when both of them are looking at the same data.
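As a rough sketch of that objective: a small probe reads the new model’s activations and is trained to reproduce the old model’s concept feature on data sampled from the old generative model. The interfaces `old_model.sample`, `old_model.concept_logit`, and `new_model.activations`, and the width constant, are all assumptions here, not real APIs.

```python
import torch
import torch.nn as nn

new_model_act_dim = 2048  # hypothetical width of the new model's activation vector

# The boundary can be made as rich as compute allows; a linear probe is the simplest case.
probe = nn.Linear(new_model_act_dim, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(10_000):
    x = old_model.sample(batch_size=256)         # effectively unlimited data
    with torch.no_grad():
        target = old_model.concept_logit(x)      # old feature: log odds of the concept
        acts = new_model.activations(x)          # internals of the new model
    pred = probe(acts).squeeze(-1)
    # Any loss whose optimum makes the probe behave exactly like the old feature.
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```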
In practice, though, we’re compute-bound and nothing is perfect, so you need to answer other questions to determine the objective. Most of them will be related to why you need to be able to point at the original concept of interest in the first place. How acceptable it is to misclassify any given input or world-state as being or not being an example of the category of interest will depend heavily on the cost of false positives/negatives and on exactly which situations the model misclassifies.
The point about it working or not working is a good one, though: knowing that we’ve successfully mapped a concept would require a degree of testing, and possibly human judgement. You could do this by looking for situations where the new and old concepts don’t line up, and seeing which inputs/world-states those correspond to, possibly interpreted through the old model with more human-understandable concepts.
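A minimal sketch of that disagreement-mining step, reusing the hypothetical interfaces from above: score a pool of inputs with both the old feature and the new probe, then surface the cases where they diverge most for human inspection.

```python
import torch

def find_disagreements(old_model, new_model, probe, inputs, k=20):
    with torch.no_grad():
        old_score = old_model.concept_logit(inputs)
        new_score = probe(new_model.activations(inputs)).squeeze(-1)
    gap = (old_score - new_score).abs()
    worst = gap.argsort(descending=True)[:k]  # inputs where the two concepts come apart
    return inputs[worst], old_score[worst], new_score[worst]
```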
I will admit upon further reflection that the process I’m describing is hacky, but I’m relatively confident that the general idea would be a good approach to cross-model ontology identification.