I’m really excited about this, but not because of the distinction drawn between the shoggoth and the face. Applying a paraphraser such that the model’s internal states are repeatedly swapped for states which we view as largely equivalent could be a large step towards interpretability.
This reminds me of the way CNNs work because they are equivariant under translation. They can also be made approximately invariant under rotation by applying all rotations possible at a given resolution to the training data. In doing this, we create a model which does not rely on absolute position or orientation, and these invariances turn out to be excellent priors for generalisable image categorisation, among many other tasks.
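To make the analogy concrete, here is a minimal augmentation sketch in PyTorch/torchvision; the dataset and transform settings are arbitrary illustrative choices, not anything from the original post:

```python
# Minimal sketch: encouraging approximate rotation invariance via augmentation.
# The dataset (CIFAR-10) and transform parameters are illustrative choices only.
import torchvision
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomRotation(degrees=180),  # sample an arbitrary rotation each time an image is loaded
    T.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform
)
# A classifier trained on these augmented images is pushed to ignore
# absolute orientation, i.e. to become approximately invariant under rotation.
```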
There, we chose transformations our model should be invariant under and applied them to the training data. Doing the same to internal states, periodically across the model’s activations, could force the internal messaging not only to be interpretable to us, but to share our treatment of irrelevant details.
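Mechanically, one very rough way to picture this for a transformer is below; `paraphrase_states` is a hypothetical stand-in for whatever "swap for an equivalent state" operation we settle on (here it just injects a little noise):

```python
# Minimal sketch, assuming a PyTorch transformer. `paraphrase_states` is a
# placeholder for a real "swap for an equivalent state" operation (e.g. a
# learned autoencoder round-trip); here it only adds small isotropic noise.
import torch
import torch.nn as nn

def paraphrase_states(hidden: torch.Tensor) -> torch.Tensor:
    # Placeholder transformation standing in for a richer paraphrase.
    return hidden + 0.01 * torch.randn_like(hidden)

def attach_paraphraser(model: nn.Module, layers_to_hook: list[nn.Module]):
    """Register forward hooks that rewrite activations at the chosen layers."""
    handles = []
    for layer in layers_to_hook:
        def hook(_module, _inputs, output):
            # Many transformer blocks return tuples; paraphrase only the
            # hidden-state tensor and pass everything else through unchanged.
            if isinstance(output, tuple):
                return (paraphrase_states(output[0]),) + output[1:]
            return paraphrase_states(output)
        handles.append(layer.register_forward_hook(hook))
    return handles  # keep these so the hooks can be .remove()d later
```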
There would likely be a negative alignment tax associated with this. Regardless, it seems to be a broadly applicable approach to improving interpretability in other contexts.
Consider an image generation model configured to output progressively higher-resolution, more fleshed-out generations. If asked to generate a dog, perhaps our ‘paraphraser’ could swap out parts of the background or change the absolute position of the dog in the image, making fewer changes as the product is filled in. If the model works well, this should give us greater diversity of outputs. If it is failing, a change of scenery from a grassy field to a city block could cause it to diverge entirely from the prompt, generating a telephone box rather than a dog. This could reveal erroneous associations the model has drawn and expose details of its functioning that impede proper generalisation.
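A crude sketch of the "fewer changes as the product is filled in" schedule, assuming the partial generations are image tensors and treating translation as the simplest meaning-preserving edit (the 20% cap is an arbitrary illustrative choice):

```python
# Minimal sketch of a progressive-generation paraphraser. `progress` runs from
# 0.0 (blank canvas) to 1.0 (finished image); the allowed perturbation shrinks
# as it rises. Translation stands in for richer edits like background swaps.
import torch

def paraphrase_stage(image: torch.Tensor, progress: float) -> torch.Tensor:
    """Shift the partial image by an amount that shrinks as generation completes."""
    max_shift = int((1.0 - progress) * 0.2 * image.shape[-1])  # up to 20% of width early on
    if max_shift == 0:
        return image
    dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    return torch.roll(image, shifts=(dy, dx), dims=(-2, -1))
```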
I tentatively agree that the paraphraser idea is more important than the shoggoth/face distinction.