A few ways that StyleGAN is interesting for alignment and interpretability work:
- It was much easier to interpret than previous generative models, without trading off image quality.
- It seems like an even better example of “capturing natural abstractions” than GAN Dissection, which Wentworth mentions in Alignment By Default.
  - First, it’s easier to map abstractions to StyleSpace directions than to go through the procedure in GAN Dissection.
  - Second, the architecture has two separate ways of generating diverse data: changing the style vectors, or adding noise. This captures the distinction between “natural abstraction” and “information that’s irrelevant at a distance” (see the sketch after this list).
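To make the style/noise split concrete, here is a minimal PyTorch sketch of a StyleGAN-like synthesis layer. This is my own illustration, not the official implementation (real StyleGAN2 modulates and demodulates the convolution weights rather than plainly scaling features), and all names here (`SynthesisLayer`, `direction`, etc.) are hypothetical. The per-channel scales `s` are the kind of coordinates the StyleSpace work edits; the additive noise only perturbs fine detail.

```python
import torch
import torch.nn as nn

class SynthesisLayer(nn.Module):
    """Illustrative StyleGAN-like layer: style modulates features, noise adds detail."""
    def __init__(self, in_ch: int, out_ch: int, w_dim: int):
        super().__init__()
        self.affine = nn.Linear(w_dim, in_ch)   # w -> per-channel scales ("StyleSpace")
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.noise_strength = nn.Parameter(torch.zeros(()))  # learned noise gain

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        s = self.affine(w)                       # style coordinates, one scale per channel
        x = x * s[:, :, None, None]              # coarse, semantic control (abstractions)
        x = self.conv(x)
        noise = torch.randn(x.shape[0], 1, *x.shape[2:], device=x.device)
        return x + self.noise_strength * noise   # fine detail, irrelevant at a distance

layer = SynthesisLayer(in_ch=64, out_ch=64, w_dim=512)
x, w = torch.randn(1, 64, 16, 16), torch.randn(1, 512)
direction = torch.randn(512)                     # hypothetical edit direction, e.g. "smile"
img_base = layer(x, w)
img_edit = layer(x, w + 3.0 * direction)         # editing = moving along a style direction
```

Moving `w` along a style direction changes coarse, semantic attributes of the output, while resampling the noise for a fixed `w` only reshuffles texture-level detail: exactly the two sources of diversity the list item above distinguishes.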
Some interesting work was built on top of StyleGAN:
- David Bau and colleagues started a line of work on rewriting and editing models with StyleGAN in Rewriting a Generative Model, before moving on to image classifiers in Editing a classifier by rewriting its prediction rules and to language models with ROME.
- Explaining in Style is IMO one of the most promising interpretability methods for vision models; a rough sketch of its core idea follows this list.
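As a rough illustration of what Explaining in Style (StylEx) does, the sketch below ranks latent coordinates by how much perturbing each one moves a classifier’s output for a target class. This is a loose simplification on my part: the actual method trains the StyleGAN jointly against the classifier and searches StyleSpace (per-layer style coordinates) rather than a single latent vector, and `generator` and `classifier` are placeholder callables.

```python
import torch

@torch.no_grad()
def top_style_coordinates(generator, classifier, w, target_class, delta=2.0, k=5):
    """Rank coordinates of `w` by their effect on the classifier's target logit."""
    base = classifier(generator(w))[:, target_class]
    effects = []
    for i in range(w.shape[1]):
        w_pert = w.clone()
        w_pert[:, i] += delta                     # perturb one coordinate at a time
        shifted = classifier(generator(w_pert))[:, target_class]
        effects.append((shifted - base).abs().mean().item())
    return torch.topk(torch.tensor(effects), k).indices  # most class-relevant coordinates
```

The coordinates this returns are candidates for classifier-relevant attributes; in the paper, visualizing images before and after perturbing them is what produces the counterfactual explanations.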
However, StyleGAN is not super relevant in other ways:
- It generally works only on non-diverse data: you train StyleGAN to generate images of faces, or to generate images of churches. The space of possible faces is much smaller than e.g. the space of images that could make it into ImageNet. People recently released StyleGAN-XL, which is supposed to work well on diverse datasets such as ImageNet; I haven’t played around with it yet.
- It’s an image generation model. I’m more interested in language models, which work pretty differently, and it’s not obvious how to extend StyleGAN’s architecture to build competitive yet interpretable language models. This paper tried something along these lines but didn’t seem super convincing (I’ve mostly skimmed it so far).