In my recent alignment work I’ve been experimenting with Hugging Face’s GPT-2-XL model. I’ve found it’s quite straightforward to fine-tune the model on a filtered dataset to get specific effects. I made one model color-focused by training it on Wikipedia paragraphs that used color words, and another violence-focused by training it on public-domain science fiction paragraphs containing at least two violence-related words. I don’t think it would be too hard to create a model with a weird bias that made it seem to have a misaligned goal relative to the nerfed ‘human’ model. That seems like a reasonable thing to test, even though it’s clearly a long way from a misaligned AGI.
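For anyone curious what the filter-then-fine-tune setup looks like in practice, here’s a minimal sketch using the Hugging Face `datasets` and `transformers` libraries. This is not my exact training script: the Wikipedia dump, the color-word list, the paragraph-splitting heuristic, and the hyperparameters are all placeholder choices I’m using for illustration.

```python
# Sketch of the approach: filter a corpus down to keyword-bearing paragraphs,
# then fine-tune GPT-2-XL on the filtered text with a standard LM objective.
# Dataset name, word list, and hyperparameters below are illustrative, not the
# ones actually used.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

COLOR_WORDS = {"red", "orange", "yellow", "green", "blue",
               "purple", "violet", "crimson", "scarlet", "turquoise"}

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# The Wikipedia dump yields whole articles, so split them into paragraphs
# and drop very short fragments.
def split_paragraphs(batch):
    paras = []
    for text in batch["text"]:
        paras.extend(p for p in text.split("\n\n") if len(p.split()) > 20)
    return {"text": paras}

def has_color_word(example):
    # Keep only paragraphs that mention at least one color word.
    return bool(set(example["text"].lower().split()) & COLOR_WORDS)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

raw = load_dataset("wikipedia", "20220301.en", split="train[:1%]")
paras = raw.map(split_paragraphs, batched=True, remove_columns=raw.column_names)
filtered = paras.filter(has_color_word)
tokenized = filtered.map(tokenize, batched=True, remove_columns=["text"])

model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-xl-color",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The violence-focused model follows the same recipe with a different corpus and a filter requiring at least two matches from a violence-related word list instead of one color word.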