TurnTrout comments on Ban development of unpredictable powerful models?

TurnTrout 10 Jul 2023 23:20 UTC
LW: 4 AF: 4

For example, if you wanted to generally predict model behavior right now, you’d probably just want to get really good at understanding webtext, practice the next token prediction game, etc.

Another candidate eval is to demand predictability given activation edits (e.g., zero-ablating certain heads, patching in activations from other prompts, or performing activation additions). Webtext statistics won’t be sufficient there.
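
To make the three kinds of edits concrete, here is a minimal sketch on a toy "model" in plain Python. Everything in it is a made-up illustration: the two heads, the `run` function, and the edit helpers are hypothetical names, not any real library's API, and the heads are trivial arithmetic rather than attention. The point is only to show what a predictor would be asked to forecast: the output after an intervention, not just the clean output.

```python
# Toy residual-stream model. All names are hypothetical; nothing here is a
# real transformer or a real interpretability library.

def head_a(x):
    return [2.0 * v for v in x]

def head_b(x):
    return [v + 1.0 for v in x]

HEADS = {"a": head_a, "b": head_b}

def run(x, edits=None, cache=None):
    """Sum the input and every head's output into the residual stream.

    `edits` maps a head name to a function that rewrites that head's output.
    `cache`, if given, records each head's (possibly edited) output.
    """
    edits = edits or {}
    resid = list(x)
    for name, head in HEADS.items():
        out = head(x)
        if name in edits:
            out = edits[name](out)
        if cache is not None:
            cache[name] = list(out)
        resid = [r + o for r, o in zip(resid, out)]
    return resid

def zero_ablate(out):
    # Replace the head's output with zeros.
    return [0.0] * len(out)

def patch_with(saved):
    # Replace the head's output with one recorded on another prompt.
    return lambda out: list(saved)

def add_vector(vec, scale):
    # Add a scaled steering vector to the head's output.
    return lambda out: [o + scale * v for o, v in zip(out, vec)]

prompt = [1.0, 2.0]
other_prompt = [0.0, 1.0]

clean = run(prompt)
ablated = run(prompt, edits={"a": zero_ablate})

other_cache = {}
run(other_prompt, cache=other_cache)
patched = run(prompt, edits={"a": patch_with(other_cache["a"])})

steered = run(prompt, edits={"b": add_vector([1.0, -1.0], scale=2.0)})

print(clean, ablated, patched, steered)
```

In this toy case the edited outputs are easy to work out by hand. The eval proposed above would ask for the same kind of forecast on a real model, where knowing the training distribution alone tells you little about what happens off-distribution in activation space.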