TurnTrout comments on TurnTrout’s shortform feed

TurnTrout 9 Dec 2023 3:33 UTC
LW: 10 AF: 7
Some exciting new activation engineering papers:
- https://arxiv.org/abs/2311.09433 (using activation additions to adversarially attack LMs)
- https://arxiv.org/abs/2311.06668 (using activation additions instead of few-shot prompt demonstrations; this beats both finetuning and few-shot prompting, and the steering vectors compose: `add safe vector, subtract polite vector → safe but rude behavior`)
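
To make the "add safe vector, subtract polite vector" composition concrete, here is a toy sketch in plain Python. It is not the method from either paper. It assumes steering vectors are built as a difference of mean activations over contrasting examples and are then added to a hidden state with signed coefficients. The names `mean_difference` and `steer`, the 3-dimensional activations, and all the numbers are made up for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_difference(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Hypothetical steering vector: mean(positive) - mean(negative), per dimension."""
    dim = len(positive[0])
    pos_mean = [sum(v[i] for v in positive) / len(positive) for i in range(dim)]
    neg_mean = [sum(v[i] for v in negative) / len(negative) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden: Vector, additions: Sequence[Tuple[float, Vector]]) -> Vector:
    """Add each (coefficient, vector) pair to the hidden state, in order."""
    out = list(hidden)
    for coeff, vec in additions:
        out = [h + coeff * v for h, v in zip(out, vec)]
    return out


# Made-up activations: dimension 0 stands in for "safe", dimension 1 for "polite".
safe_vec = mean_difference(
    positive=[[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
    negative=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
)
polite_vec = mean_difference(
    positive=[[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]],
    negative=[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
)

hidden = [0.5, 0.5, 0.5]

# Compose: add the safe vector, subtract the polite vector.
steered = steer(hidden, [(1.0, safe_vec), (-1.0, polite_vec)])
print(steered)
```

In a real model the addition would happen inside the forward pass at a chosen layer (for example via a forward hook), and the coefficients and layer would need tuning. The sketch only shows the vector arithmetic that makes composition possible.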