Some exciting new activation engineering papers:
https://arxiv.org/abs/2311.09433 (using activation additions to adversarially attack LMs)
https://arxiv.org/abs/2311.06668 (using activation additions instead of few-shot prompt demonstrations, beating out both finetuning and few-shot prompting while also demonstrating that the vectors compose: `add safe vector, subtract polite vector → safe but rude behavior`)
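
For anyone who wants the composition idea in concrete terms, here is a toy sketch. It assumes steering vectors are mean differences between activations on contrasting prompt sets, and it uses random NumPy arrays in place of real model activations. The names `steering_vector` and `steer` are made up for illustration; this is not code from either paper, and the actual methods may differ.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Hypothetical helper: mean activation difference between two prompt sets."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, weighted_vectors):
    """Add a weighted sum of steering vectors to every token position."""
    delta = sum(coeff * vec for vec, coeff in weighted_vectors)
    return hidden + delta  # broadcasts (d_model,) over (seq_len, d_model)

rng = np.random.default_rng(0)
d_model = 8

# Random stand-ins for activations collected at one layer (assumption:
# 4 example prompts per set, one d_model-sized activation per prompt).
safe_vec = steering_vector(rng.normal(size=(4, d_model)),
                           rng.normal(size=(4, d_model)))
polite_vec = steering_vector(rng.normal(size=(4, d_model)),
                             rng.normal(size=(4, d_model)))

# Hidden states for a 5-token input at the same layer (also a stand-in).
hidden = rng.normal(size=(5, d_model))

# "Add safe vector, subtract polite vector."
steered = steer(hidden, [(safe_vec, 1.0), (polite_vec, -1.0)])
```

In a real model the addition would happen inside the forward pass at a chosen layer, and the coefficients would need tuning; the sketch only shows the vector arithmetic.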