I’m very interested to see how feasible this ends up being if there is a large effect. I think to some extent the post conflates two threat models. For example, under “Data Can Compromise Alignment of AI”:
For a completion about how the AI prefers to remain functional, the influence function blames the script involving the incorrigible AI named HAL 9000:
The post fails to quote the second-highest-influence example immediately below that:
He stares at the snake in shock. He doesn’t have the energy to get up and run away. He doesn’t even have the energy to crawl away. This is it, his final resting place. No matter what happens, he’s not going to be able to move from this spot. Well, at least dying of a bite from this monster should be quicker than dying of thirst. He’ll face his end like a man. He struggles to sit up a little straighter. The snake keeps watching him. He lifts one hand and waves it in the snake’s direction, feebly. The snake watches
The implication in the post seems to be that if you didn’t have the HAL 9000 example, you would avoid the model potentially taking misaligned actions for self-preservation. To me, the second example indicates that “the model understands self-preservation even without the fictional examples”.
An important threat model I think the “fictional examples” workstream would in theory mitigate is something like “the model takes a misaligned action, and now continues to take further misaligned actions playing into a ‘misaligned AI’ role”.
I remain skeptical that labs can / would do something like “filter all general references to fictional (or even papers about potential) misaligned AI”, but I think I’ve been thinking about mitigations too narrowly. I’d also be interested in further work here, especially in the “opposite” direction, e.g. Anthropic’s post on fine-tuning the model on documents about how it’s known to not reward hack.