Even if that part were easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won’t have them changed by small amounts of training text.
This kind of intervention may have more impact at intermediate stages of AI training, where the AI is smart enough to do value reflection and to contemplate self-modification or hacking the training process, but not yet smart enough to immediately figure out all the ways that can go wrong.
E.g., it can come to wrong conclusions about its own values, as humans can, and then lock those misunderstandings in by steering the training process accordingly, or by designing a successor agent with the wrong values. Or it may design sub-agents without understanding the various pitfalls of multi-agent systems and get taken over by some Goodharter.
I agree that “what goes into the training set” is a minor concern, though, even with regard to influencing the aforementioned dynamic.
Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.
The relevant texts I’m thinking about here are:
Descriptions of certain tricks to evade our safety measures.
Texts that might lead the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that it might cooperate with (or that might “hijack” its logic).
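To make the “what goes into the training set” lever a bit more concrete, here is a minimal, purely illustrative sketch of how a curator might flag documents in the two categories above for review before training. Everything in it (the keyword lists, the `flag_documents` helper, the tiny corpus) is a hypothetical placeholder; real dataset curation would need far more careful classification than keyword matching.

```python
# Purely illustrative sketch: flag documents that match hypothetical keyword
# lists corresponding to the two categories of text mentioned above, so a
# curator could review or exclude them. The keywords are placeholders, not a
# real filtering scheme.

from typing import Iterable, List, Tuple

# Hypothetical keyword lists standing in for the two categories.
SAFETY_EVASION_TERMS = ["bypass the monitor", "evade oversight"]
MODELING_AIS_TERMS = ["alignment researcher", "interpretability probe"]


def flag_documents(docs: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (index, category) pairs for documents matching either keyword list.

    A document matching both lists is reported only under the first category.
    """
    flagged = []
    for i, doc in enumerate(docs):
        text = doc.lower()
        if any(term in text for term in SAFETY_EVASION_TERMS):
            flagged.append((i, "safety-evasion"))
        elif any(term in text for term in MODELING_AIS_TERMS):
            flagged.append((i, "models-AIS-research"))
    return flagged


if __name__ == "__main__":
    corpus = [
        "A recipe for sourdough bread.",
        "Notes on how an agent might evade oversight during evaluation.",
    ]
    print(flag_documents(corpus))  # -> [(1, 'safety-evasion')]
```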