My usual take here is the “Nobody Cares” model, though there is one scenario you didn’t address that I tend to worry about a bit: how to think about whether or not you want things ending up in the training data for a future AI system. That’s a scenario where the “Nobody Cares” model really doesn’t apply, since the AI actually does have time to look at everything you write.
That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which I think is pretty important. However, it can also help AI systems better understand things like how to be deceptive, so this sort of thing can be a bit tricky.
Worrying about which alignment writing ends up in the training data feels like a very small lever for affecting alignment; my general heuristic is that we should try to focus on much bigger levers.
Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or, even if that part was easy, would you still feel that the lever is very small?
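To be concrete about what I mean by “exclude”: the crude version would be a blocklist applied when a crawl is turned into a training corpus. Here is a minimal sketch (the domain names, document format, and helper names are all made up for illustration), assuming each crawled document carries its source URL:

```python
# Minimal sketch of dataset-construction-time exclusion (hypothetical document
# format: each crawled document is a dict with "url" and "text" keys).
from urllib.parse import urlparse

BLOCKLISTED_DOMAINS = {
    "example-alignment-blog.org",  # made-up entries, purely illustrative
    "example-ais-wiki.net",
}

def is_blocklisted(doc: dict) -> bool:
    """True if the document's source domain (or a subdomain of it) is blocklisted."""
    domain = urlparse(doc["url"]).netloc.lower()
    return any(domain == d or domain.endswith("." + d) for d in BLOCKLISTED_DOMAINS)

def filter_corpus(docs):
    """Yield only documents whose source domain is not blocklisted."""
    return (doc for doc in docs if not is_blocklisted(doc))

# Usage:
corpus = [
    {"url": "https://example.com/some-post", "text": "..."},
    {"url": "https://example-ais-wiki.net/evading-oversight", "text": "..."},
]
kept = list(filter_corpus(corpus))  # keeps only the first document
```

Crawler-level exclusion would be the even cruder version of the same idea, e.g. robots.txt rules aimed at the crawlers’ user agents.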
Even if that part was easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won’t have them changed by small amounts of training text.
The training data may have more impact at intermediate stages of AI training, where the AI is smart enough to do value reflection and to contemplate self-modification or hacking the training process, but not smart enough to immediately figure out all the ways that can go wrong.
E.g., it can come to wrong conclusions about its own values, as humans can, then lock these misunderstandings in by steering the training process accordingly, or by designing a successor agent with the wrong values. Or it may design sub-agents without understanding the various pitfalls of multi-agent systems, and get taken over by some Goodharter.
I agree that “what goes into the training set” is a minor concern, though, even with regard to influencing that dynamic.
Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.
The relevant texts I’m thinking about here are:
- Descriptions of certain tricks to evade our safety measures.
- Texts that might cause the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that the model might cooperate with (or that might “hijack” the model’s logic).
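To be concrete, the crude version of excluding these would be a content-level filter applied when the training corpus is assembled. A minimal sketch (the patterns and helper names are made up, not a real exclusion list):

```python
# Minimal sketch of a content-level filter at corpus-assembly time.
# The patterns below are made-up illustrations, not a real exclusion list.
import re

EXCLUSION_PATTERNS = [
    re.compile(r"evad\w* (our |the )?safety measures", re.IGNORECASE),
    re.compile(r"hijack\w* the model", re.IGNORECASE),
]

def should_exclude(text: str) -> bool:
    """True if the text matches any exclusion pattern."""
    return any(p.search(text) for p in EXCLUSION_PATTERNS)

def filter_texts(texts):
    """Keep only texts that match none of the exclusion patterns."""
    return [t for t in texts if not should_exclude(t)]

# Usage:
docs = ["A post about gardening.", "How to evade the safety measures of a lab."]
print(filter_texts(docs))  # -> ['A post about gardening.']
```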
That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which I think is pretty important.
That consideration seems relevant only for language models that will be doing/supporting alignment work.