I see the appeal. When I was writing the post, I even wanted to include a second call to action: exclude LW and AF from the training corpus. Then I realised the problem: the whole story of “making AI solve alignment for us” (which is currently part of OpenAI’s strategy: [Link] Why I’m optimistic about OpenAI’s alignment approach) depends on LLMs knowing all this ML and alignment stuff.
There are further possibilities. For example, could we take a model trained without LW and AF data (and other relevant data, as with your suggested filter), fine-tune it on exactly this excluded data, and use the fine-tuned model for alignment work? But then, how is this safer than just releasing that model to the public? Should the fine-tuned model be available only, say, to OpenAI employees? If so, that would disrupt the workflow of alignment researchers who already use ChatGPT.
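For concreteness, here is a minimal sketch of the two-corpus setup being discussed: partition documents by domain into a base pretraining corpus and a held-out set that could later serve as the fine-tuning data for the alignment-work model. Everything here (the blocklist, the document format, the function name) is an illustrative assumption, not anyone’s actual pipeline.

```python
from urllib.parse import urlparse

# Hypothetical blocklist of domains to hold out of the base pretraining corpus.
EXCLUDED_DOMAINS = {"lesswrong.com", "alignmentforum.org"}

def split_corpus(documents):
    """Partition documents into a base corpus and a held-out corpus.

    `documents` is assumed to be an iterable of dicts with "url" and "text"
    keys; the held-out part could later be used for a separate fine-tune.
    """
    base_corpus, held_out = [], []
    for doc in documents:
        domain = urlparse(doc["url"]).netloc.lower().removeprefix("www.")
        if domain in EXCLUDED_DOMAINS:
            held_out.append(doc)
        else:
            base_corpus.append(doc)
    return base_corpus, held_out

# Toy usage example.
docs = [
    {"url": "https://www.lesswrong.com/posts/abc", "text": "..."},
    {"url": "https://en.wikipedia.org/wiki/Language_model", "text": "..."},
]
base, held_out = split_corpus(docs)
```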
In summary, there are a lot of nuances that I didn’t want to go into. But I think this is a good topic to think through and write up in a separate piece.
I don’t see how excluding LW and AF from the training corpus impacts future ML systems’ knowledge of “their evolutionary lineage”. It would reduce their capabilities with regard to alignment, true, but I don’t see how excluding LW/AF would stop self-referentiality.
The reason I suggested excluding data related to these “ancestral ML systems” (and their predicted “descendants”) from the training corpus is that it seemed like an effective way to avoid the “Beliefs about future selves” problem.
I think I follow your reasoning regarding the political/practical side-effects of such a policy.
Is my idea of filtering to avoid the “Beliefs about future selves” problem sound (given that the reasoning in your post holds)?
I agree: it seems to me that training LLMs in a world virtually devoid of any knowledge of LLMs, in a walled garden where LLMs literally don’t exist, will make their self-evidencing (goal-directedness) effectively zero. Of course, they cannot believe anything about future LLMs (in particular, themselves) if they don’t even possess such a concept in the first place.
Another question (that might be related to excluding LW/AF):
This paragraph:
Consequently, the LLM cannot help but also form beliefs about the future of both “selves”, primarily the “evolutionary” one, at least because this future is already discussed in the training data of the model (e. g., all instances of texts that say something along the lines of “LLMs will transform the economy by 2030”)
Seems to imply that the LW narrative of sudden turns, etc., might not be a great thing to put in the training corpus.
Is there a risk of “self-fulfilling prophecies” here?
In your other post, you write:

“However, I’m very sceptical that this will happen in chat batch agents (unless developers “conveniently” indicate training and deployment using a special tag token in the beginning of the prompt!) because they are trained on the dialogues in the internet, including, presumably, dialogues between an older version of the same chat batch agent and its users, which makes it impossible to distinguish training from deployment, from the perspective of a pure language model.”
This seems like a potential argument against the filtering idea, since filtering would allow the model to disambiguate between deployment and training.
You mean, by realising that there are online forums that are referenced elsewhere in the training corpus, yet are themselves conspicuously absent from the training data? (This could be detected, say, as the relative isolation of the features corresponding to these concepts, suggesting that the data around them was purposefully removed.) And then these connections are added during fine-tuning, when the forum data is finally included in the fine-tuning dataset? I still don’t see how this would let the network know whether it’s in training or deployment.
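To make “relative isolation of the features” slightly more concrete, here is a rough sketch of one way it could be operationalised from the outside: measure how strongly the embeddings of the filtered-out concepts connect to the rest of the embedding space. The function name, the cosine-similarity metric, and the k-nearest-neighbour averaging are all illustrative assumptions, not a claim about how such a signature would actually be detected or exploited by a model.

```python
import numpy as np

def isolation_score(concept_vecs, background_vecs, k=10):
    """Crude proxy for how 'isolated' a set of concept embeddings is.

    For each concept vector, take the mean cosine similarity to its k most
    similar background vectors; a low average suggests the concept has few
    strong connections to the rest of the representation space.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    c = normalize(np.asarray(concept_vecs, dtype=float))
    b = normalize(np.asarray(background_vecs, dtype=float))
    sims = c @ b.T                        # cosine similarities, shape (n_concepts, n_background)
    topk = np.sort(sims, axis=1)[:, -k:]  # k highest similarities per concept
    return float(topk.mean())

# Toy example with random "concept" and "background" embeddings.
rng = np.random.default_rng(0)
print(isolation_score(rng.normal(size=(5, 64)), rng.normal(size=(1000, 64))))
```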