I don’t see how excluding LW and AF from the training corpus impacts future ML systems’ knowledge of “their evolutionary lineage”. It would reduce their capabilities with regard to alignment, true, but I don’t see how excluding LW/AF would stop self-referentiality.
The reason I suggested excluding data related to these “ancestral ML systems” (and their predicted “descendants”) from the training corpus is that it seemed like an effective way to avoid the “Beliefs about future selves” problem.
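To make the proposal concrete, here is a minimal sketch of the kind of corpus filter I have in mind, assuming a crude keyword heuristic. The term list and the `is_llm_related` / `filter_corpus` names are hypothetical illustrations; any real filter would need a far more robust classifier than substring matching.

```python
# Minimal sketch of a corpus filter excluding LLM-related documents.
# Keyword list and function names are hypothetical illustrations only;
# a serious filter would use a trained classifier, not substring checks.

LLM_RELATED_TERMS = {
    "language model", "llm", "gpt", "transformer",
    "neural network", "machine learning",
    "lesswrong", "alignment forum",
}

def is_llm_related(document: str) -> bool:
    """Crude heuristic: flag any document mentioning an LLM-related term."""
    text = document.lower()
    return any(term in text for term in LLM_RELATED_TERMS)

def filter_corpus(corpus):
    """Yield only documents with no apparent reference to ML systems."""
    for doc in corpus:
        if not is_llm_related(doc):
            yield doc

# Example: filter_corpus(["The cat sat.", "GPT is a language model."])
# would yield only the first document.
```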
I think I follow your reasoning regarding the political/practical side-effects of such a policy.
Is my idea of filtering to avoid the “Beliefs about future selves” problem sound (given that the reasoning in your post holds)?
I agree: it seems to me that training LLMs in a world virtually devoid of any knowledge of LLMs, in a walled garden where LLMs literally don’t exist, will make their self-evidencing (goal-directedness) effectively zero. Of course, they cannot believe anything about future LLMs (in particular, themselves) if they don’t even possess such a concept in the first place.