I’m very interested in this agenda—I believe this is one of the many hard problems one needs to make progress on to make optimization-steering models a workable path to an aligned foom.
I have slightly different thoughts on how we can and should solve the problems listed in the “Risks of data driven improvement processes” section:
Improve the model’s epistemology first. This then allows the model to reduce its own biases, preventing bias amplification. This also solves the positive feedback loops problem.
Data poisoning is still a problem worth some manual effort: we should prevent the AI from being exposed to adversarial inputs that break its cognition at its current capability level. But assuming the model improves its epistemology, it can also reduce the effects of data poisoning on itself.
Semantic drift is less relevant than most people think. I believe epistemic legibility is overrated: while it is important for an AI to be able to communicate coherent and correct reasoning for its decisions, I expect the AI can actively red-team and correct for semantic drift in the situations where such communication actually matters.
Cross-modal semantic grounding seems more of a capabilities problem than an alignment problem, and I think it can be delegated to the AI itself as its capabilities increase.
Value drift is an important problem, and I roughly agree with the list of problems you specify in that sub-section. I do believe this too can mostly be delegated to the AI, though, provided we use non-value-laden approaches to increase the AI’s capabilities until it can help with alignment research.