why, concretely, isn't that how it works? isn't transformative ai most likely to be raised in a memetic environment and heavily shaped by it? isn't interpretability explicitly for the purpose of checking what effect the training data had on an ai? it doesn't seem to me that the training corpus is an irrelevant question. we should be spending a lot of thought on how to design good training corpora, e.g. LOVE in a simbox, which is explicitly a design for generating an enormous amount of extremely safe training data.
Well, yes, it will be shaped by what it learns, but what you train it on is not the crux, because you don't get to limit the inputs and then hope it only learns "good behavior". Human values are vague and complex; they're not something to optimize for, but more of a rough guardrail. All of human output, informational and physical, is relevant here. "Safe" training data is asking for trouble once the AI learns that the real world is not at all like its training data.