In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes, even in the hands of bad actors?
I think this is an important question to ask, but “even in the hands of bad actors” is just too difficult a place to start. I’m sure you’re aware, but it’s an unsolved problem whether there exists a dataset / architecture / training procedure such that “generating extrapolations from it leads to good outcomes,” for sufficiently capable ML models, even in the hands of good actors. (And the “bad actor” piece can at least plausibly be solved by social coordination, whereas the remaining portion is a purely technical problem you can’t dodge.)
But if you drop the bad actor part, I think this question is a good one to ask (but still difficult)! I think answering this question requires a better understanding of how neural networks generalize, but I can at least see worlds where the answer is “yes”. (Though there are still pitfalls in how you instantiate this in reality—does your dataset need to be perfectly annotated, so that truth-telling is incentivized over sycophancy/deception? Does it require SGD to always converge to the same generalization behavior? etc.)
Ok, let’s assume good actors all around. Imagine we have a million good people volunteering to generate/annotate/curate the dataset, and the eventual user of the AI will also be a good person. What should we tell these million people, what kind of dataset should they make?
To be clear, I don’t know the answer to this!
Spitballing here, the key question to me seems to be about the OOD generalization behavior of ML models. Many different models can achieve similarly low loss on the training distribution while behaving very differently on real-world inputs, so we need to know which generalization strategies are likely to be learned for a given architecture, training procedure, and dataset. There is some evidence in this direction suggesting that ML models are biased towards a simplicity prior over generalization strategies.
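To make that concrete, here's a toy sketch of my own (polynomial fits as a crude stand-in for "different generalization strategies"; nothing here is specific to any particular model or dataset): two fits reach low loss on the same training data but give completely different answers once you leave the training range.

```python
# Toy illustration: models with similar training loss can diverge out of distribution.
# Polynomial fits stand in for "different generalization strategies" here.
import numpy as np

rng = np.random.default_rng(0)

# Training distribution: x in [0, 1], ground truth y = x (plus a little label noise)
x_train = rng.uniform(0.0, 1.0, size=20)
y_train = x_train + 0.05 * rng.normal(size=20)

# Two hypotheses, both with low loss on the training distribution
simple_fit = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)    # the "simple" strategy
complex_fit = np.polynomial.Polynomial.fit(x_train, y_train, deg=15)  # fits the data at least as well

for name, model in [("simple", simple_fit), ("complex", complex_fit)]:
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    ood_pred = model(3.0)  # an input far outside the training range
    print(f"{name}: train MSE = {train_mse:.2e}, prediction at x=3.0: {ood_pred:.3g}")

# Both fits report low training loss, but the degree-15 one can say almost anything
# at x=3.0 -- the training data alone doesn't pin down out-of-distribution behavior.
```

The simplicity-prior claim is that training reliably lands on something like the first fit rather than the second.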
If this is true, then the incredibly handwave-y solution is to just create a dataset where the simplest (good) process for estimating labels is to emulate an aligned human. At first pass this actually looks quite easy—it’s basically what we’re doing with language models already.
Unfortunately, there's quite a lot we've swept under the rug. In particular, this may not scale as models get more powerful: the prior towards simplicity can be overcome whenever a more complex strategy achieves lower loss, and if the dataset contains some labels that humans unknowingly rated incorrectly, the best process for estimating labels is to report what humans believe is true rather than what actually is. You can already see this in the sycophancy problems today's LLMs are having.
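To put toy numbers on the mislabeling point (these are made up purely for illustration): if annotators are wrong on 10% of items, the training labels strictly prefer a model that reports what the annotator believes over one that reports the truth.

```python
# Toy sketch with made-up numbers: systematically mislabeled data rewards
# "predict what the annotator believes" over "predict what is actually true".
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

truth = rng.integers(0, 2, size=n)                    # ground-truth answers (0/1)
annotator_wrong = rng.random(n) < 0.10                # annotator errs on 10% of items
labels = np.where(annotator_wrong, 1 - truth, truth)  # what actually lands in the dataset

# Two candidate "processes for estimating labels"
honest_preds = truth        # a model that reports the truth
sycophant_preds = labels    # a model that reports what the annotator believes

honest_loss = np.mean(honest_preds != labels)       # 0-1 loss against the training labels
sycophant_loss = np.mean(sycophant_preds != labels)

print(f"honest model loss:      {honest_loss:.2f}")     # ~0.10
print(f"sycophantic model loss: {sycophant_loss:.2f}")  # 0.00
# Training loss alone prefers the sycophant, so a sufficiently capable model has an
# incentive to learn "what the annotator believes" rather than "what is true".
```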
There are a lot of other thorny problems in this vein that a few minutes of thinking will turn up. That being said, it doesn't seem completely doomed to me! There just needs to be a lot more work here. (But I haven't spent too long thinking about this, so I could be wrong.)