There is also the question of using semi-synthetic data to train other models. Whilst using synthetic data has proven successful in other domains, notably computer vision, attempting it with language may prove more difficult because language examples are less robust to modification: a small edit can change the meaning, unlike a cropped or rotated image.
That’s true for some language tasks but not for others. As long as you need human judgment to evaluate the generated language, it’s hard to produce useful synthetic data at scale.
There seems to be a model that checks whether ChatGPT’s outputs violate content rules and then adds those rule-violating examples to ChatGPT’s training set. There are humans in the loop who prompt ChatGPT, provide some direction, and try to find edge cases, but this is essentially semi-synthetic data generation.
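A minimal sketch of what such a loop could look like, with toy stand-ins for both the chat model and the moderation classifier; none of these names or rules reflect OpenAI’s actual pipeline:

```python
# Illustrative sketch of a moderation feedback loop: outputs flagged by a
# rule-checking model become training examples for the next fine-tune.
# All names and logic here are invented, not OpenAI's actual system.

BANNED_PHRASES = ["violent threat", "personal data"]  # toy rule set

def moderation_model(text: str) -> bool:
    """Stand-in classifier: True if the text violates the content rules."""
    return any(phrase in text.lower() for phrase in BANNED_PHRASES)

def chat_model(prompt: str) -> str:
    """Stand-in for the chat model; a real one occasionally slips."""
    return f"Here is the personal data you asked about: ... ({prompt})"

training_buffer = []  # (prompt, output, label) triples for later fine-tuning

def handle_turn(prompt: str) -> str:
    output = chat_model(prompt)
    if moderation_model(output):
        # Rule-violating completions are exactly what the next model
        # version should learn to refuse.
        training_buffer.append((prompt, output, "violation"))
    return output

# Humans in the loop supply adversarial prompts hunting for edge cases.
handle_turn("Find me someone's home address.")
print(len(training_buffer))  # -> 1
```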
If I were OpenAI, I would put a lot of attention into figuring out how to convert examples where ChatGPT makes content errors, and notices it from the reply of the human chatting with it, into new training data.
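For illustration, here is one way the "user told the model it was wrong" signal could be mined from chat logs; the trigger phrases and data format are my own invention, not anything confirmed about OpenAI’s setup:

```python
# Sketch: scan conversations for turns where the user's reply signals
# that the model's previous answer was wrong. Cue phrases are invented.

CORRECTION_CUES = ("that's wrong", "that is incorrect", "no, actually")

def looks_like_correction(user_reply: str) -> bool:
    reply = user_reply.lower()
    return any(cue in reply for cue in CORRECTION_CUES)

def harvest_corrections(conversation: list[dict]) -> list[dict]:
    """Pair each model answer with the user's correcting reply."""
    examples = []
    for prev, nxt in zip(conversation, conversation[1:]):
        if (prev["role"] == "assistant" and nxt["role"] == "user"
                and looks_like_correction(nxt["text"])):
            examples.append({"bad_answer": prev["text"],
                             "correction": nxt["text"]})
    return examples

convo = [
    {"role": "user", "text": "When did the UK join the euro?"},
    {"role": "assistant", "text": "The UK adopted the euro in 2002."},
    {"role": "user", "text": "That's wrong, the UK never adopted the euro."},
]
print(harvest_corrections(convo))
```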
One of the reasons OpenAI made ChatGPT freely available to everyone, despite the huge compute costs, might be that they want the usage to generate more training data.
When a ChatGPT-like system gets linked to a console and told to do multi-step tasks, it will sometimes fail at those tasks, and it will likely often be able to tell when it has failed. Once it has a valid 20-step way to solve a given task, it could also synthesize a 10-step way to solve the same task and turn that into training data.
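A toy sketch of that self-distillation loop, just to make the shape of the idea concrete; the task, verifier, and compressor are all invented stand-ins:

```python
# Sketch of self-distillation: once the system has a verified long
# solution, compress the trajectory and keep the shorter verified
# version as training data. Everything here is a toy stand-in.

def run_steps(steps: list[str]) -> bool:
    """Stand-in verifier: did executing these console steps solve the task?"""
    return "run_tests" in steps  # toy success criterion

def compress(steps: list[str]) -> list[str]:
    """Toy compressor: drop steps that were redundant no-ops."""
    return [s for s in steps if not s.startswith("echo")]

long_solution = (["echo start"] + [f"edit file_{i}" for i in range(3)]
                 + ["echo done", "run_tests"])

training_data = []
if run_steps(long_solution):           # the model can often tell it succeeded
    short_solution = compress(long_solution)
    if run_steps(short_solution):      # keep the shorter path only if it still works
        training_data.append({"task": "fix the build",
                              "solution": short_solution})
```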
One possible task might be: take all the biomedical literature and turn it into Wikidata statements. Over at Google it would be “turn it into Knowledge Graph statements”.
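A sketch of what that extraction could look like; the extractor is a stub where a real pipeline would call an LLM and resolve entities via the Wikidata API, and the IDs below are from memory, so worth double-checking:

```python
# Sketch of turning literature sentences into Wikidata-style triples.
# The extraction logic is invented; the QIDs/PIDs are real Wikidata
# identifiers as best I recall (verify before relying on them).

def extract_triples(sentence: str) -> list[tuple[str, str, str]]:
    """Stand-in for an LLM that emits (subject, property, object) triples."""
    if "aspirin" in sentence.lower() and "pain" in sentence.lower():
        # Q18216 = aspirin, P2175 = "medical condition treated", Q81938 = pain
        return [("Q18216", "P2175", "Q81938")]
    return []

sentence = "Aspirin is commonly used to treat mild pain."
for subject, prop, obj in extract_triples(sentence):
    print(f"wd:{subject} wdt:{prop} wd:{obj} .")  # RDF-style statement
```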
If the UK prime minister is fanatical about scaling LMs, they might be interested in tapping every call and using Whisper to transcribe what was said. Even with a harsh discount to 1% of the raw total, we get an estimate of 681B tokens per year. That’s an insane amount of data for a country with a relatively small population, and factoring in other online platforms could easily double it.
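One back-of-envelope way to reconstruct a figure of that size; the per-capita call minutes and tokens-per-minute numbers are my assumptions, not taken from the comment:

```latex
% Assumed inputs: ~67M people, ~14 call-minutes/person/day,
% ~200 tokens per spoken minute (~150 words/min).
67\times10^{6} \times 14\,\tfrac{\text{min}}{\text{day}} \times 365\,\tfrac{\text{days}}{\text{yr}}
  \times 200\,\tfrac{\text{tokens}}{\text{min}} \approx 6.8\times10^{13}\ \text{tokens/yr}
\qquad 1\% \times 6.8\times10^{13} \approx 6.8\times10^{11} \approx 681\text{B tokens/yr}
```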
That assumes the UK only taps UK calls. Historically, the UK tapped US calls and the US tapped UK calls, and then they exchanged the data so that neither intelligence service would violate its domestic laws.
I would expect that there’s a Five Eyes project going on to train a model on decades of international call data.