The next wave of model improvements will be due to data quality

Published on June 28, 2025 2:34 PM GMT

When talking about why models improve, people frequently focus on algorithmic improvements and on hardware improvements. This leaves out improvements in data quality.

DeepResearch produces a huge quantity of high-quality analysis reports. GPT5 will almost certainly be trained on all the DeepResearch requests of users who haven't opted out of their data being used for training (excluding requests where there's reason to believe the reports are wrong, such as bad user feedback). This means that when users ask GPT5 questions that no human has ever written an analysis of, GPT5 might still get the facts right because of past DeepResearch reports.

It also means that the big companies with a massive user base producing a massive number of DeepResearch requests will have a leg up that's hard to copy for smaller players who don't have as many DeepResearch reports.

OpenAI Operator asks the user after every task whether it completed the task successfully. This is very valuable training data that will improve Operator even without additional algorithmic improvements.

OpenAI Codex currently lets the user run a task four times in parallel. If the user then picks one of the results for a PR, or no result at all, that's a very valuable training signal. Even for accepted PRs, it's possible to gain more information about their quality by looking at whether the new code is long-lived or soon gets deleted for being low quality.

The present models are mostly not trained on real-world feedback data of the quality that OpenAI Operator and OpenAI Codex are now able to provide. Of course, Google and Anthropic have their own versions of these features that will provide them with data as well.
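To make the pick-one-of-four signal concrete, here is a minimal sketch of how such a choice could be turned into preference pairs for preference-based fine-tuning (DPO-style). Everything here is an assumption for illustration: `Attempt`, `build_preference_pairs`, and the data shapes are hypothetical names, not OpenAI's actual pipeline.

```python
# Hypothetical sketch: converting a "pick one of four results" event into
# (chosen, rejected) preference pairs. Names are illustrative, not a real API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attempt:
    prompt: str  # the task the model was given
    diff: str    # the candidate patch this parallel run produced

def build_preference_pairs(
    attempts: list[Attempt], picked: Optional[int]
) -> list[tuple[str, str]]:
    """Return (chosen, rejected) diff pairs from one task's parallel runs.

    If the user picked attempt `picked`, every other attempt becomes a
    rejected counterpart. If the user picked nothing, there is no clear
    winner, so no pairs are emitted (those runs could instead be logged
    as unpaired negative examples).
    """
    if picked is None:
        return []
    chosen = attempts[picked]
    return [
        (chosen.diff, attempt.diff)
        for i, attempt in enumerate(attempts)
        if i != picked
    ]

# Example: four parallel runs on one task; the user opens a PR from run 2.
attempts = [Attempt("fix the login bug", f"diff-{i}") for i in range(4)]
pairs = build_preference_pairs(attempts, picked=2)
# Yields three pairs, each with "diff-2" on the chosen side.
```

The key property is that a single user click fans out into several labeled comparisons, which is why even modest product usage can generate a lot of preference data.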
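The long-lived-code signal can likewise be sketched as a simple survival rate: of the lines a PR added, how many still exist after some observation window. This is a toy illustration under assumed inputs (real pipelines would diff actual git history); `survival_rate` is a hypothetical name.

```python
# Hypothetical sketch of a delayed PR-quality label: the fraction of a
# PR's added lines that are still present in the codebase N days later.
def survival_rate(added_lines: set[str], lines_alive_later: set[str]) -> float:
    """Fraction of the PR's added lines that survived the window."""
    if not added_lines:
        return 0.0
    return len(added_lines & lines_alive_later) / len(added_lines)

# Example: the PR added three lines; one was later deleted as low quality.
added = {"def parse():", "    return x", "# helper"}
later = {"def parse():", "# helper", "unrelated line"}
rate = survival_rate(added, later)  # 2 of 3 added lines survived
```

A low survival rate would downgrade a PR that the user originally accepted, giving a second, slower feedback signal on top of the immediate accept/reject one.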