- 11/21/2024
FCAT researcher David Bracken focuses on New Business Foundations. He digs into the newest ideas that companies are leveraging to grow revenue and has a special interest in emerging technologies, including the ways in which customers will use them. Through his work, he has researched everything from the impact of memes on our culture to blockchain technologies and social ties in the digital age.
Lately, he has been exploring the opportunities and challenges surrounding synthetic data: artificially generated data that mimics the statistical properties of real data without containing private or personally identifying information.
Q: Why is synthetic data a hot topic right now?
A: The foundational generative AI models currently in-market have largely been trained on the enormous amounts of data that companies have scraped from the internet. Now, they are running out of new data to use, which has led to increasing experimentation with synthetic data to solve some of these data scarcity issues.
Synthetic data is not new. Autonomous-driving companies have been using it for some time, and interest also picked up significantly when more stringent privacy laws were passed in Europe about six years ago. Companies began looking into whether synthetic data could help them get around some of these regulations, but generative AI has triggered a new, growing wave of interest in the technology.
Q: Who is most interested in using synthetic data?
A: One of the main reasons that synthetic data is attractive — particularly to companies that are heavily regulated — is that some standard ways of scrubbing data can be reverse-engineered. They are not foolproof. So, organizations are interested in finding better approaches to strip out identifying factors, but in such a way that the data remains valuable for their purposes.
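To see why scrubbing is not foolproof, consider a classic linkage attack: joining a "de-identified" dataset to a public one on quasi-identifiers such as ZIP code and birth date. The sketch below uses entirely hypothetical records.

```python
import pandas as pd

# Hypothetical "scrubbed" dataset: names removed, quasi-identifiers kept.
scrubbed = pd.DataFrame({
    "zip": ["02109", "02138", "02139"],
    "birth_date": ["1984-03-12", "1990-07-01", "1975-11-30"],
    "diagnosis": ["diabetes", "asthma", "hypertension"],
})

# Hypothetical public record (e.g., a voter roll) sharing those quasi-identifiers.
public = pd.DataFrame({
    "name": ["A. Smith", "B. Jones", "C. Lee"],
    "zip": ["02109", "02138", "02139"],
    "birth_date": ["1984-03-12", "1990-07-01", "1975-11-30"],
})

# A simple join on the quasi-identifiers restores the link between names and
# sensitive attributes that scrubbing was meant to break.
reidentified = public.merge(scrubbed, on=["zip", "birth_date"])
print(reidentified)
```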
Synthetic data vendors can create new, fully anonymous datasets by training models on the statistical properties of the data without having them memorize any personal information.
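As a toy illustration of that idea (not any vendor's actual method), the sketch below learns only the mean and covariance of a small numeric dataset and then samples brand-new records from the fitted distribution; no original row is ever copied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical real data: each row is a customer (age, income, balance).
real = np.array([
    [34, 72_000, 5_400],
    [51, 98_000, 12_100],
    [29, 55_000, 2_300],
    [45, 87_000, 9_800],
    [38, 64_000, 4_700],
], dtype=float)

# Learn only aggregate statistical properties, never individual rows.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample entirely new records that match those statistics.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
print(synthetic[:3].round(1))
```

Production-grade generators are far more sophisticated, and often layer on formal guarantees such as differential privacy, but the division of labor is the same: fit aggregate structure, then sample from the fit.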
Q: Once they have the synthetic data, how do they apply it?
A: There are a lot of different use cases to consider. It can be implemented in places where companies aren’t able to use traditional data or there isn’t enough data to do what they need to do. Fighting fraud is a good example. Organizations might not have much data on a particular kind of fraud, but they want to train their models so that such fraud can be automatically identified in their systems. So, one option is to create synthetic data that looks like the fraud they are trying to catch, which helps their models get better at uncovering potentially fraudulent activity.
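One simple way to do this is a SMOTE-style interpolation, shown below as an illustration rather than a description of any firm's pipeline (real SMOTE interpolates between nearest neighbors; this sketch uses random pairs, and all feature values are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical feature vectors for the few confirmed fraud cases on hand:
# transaction amount, transactions per hour, account age in years.
fraud = np.array([
    [980.0, 3, 0.2],
    [1200.0, 5, 0.1],
    [860.0, 4, 0.3],
])

def synthesize(minority, n_new):
    """Create new samples by interpolating between random pairs of
    minority-class records (the core idea behind SMOTE)."""
    i = rng.integers(0, len(minority), size=n_new)
    j = rng.integers(0, len(minority), size=n_new)
    t = rng.random((n_new, 1))  # random position between the two parents
    return minority[i] + t * (minority[j] - minority[i])

# Augment the rare fraud class so a classifier has more examples to learn from.
synthetic_fraud = synthesize(fraud, n_new=500)
print(synthetic_fraud[:3].round(2))
```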
Synthetic data can also be used for customer acquisition and onboarding, as well as software testing. Firms are exploring whether synthetic data can help get their software to market faster, since it can give software engineers quicker access to the production-like data they need to move projects forward.
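For the testing case, teams often generate realistic but entirely fabricated records. Here is a minimal sketch using the open-source Faker library; the record schema is hypothetical.

```python
from faker import Faker  # pip install faker

fake = Faker()
Faker.seed(0)  # reproducible test fixtures

def make_test_customer():
    """Generate one realistic-looking but entirely fabricated customer record."""
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        "account_opened": fake.date_between(start_date="-10y").isoformat(),
    }

# A batch of safe stand-ins for production records in integration tests.
fixtures = [make_test_customer() for _ in range(5)]
print(fixtures[0])
```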
Q: Are there examples where synthetic data is not just faster but better?
A: A lot of traditional datasets are problematic because they are not representative of society or of the marketplace for a product. This can lead to biased analysis and decision-making. Synthetic data representing underrepresented groups can be generated to help correct these imbalances.
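A crude sketch of the rebalancing idea, using simple resampling rather than true synthesis (the column names are hypothetical; in practice one would generate genuinely new records, for example with an interpolation approach like the one above):

```python
import pandas as pd

# Hypothetical training data with a skewed demographic mix: 90% group A.
df = pd.DataFrame({
    "group": ["A"] * 90 + ["B"] * 10,
    "feature": range(100),
})

# Upsample every group to the size of the largest one so a downstream
# model no longer sees the majority group almost exclusively.
target = int(df["group"].value_counts().max())
balanced = (
    df.groupby("group", group_keys=False)
      .apply(lambda g: g.sample(n=target, replace=True, random_state=0))
)
print(balanced["group"].value_counts())  # A: 90, B: 90
```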
Q: Where can synthetic data use go wrong?
A: Researchers have started to explore what happens when large language models are trained on significant amounts of synthetic data. Some of this research has found that training on synthetic data can cause the models to rapidly deteriorate, a phenomenon often referred to as model collapse.
Others have been exploring whether synthetic data can introduce more bias into AI models and whether it might complicate our ability to understand and interpret generative AI decision-making.
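A toy numerical analogue of the collapse dynamic Bracken describes (a stand-in for the LLM setting, not a reproduction of any specific study): repeatedly fit a simple model to data, then train the next "generation" only on samples from that fit. Tail information is lost each round, and the distribution degenerates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Generation 0: a small sample from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=10)

for gen in range(1, 301):
    # Fit a "model" (here just a Gaussian's mean and std) to the current data,
    # then train the next generation exclusively on that model's own output.
    mu, sigma = data.mean(), data.std()
    data = rng.normal(mu, sigma, size=10)
    if gen % 50 == 0:
        print(f"generation {gen}: std = {sigma:.6f}")

# The estimated spread collapses toward zero over generations: a small-scale
# analogue of what repeated training on synthetic output can do to a model.
```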
Q: More companies may be experimenting with synthetic data, but how common is it in the overall data market?
A: Most forecasts currently peg this market at about $300M annually, and there are a few reasons it remains that small. The main one is that there just aren’t a lot of standards around creating synthetic data, so it’s hard to gauge the accuracy of the data being produced. The vendors that help companies create synthetic data each have their own way of evaluating quality, and as standards improve, I think we can expect this market to grow significantly.