Artificial Intelligence

Ask an FCAT Researcher: David Bracken on Synthetic Data

By Matt Ehlers | November 19, 2024

Generative AI models have been trained on enormous amounts of data scraped from the internet, but as new data becomes scarce, companies are increasingly experimenting with synthetic options.

FCAT researcher David Bracken focuses on New Business Foundations. He digs into the newest ideas that companies are leveraging to grow revenue and has a special interest in emerging technologies, including the ways in which customers will use them. Through his work, he has researched everything from the impact of memes on our culture to blockchain technologies and social ties in the digital age.

Lately, he has been exploring the opportunities and challenges surrounding synthetic data — a version of existing data that has been altered to remove private and/or personally identifying information.

Q: Why is synthetic data a hot topic right now?

A: The foundational generative AI models currently on the market have largely been trained on the enormous amount of data that companies have scraped from the internet. Now those companies are running out of new data to use, which has led to increasing experimentation with synthetic data as a way to ease the scarcity.

Synthetic data is not new. Autonomous-driving companies have been using it for some time, and interest picked up significantly when more stringent privacy laws were passed in Europe about six years ago, as companies began looking into whether synthetic data could help them get around some of these regulations. Generative AI has now triggered a new, growing wave of interest in the technology.

Q: Who is most interested in using synthetic data?

A: One of the main reasons that synthetic data is attractive — particularly to companies that are heavily regulated — is that some standard ways of scrubbing data can be reverse-engineered. They are not foolproof. So, organizations are interested in finding better approaches to strip out identifying factors, but in such a way that the data remains valuable for their purposes.

Synthetic data vendors can create new, fully anonymous datasets by training models on the statistical properties of the data without having them memorize any personal information.
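To make the idea of "training on the statistical properties" concrete, here is a minimal sketch in Python. It is a toy illustration and not how any particular vendor works: the two-column dataset (age, account balance), the function names, and the choice of a simple correlated bell-curve model are all assumptions made for the example. The model keeps only a handful of summary numbers, never the original rows, and new records are drawn from those numbers. A model this simple does not by itself guarantee anonymity.

```python
import math
import random
import statistics

def fit_summary_statistics(rows):
    """Learn only means, standard deviations, and the correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, then normalize to a correlation between -1 and 1.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mean_x, mean_y), "sd": (sd_x, sd_y), "corr": cov / (sd_x * sd_y)}

def sample_synthetic(model, n, seed=0):
    """Draw n new (x, y) records from the fitted statistics, not from the original rows."""
    rng = random.Random(seed)
    mean_x, mean_y = model["mean"]
    sd_x, sd_y = model["sd"]
    corr = model["corr"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - corr ** 2))
    synthetic = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        y = mean_y + sd_y * (corr * z1 + residual * z2)  # keeps the x-y correlation
        synthetic.append((x, y))
    return synthetic

# Hypothetical "real" records: (age, account balance). Made up for illustration.
real_rows = [(34, 5200.0), (45, 8100.0), (29, 3100.0),
             (52, 9900.0), (41, 7000.0), (38, 6100.0)]
model = fit_summary_statistics(real_rows)
synthetic_rows = sample_synthetic(model, n=1000)
print(len(synthetic_rows), "synthetic records generated")
```

Production tools use far richer models and add explicit privacy checks, but the division of labor is the same: fit a model to the data, then sample new records from the model.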

Q: Once they have the synthetic data, how do they apply it?

A: There are a lot of different use cases to consider. Synthetic data can be used where companies aren’t able to use traditional data, or where there isn’t enough data to do what they need to do. Fighting fraud is a good example. Organizations might not have much data on a particular kind of fraud, but they want to train their models so that their systems can identify it automatically. So, one option is to create synthetic data that looks like the fraud they are trying to catch, which helps their models get better at uncovering potentially fraudulent activity.
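One simple way to picture "synthetic data that looks like the fraud" is to blend pairs of known fraud records into new, in-between records. The Python sketch below is a simplified illustration under stated assumptions: the feature columns (transaction amount, hour of day), the sample values, and the function name are hypothetical, and the technique is a stripped-down relative of SMOTE-style oversampling (real SMOTE interpolates toward nearest neighbors rather than random pairs). It is not presented as the method any specific firm uses.

```python
import random

def synthesize_rare_cases(rare_rows, n_new, seed=0):
    """Create new rows by interpolating between random pairs of known rare-class rows."""
    if len(rare_rows) < 2:
        raise ValueError("need at least two rare examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(rare_rows, 2)   # two distinct known fraud records
        t = rng.random()                  # how far to move from a toward b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Hypothetical known fraud cases: (transaction amount, hour of day). Made up for illustration.
known_fraud = [(950.0, 2.0), (1200.0, 3.0), (870.0, 1.0), (1500.0, 4.0)]
extra_fraud = synthesize_rare_cases(known_fraud, n_new=200)

# The augmented set gives a fraud classifier more positive examples to learn from.
training_fraud = known_fraud + extra_fraud
print(len(training_fraud), "fraud examples available for training")
```

Because every new record lies between two real fraud records, the synthetic examples stay inside the range of behavior already observed. That is both the appeal and the limit of this approach: it cannot invent fraud patterns that the original examples do not hint at.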

Synthetic data can also be used for customer acquisition and onboarding, as well as software testing. Firms are exploring whether they can use synthetic data to help get their software to market faster, since it can give software engineers quicker access to the production-like data they need to move projects forward.

Q: Are there examples where synthetic data is not just faster but better?

A: A lot of traditional datasets are problematic because they are not representative of society or of the marketplace for a product. This can lead to biased analysis and decision-making. Synthetic data representing underrepresented groups can be added to help correct these imbalances.

Q: Where can synthetic data use go wrong?

A: Researchers have started to explore what happens when large language models are trained on significant amounts of synthetic data. Some of this research has found that synthetic data can cause the models to rapidly deteriorate — often referred to as model collapse.

Others have been exploring whether synthetic data can introduce more bias into AI models, and whether it might complicate efforts to understand and interpret how generative AI models make decisions.
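The intuition behind model collapse can be shown with a toy simulation. The Python sketch below is a deliberately simplified stand-in, not a reproduction of the research on large language models: the "model" is just a bell curve described by a mean and a spread, and the function name and parameters are assumptions for the example. Each generation is fit only to samples produced by the previous generation, so estimation errors compound, and with small samples the measured spread shrinks on average from one generation to the next.

```python
import random
import statistics

def simulate_recursive_training(generations, sample_size, seed=0):
    """Refit a bell-curve 'model' to its own samples and record the spread each generation."""
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    rng = random.Random(seed)
    mean, sd = 0.0, 1.0          # generation 0 stands in for the original, real data
    history = [sd]
    for _ in range(generations):
        # Train the next generation only on data sampled from the current one.
        data = [rng.gauss(mean, sd) for _ in range(sample_size)]
        mean = statistics.fmean(data)
        sd = statistics.pstdev(data)
        history.append(sd)
    return history

spread_by_generation = simulate_recursive_training(generations=200, sample_size=20)
print("spread at start:", spread_by_generation[0])
print("spread at end:  ", spread_by_generation[-1])
```

Any single run is random, so the spread can tick up between individual generations. Rerunning with different seeds or larger sample sizes shows how the amount of fresh information per generation affects the drift.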

Q: More companies may be experimenting with synthetic data, but how common is it in the overall data market?

A: Most of the forecasts currently peg this market at about $300M annually, which is still relatively small, and there are a few reasons for this. The main one is that there just aren’t a lot of standards around creating synthetic data, so it’s hard to gauge the accuracy of the data being produced. The vendors that help companies create synthetic data each have their own way of evaluating quality. As standards improve, I think we can expect this market to grow significantly.

This website is operated by Fidelity Center for Applied Technology LLC (FCAT®). FCAT experiments with and provides innovative products, services, content and tools, as a service to its affiliates and as a subsidiary of FMR LLC. Based on input and feedback, FCAT is better able to engage in technology research and planning for the Fidelity family of companies. Unless otherwise indicated, the information and items presented are provided by FCAT and are not intended to provide tax, legal, insurance or investment advice and should not be construed as an offer to sell, a solicitation of an offer to buy, or a recommendation for any security by any Fidelity entity or any third-party. Third-party trademarks and service marks are the property of their respective owners. All other trademarks and service marks are the property of FMR LLC or its affiliated companies.




This is for persons in the U.S. only.


245 Summer St, Boston MA

© 2008-2024 FMR LLC All rights reserved | FCATalyst.com

