Synthetic data has been gaining a lot of traction recently, but what is it, and how could it help your financial institution’s (FI’s) bottom line? We sat down with Randy Koch, CEO of ARM Insight, and Tim Sloane, VP of Payments Innovation at Mercator Advisory Group, to discuss how the executive suite often overlooks this new revenue opportunity.
ARM Insight is a leading provider of actionable insights from financial data and is safely and securely monetizing data for over 1,000 financial institutions through its innovative synthetic data process.
The Chief Data Officer
So who is the executive often put in charge of an FI’s data? Meet the Chief Data Officer (CDO). Koch sees the CDO as having three primary responsibilities:
- Security – keeping data secure from malicious intent
- Compliance – overseeing regulatory and contractually compliant data
- Monetization – turning data into a stream of revenue through the creation of new data-centered products
Koch tells us that CDOs tend to focus their time and resources on data security and compliance, but spend little time turning their data into an additional stream of revenue. Part of this tendency may stem from the perceived risk of a compliance violation: executives tend to view data as a single whole rather than as segments with different risk profiles. According to Koch, the C-suite must partition its data into levels of varying risk to ensure security and satisfy compliance.
The Three Types of Data
You may have seen this breakdown in ARM Insight’s Road Map to Safe Data Monetization. It sorts data into three general categories:
- Raw data with Personally Identifiable Information (PII)
- Anonymized data
- Synthetic data
The first type of data is “raw data,” and it is the riskiest because it contains personally identifiable information (PII). OK, but what is raw data? Think of what is collected during a card transaction: account numbers, names, addresses, timestamps, transaction amounts, etc. Together, this data constitutes a security and compliance risk because any breach or internal mishandling reveals personal information about an individual customer. Organizations understand that this data is risky and typically enforce strict regulatory and compliance requirements, often in collaboration with a compliance officer, to ensure it is handled properly.
The second form of data is anonymized. What is anonymized data? Think of the common social science technique of assigning pseudonyms to the individuals in a survey dataset: Joe becomes John and Sarah becomes Jane. This type of dataset is more secure than raw data, but it is not entirely protected from vulnerabilities. In a 2007 paper, researchers Arvind Narayanan and Vitaly Shmatikov demonstrated that Netflix Prize data could be combined with the Internet Movie Database (IMDb) to re-identify individual users and recover personal information about them. This de-anonymization attack showed that anonymized datasets are significantly vulnerable even when an attacker knows only a small amount about an individual. Such demonstrations have stimulated research into what Cynthia Dwork of Microsoft Research and Aaron Roth of the University of Pennsylvania term “differential privacy,” that is, “a promise, made by a data holder, or curator, to a data subject: ‘You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.’” This line of research has gained traction with private sector companies like Google and Apple, along with the U.S. Census Bureau.
For FIs, the datasets most pertinent to business strategy, such as transactional or behavioral data, are not made public to the extent of an internet database such as IMDb, but they are used by internal and external analytic firms, which poses a significant internal risk to compliance.
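To make the re-identification risk concrete, here is a minimal Python sketch of a linkage attack on pseudonymized card data. It is an illustration only: the records, merchant names, and the `candidates` helper are all hypothetical, and it uses exact matching, whereas the Narayanan and Shmatikov attack scores approximate matches over a much larger, sparser dataset.

```python
from datetime import date

# Hypothetical pseudonymized transactions: names are replaced by opaque IDs,
# but merchant and date are left intact.
pseudonymized = [
    {"user": "u1", "merchant": "CoffeeCo", "day": date(2020, 1, 3)},
    {"user": "u1", "merchant": "BookBarn", "day": date(2020, 1, 5)},
    {"user": "u2", "merchant": "CoffeeCo", "day": date(2020, 1, 3)},
    {"user": "u2", "merchant": "GasStop", "day": date(2020, 1, 6)},
    {"user": "u3", "merchant": "BookBarn", "day": date(2020, 1, 5)},
]

# Hypothetical outside knowledge about one person, for example gathered
# from public posts: where Joe shopped and on which days.
known_about_joe = [
    ("CoffeeCo", date(2020, 1, 3)),
    ("BookBarn", date(2020, 1, 5)),
]


def candidates(records, known_events):
    """Return the pseudonyms whose history contains every known event."""
    history = {}
    for record in records:
        event = (record["merchant"], record["day"])
        history.setdefault(record["user"], set()).add(event)
    wanted = set(known_events)
    return sorted(user for user, events in history.items() if wanted <= events)


# One known purchase leaves two candidates; two purchases leave only one.
one_event = candidates(pseudonymized, known_about_joe[:1])
two_events = candidates(pseudonymized, known_about_joe)
print(one_event, two_events)
```

Once the pseudonym is pinned to a single person, every other transaction under that pseudonym is exposed, which is why replacing names alone does not remove the compliance risk.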
The third form of data is termed synthetic data. This data type is the most secure because it contains no PII and offers no way of reconstructing PII. As the term suggests, the data is completely “artificial” in the sense that the newly created synthetic dataset cannot be traced back to the original, even by those doing the statistical analysis. Such a data type may be readily packaged and sold as a new product without the compliance or security risk inherent in the other types. Sloane explains the use case for monetization: “This data can now be released and run by third parties. There’s no PII data and there’s nothing that can trace it back.” With regulations such as GDPR, CCPA, and GLBA weighing on the industry, a synthetic dataset gives executives a novel way to impress shareholders with a new revenue stream while ensuring compliance.
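As a rough illustration of the idea, the Python sketch below fits simple statistics to a handful of made-up raw transactions and then samples brand-new records from those statistics. This is not ARM Insight’s process, which the article does not describe; the field names, the `fit` and `sample` helpers, and the choice of a per-category normal distribution are all assumptions. A naive sampler like this also carries no formal privacy guarantee on its own.

```python
import random
import statistics
from collections import Counter

# Hypothetical raw records containing PII (name, account number).
real = [
    {"name": "Joe", "account": "1111", "category": "grocery", "amount": 54.20},
    {"name": "Joe", "account": "1111", "category": "rideshare", "amount": 18.75},
    {"name": "Sarah", "account": "2222", "category": "grocery", "amount": 61.05},
    {"name": "Sarah", "account": "2222", "category": "grocery", "amount": 47.90},
    {"name": "Ann", "account": "3333", "category": "rideshare", "amount": 22.40},
]


def fit(records):
    """Learn category frequencies and per-category amount mean and spread.

    Only aggregate statistics are kept; names and accounts are never read.
    """
    counts = Counter(r["category"] for r in records)
    stats = {}
    for category in counts:
        amounts = [r["amount"] for r in records if r["category"] == category]
        stats[category] = (statistics.mean(amounts), statistics.pstdev(amounts))
    return counts, stats


def sample(model, n, seed=0):
    """Draw n artificial transactions from the fitted statistics."""
    counts, stats = model
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    categories = list(counts)
    weights = [counts[c] for c in categories]
    rows = []
    for _ in range(n):
        category = rng.choices(categories, weights=weights)[0]
        mean, spread = stats[category]
        amount = max(0.01, rng.gauss(mean, spread))  # keep amounts positive
        rows.append({"category": category, "amount": round(amount, 2)})
    return rows


synthetic = sample(fit(real), n=1000)
print(synthetic[:3])
```

The synthetic rows have no name or account field and are drawn from aggregates rather than copied from any customer, yet they preserve the overall category mix and spending levels that an analyst or a machine learning model would work with.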
Synthetic data means informed decision-making
Let’s say that an organization cleans its data and develops a synthetic dataset to sell to retailers. From a competitive analysis standpoint, a retailer may want to know its position relative to competitors for a specific generational segment or, from a behavioral standpoint, whether the consumer is transacting online instead of at its brick-and-mortar location. From a card company perspective, synthetic data can help FIs with the “top of wallet” issue. For example, Koch tells us that machine learning algorithms trained on synthetic data show that consumers who use a certain card for recurring transactions at places like PayPal, Lyft, Uber and food delivery services will most likely keep that card at the top of their wallet. Insights like these can alter corporate strategy and have a significant influence on the bottom line.
Data is everywhere, and yes, there are many rules about data, especially PII. Originators, CDOs and third parties must always work to ensure that customer data is safe and secure, but that does not mean that data cannot be leveraged beyond its original purpose. As Koch says, “Once you have the synthetic data created, with its own data set, that should be fully focused on adding value to the shareholders by creating new revenue streams, new products and running machine learning and AI on top of it. You are now able to take care of the security and risk, but at the same time be very aggressive at how to monetize data and create new products by using synthetic data.”
Maybe it is time to refocus your lens and take another look at the data.