Exploiting Data Network Effect Securely through Data Clean Rooms
In a mall, all shops contribute to its overall footfall. This is called the network effect. Similarly, in the data context, every entity that generates data contributes more value to the network. To exploit the data network effects in an industry, we must:
- Upload data to the cloud
- Make this data available to others for analysis
- Ensure data privacy and protection of personal information
Today, most companies are adopting public clouds to leverage this effect.
Snowflake Data Clean Room
Snowflake has created a Cloud Data Platform for data commerce. This platform gives users access to data within or outside accounts through:
- Role-based access control
- Row-level security
- Column data masking
Snowflake replicates this data across regions and cloud providers. Users access read replicas in their own region and public cloud, which reduces network latency. Data does not move outside an organization’s boundaries, so it is easier to manage and govern.
Enterprises spend 4 or 5 dollars on services for every dollar spent on Snowflake. The bulk of this is on human resources. Let us start by setting ourselves a goal of doing better on the cost front. Snowflake works on pay-per-use. If you do not use it, you do not pay.
A Snowflake service partner can help businesses minimize usage and identify the use cases that bring the most value. Let me elaborate with an example from advertising. Every business needs to advertise on some medium, and this is an era of mass personalization. However, protecting customer privacy and complying with regulations are just as important.
Unleashing data network effects in advertising
Customers transact on the internet with multiple parties. They want these parties to know them well enough to personalize their experience. However, customers may object if details of their transactions are given to third parties, i.e., parties not involved in the transaction.
Traditionally, customers were identified by placing cookies in the browser. With growing emphasis on customer privacy, Google announced plans to phase out third-party cookies in the Chrome browser. Regulations on how personal information must be handled are also becoming stricter. Data clean room solutions are emerging as one of the most popular privacy-enhancing technologies for data sharing and collaboration.
Let’s say I am a Disney customer. I will most likely associate with a particular advert, and Disney determines this association based on my prior usage. I don’t want third parties to know what I do on the Disney application. Whether I watch only adult content or only cartoons is no one else’s business.
That said, it is Disney’s business to maximize revenue from adverts. In this example, Disney has data about every customer’s:
- Favorite show
- Maximum association towards available adverts.
Can Disney share this data in near real-time without showing it?
What do we mean by sharing data without showing it? We mean limiting or controlling the questions that can be asked of the data. If only approved questions are answered, the data is as good as hidden, because the raw data stays encrypted.
Let us say unencrypted data looks like what is shown in the table below. My record is one row. Imagine a million rows for other Disney customers. Let’s share this dataset.
Name | Favorite Show | Max Association with Ad |
Sumukh (Me) | Baywatch | Nike Shoes |
The data set owner (Disney) can restrict the questions (queries) that are asked. Suppose the question is: how many customers
- Have Baywatch as their favorite show, AND
- Have the maximum association with the Nike Shoes advert?
Then Disney can do either of the following:
- Allow the question as is
- Allow the question but attach conditions, such as revealing the answer only if the count is more than fifty. This helps avoid re-identification.
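The threshold condition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Customer` record, the `count_segment` function, and the synthetic rows are hypothetical names invented here, not part of any Snowflake API. In a real clean room, the same check would sit on the data owner’s side, so the requester only ever sees the released count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Hypothetical record layout mirroring the table above."""
    name: str
    favorite_show: str
    top_ad: str


# The text reveals an answer only if the count is more than fifty.
MIN_COUNT = 50


def count_segment(customers, favorite_show, top_ad, min_count=MIN_COUNT):
    """Return the segment size, or None when it is too small to release."""
    count = sum(
        1
        for c in customers
        if c.favorite_show == favorite_show and c.top_ad == top_ad
    )
    return count if count > min_count else None


# Synthetic rows for illustration only.
customers = [Customer(f"user{i}", "Baywatch", "Nike Shoes") for i in range(1000)]
customers += [Customer(f"fan{i}", "Cartoons", "Lego") for i in range(10)]

large = count_segment(customers, "Baywatch", "Nike Shoes")
small = count_segment(customers, "Cartoons", "Lego")
print(large)  # 1000: large enough to release
print(small)  # None: only 10 customers, so the answer is suppressed
```

Returning nothing for small segments, rather than the small number itself, is the design choice that matters here: a count of one or two could point back at an individual customer.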
Let’s say a third-party advertiser asks the allowed question and the answer is 1000 entries (that is, customers who watch Baywatch and have an affinity towards Nike’s shoe ad). The advertiser may then:
- Be willing to pay a premium to show ads to these 1000 customers
- Define success as a customer visiting the store within five days
Disney will play the ad for 1000 customers like me. Three days later, if I visit the brand’s store, the brand will know that I visited its store, but it won’t know if I saw the ad or not.
In Disney’s data set of ads shown, my name will exist, along with a thousand other names. In the brand’s data set of store visitors, my name will exist along with, say, 2000 others. We ask only for a count of the names that appear in both data sets, and both parties allow that question. We then know how many people saw the ad and came to the store, but not who they were.
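The overlap question can be sketched as follows. This is a minimal illustration, assuming both data sets are visible only to a function running inside the clean room; `overlap_count`, the name normalisation step, and the synthetic names are hypothetical, and the size of the overlap is made up for the example.

```python
def overlap_count(ad_shown, store_visitors, min_count=50):
    """Count names present in both data sets and release only that count."""
    # Normalise keys so "Sumukh" and " sumukh " match (an assumed cleaning step).
    shown = {name.strip().lower() for name in ad_shown}
    visited = {name.strip().lower() for name in store_visitors}
    matched = len(shown & visited)
    # Reuse the threshold idea: very small overlaps are not released.
    return matched if matched > min_count else None


# Synthetic data: me plus 1000 others saw the ad,
# me plus 2000 others visited the store.
ad_shown = ["Sumukh"] + [f"viewer{i}" for i in range(1000)]
store_visitors = (
    ["Sumukh"]
    + [f"viewer{i}" for i in range(119)]
    + [f"walkin{i}" for i in range(1881)]
)

visits_after_ad = overlap_count(ad_shown, store_visitors)
print(visits_after_ad)  # 120 with this synthetic data
```

The function returns a single number and never the matched names, which is what lets both parties learn how well the campaign worked without learning who responded to it.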
Advertisers will not know if I watched Baywatch or the Nike ad on Disney’s platform. Disney will not know if I visited a Nike store. Advertisers can run a targeted campaign on a segment and measure the effectiveness of that campaign. In this solution, we didn’t reveal any personal information and still got joint insights from the data. Let’s examine how we will implement this.
Implementing the data clean room solutions
Forbes says that every company is a software company. What does this mean for Snowflake service partners like us? We create software to implement a specific solution at scale, and we provide services as software. You read that right: service as software, not software as a service. This comprises the following.
- An application to configure Snowflake accounts as a distributed clean room, as in the case above.
- A self-service, business-friendly user interface for setting up rules. Which column to show or hide? Which column to aggregate (count of, sum of, mean of, etc.)? Which column to use as a common key between one data set and another? In our example, we cannot show the column “Name”. We can reveal the column “Favorite show” and the Aggregate Count of “Name.” We can join the two datasets on the column “Name.” We will set up these rules using a business-friendly interface. These rules will translate into query templates in Snowflake. This will scale as we can change or add rules without going to the IT department.
- Alerts/messages about datasets shared and linked rules through SMS or email.
- Stored procedures to validate query requests against the rules: once before we send the request and once when it is received.
- Snowflake’s data masking to mask data when necessary, for example showing only two digits of a phone number and masking the rest.
- An audit dashboard for showing which queries ran, who ran them, and when.
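To make the components above more concrete, here is a rough Python sketch of rule validation, masking, and audit logging. It is an illustration only and not LTIMindtree’s implementation: `Rules`, `QueryRequest`, `validate`, `mask_phone`, `submit`, and `audit_log` are hypothetical names, the choice to keep the last two digits of a phone number is an assumption (the text only says two digits stay visible), and in Snowflake these checks would live in stored procedures and masking policies rather than in application code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Rules:
    visible: frozenset     # columns that may be shown as-is
    aggregates: frozenset  # allowed (function, column) pairs
    join_keys: frozenset   # columns usable as a common key


@dataclass(frozen=True)
class QueryRequest:
    requester: str
    select: tuple          # plain columns requested
    aggregates: tuple      # (function, column) pairs requested
    join_on: Optional[str] = None


def validate(request, rules):
    """Return a list of rule violations; an empty list means allowed."""
    problems = []
    for col in request.select:
        if col not in rules.visible:
            problems.append(f"column not visible: {col}")
    for func, col in request.aggregates:
        if (func, col) not in rules.aggregates:
            problems.append(f"aggregate not allowed: {func}({col})")
    if request.join_on is not None and request.join_on not in rules.join_keys:
        problems.append(f"join key not allowed: {request.join_on}")
    return problems


def mask_phone(phone, visible_digits=2):
    """Mask every digit except the last `visible_digits` (assumed choice)."""
    total = sum(ch.isdigit() for ch in phone)
    seen = 0
    out = []
    for ch in phone:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total - visible_digits else "*")
        else:
            out.append(ch)  # keep spaces and punctuation as they are
    return "".join(out)


audit_log = []  # feeds the audit view: who asked what, when, and the outcome


def submit(request, rules):
    """Validate a request and record it in the audit log either way."""
    problems = validate(request, rules)
    audit_log.append({
        "who": request.requester,
        "when": datetime.now(timezone.utc).isoformat(),
        "request": request,
        "allowed": not problems,
    })
    return problems


# Rules from the example: hide "Name", show "Favorite Show",
# allow a count of "Name", and allow joining on "Name".
rules = Rules(
    visible=frozenset({"Favorite Show"}),
    aggregates=frozenset({("count", "Name")}),
    join_keys=frozenset({"Name"}),
)
ok_problems = submit(
    QueryRequest("advertiser", ("Favorite Show",), (("count", "Name"),), "Name"),
    rules,
)
bad_problems = submit(QueryRequest("advertiser", ("Name",), ()), rules)
print(ok_problems)   # []
print(bad_problems)  # ['column not visible: Name']
print(mask_phone("98765 43210"))  # ***** ***10
```

The same `validate` function could run on both sides, once before the request is sent and once when it is received, so that neither party has to trust the other’s check.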
LTIMindtree has built a Streamlit application that can, in the future, run natively on Snowflake to implement the above solution at scale.
LTIMindtree’s clean room solution
If you do not have data in Snowflake and want to upload it from your existing data warehouse to Snowflake, LTIMindtree has a solution for that as well.