Continuing the Conversation: Mayana Pereira’s presentation and tutorial on Enhancing Open Data for Social Good

Featured at the 2024 OpenDP Community Meeting, Mayana Pereira presented new work that explores the potential of OpenDP in enabling open data for social good. Focusing on the critical aspect of digital equity, she spoke about open datasets created by Microsoft’s AI for Good team. Her full recorded presentation can be found here.

The 2020 work used a dataset compiled to measure broadband usage at the ZIP code level across the United States. It was one of the first opportunities to use differential privacy methods (and the OpenDP Library) to provide insights into the accessibility and utilization of broadband services. This kind of data-driven reporting shed light on the digital divide between regions of the U.S., especially between urban and rural areas.

The success of the broadband project led to the upcoming Digital Applications Index dataset, set to be published by the Microsoft AI for Good team in 2024 (the paper on this work is under journal submission). Mayana explained how this differentially private dataset offers a unique perspective by presenting metrics on the usage of digital applications at the zip code level. By leveraging the OpenDP library, this dataset has the potential to empower researchers and policymakers to drive impactful change in areas such as economic development, education, and socio-economic analysis.

Using the OpenDP Library as a trusted resource allowed Microsoft to complete these projects in a privacy-preserving manner, and it has opened the door to more projects where data was previously deemed too sensitive to work with. Differentially private open datasets are usually published by highly specialized teams in big tech companies such as Microsoft, Google, Meta, and LinkedIn. OpenDP tools have the potential to simplify and speed up data publication.

Because of time constraints during her talk at the 2024 OpenDP Community Meeting, Mayana could not dive deeper into how new OpenDP functionalities, such as the additive noise mechanisms and the DP PCA recently integrated into the OpenDP Library, can democratize differentially private data publication.

In this post, we present a simple tutorial on adding differential privacy to the United States Broadband Usage Percentages Dataset. The tutorial goes through all the steps of a data release project: defining the data aggregations and computations in a non-private setting, identifying the points in the pipeline where data transformations such as clamping are necessary, and implementing the data pipeline with the appropriate differential privacy mechanism using OpenDP.

Utilizing OpenDP’s Additive Noise Mechanism to Create ZIP Code level aggregates for the United States Broadband Usage Percentages Dataset

The Broadband Coverage Estimates (BCE) dataset is derived from Microsoft’s telemetry data. For each ZIP code in the United States, it provides a differentially private estimate of the percentage of households that have access to the internet at broadband speed, i.e., at speeds of 25 Mbps or more.

In this tutorial we will describe how to calculate the differentially private estimates of broadband coverage.

Let’s first understand how to compute the non-private version of the BCE data.

For each ZIP code z, the following variables are present in Microsoft’s telemetry systems:

Count of Windows devices with speeds of 25 Mbps or more (H_z).
Count of Windows devices with speeds less than 25 Mbps.
Count of devices utilizing Microsoft Services in ZIP code z (M_z).
Count of devices not utilizing Microsoft Services in ZIP code z (O_z).

The BCE calculation for each zip code is outlined as follows:

BCE(z) = H_z \cdot \left(\frac{M_z}{M_z+O_z}\right)^{-1} \cdot \frac{1}{HUD_z}

where HUD_z is the number of households in that zip code, obtained from public datasets such as the HUD data set. The process of transforming the data present in Microsoft’s telemetry systems into the Broadband Coverage Estimates dataset is illustrated in Figure 1.

Figure 1: Process describing the generation of the Broadband Coverage estimates dataset.
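
The non-private BCE formula can be sketched in a few lines of Python. This is only an illustration: the function name `bce`, the argument names, and the example numbers are hypothetical and do not come from Microsoft's pipeline. The arguments mirror the symbols in the formula (`h_z` for H_z, `m_z` for M_z, `o_z` for O_z, `hud_z` for HUD_z).

```python
def bce(h_z: int, m_z: int, o_z: int, hud_z: int) -> float:
    """Non-private Broadband Coverage Estimate for one ZIP code.

    h_z   : Windows devices with speeds of 25 Mbps or more
    m_z   : devices utilizing Microsoft services
    o_z   : devices not utilizing Microsoft services
    hud_z : number of households (from public data)
    """
    if m_z <= 0 or hud_z <= 0:
        raise ValueError("m_z and hud_z must be positive")
    # Share of devices in the ZIP code that are visible to the telemetry.
    microsoft_share = m_z / (m_z + o_z)
    # Scale the observed count up to all devices, then divide by households.
    return h_z / microsoft_share / hud_z


# Hypothetical numbers, for illustration only.
estimate = bce(h_z=300, m_z=400, o_z=100, hud_z=500)
print(round(estimate, 4))
```

Dividing by the Microsoft share is the same as multiplying by its inverse, which is how the formula is written.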

We adapt the process to transform the BCE dataset into a differentially private dataset by following these steps:

1. Identify the Unit of Privacy: This data release ensures device-level privacy guarantees. While we recognize that device-level privacy may not fully correspond to individual privacy for every Windows user, it provides a reasonable level of protection for the typical user.
2. Define Privacy Loss: The total privacy loss for this data release is set to ε = 0.2, adhering to the principles of pure differential privacy.
3. Define Domain Descriptors: This data release will query data from two data sources. The first data source contains two columns: ZIP code and speed; and the second data source contains two columns: ZIP code and OS.
4. Define a Context: Sensitive data is identified and securely placed behind an OpenDP compositor, which mediates all access to the data and ensures compliance with differential privacy requirements.
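
To make the privacy loss concrete, here is a minimal plain-Python sketch of how a total budget of ε = 0.2 might translate into a Laplace noise scale. It rests on several assumptions that the text does not state: the budget is split evenly between one count query per data source, each device contributes to at most one count per query (sensitivity 1), and noise is Laplace-distributed. The sampler below is a floating-point illustration and is not the OpenDP implementation; a real release should rely on the library's own noise mechanisms.

```python
import random

TOTAL_EPSILON = 0.2  # total privacy loss stated for the release
N_QUERIES = 2        # assumption: one count query per data source
SENSITIVITY = 1      # assumption: a device changes one count by at most 1

# Sequential composition: per-query epsilons add up to the total.
epsilon_per_query = TOTAL_EPSILON / N_QUERIES
# Laplace mechanism: noise scale = sensitivity / epsilon.
laplace_scale = SENSITIVITY / epsilon_per_query


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as a difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(true_count: int, scale: float, rng: random.Random) -> float:
    """Return the count with Laplace noise added (illustration only)."""
    return true_count + laplace_noise(scale, rng)


rng = random.Random()
released = noisy_count(120, laplace_scale, rng)  # 120 is a made-up count
print(epsilon_per_query, laplace_scale, released)
```

Under these assumptions each query runs at ε = 0.1, so each count receives noise with scale 10. A different split of the budget would change the scale accordingly.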

We highlight step 3 in Figure 2.

Figure 2: Counts obtained from Microsoft’s telemetry can expose sensitive data.

To create a differentially private dataset, sensitive data is processed exclusively through the OpenDP compositor. One key transformation performed within the compositor is ensuring that the ZIP code domain descriptor in the dataset is not derived from the sensitive data itself but rather from a public data source. For this purpose, census data is used to define the domain descriptor for the ZIP codes included in the data release.
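
The idea of taking the ZIP code domain from a public source can be sketched as follows. This is a plain-Python illustration, not OpenDP code: the helper `counts_over_public_zips` and the example ZIP codes are hypothetical. The point is that the set of output keys depends only on the public list, so whether a ZIP code appears in the release reveals nothing about the sensitive records.

```python
from collections import Counter


def counts_over_public_zips(records, public_zips):
    """Count records per ZIP code over a fixed, public list of ZIP codes."""
    allowed = set(public_zips)
    # Records whose ZIP code is not in the public list are dropped.
    observed = Counter(z for z in records if z in allowed)
    # Every public ZIP code gets an entry, including those with no records.
    return {z: observed.get(z, 0) for z in public_zips}


# Hypothetical inputs, for illustration only.
public_zips = ["02138", "02139", "99999"]               # e.g. from census data
sensitive_zips = ["02138", "02138", "02139", "00000"]   # one entry per device

counts = counts_over_public_zips(sensitive_zips, public_zips)
print(counts)
```

In a differentially private release, noise would then be added to every one of these counts, including the zeros, before anything is published.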