
March 4, 2025
Last year, the UNHCR, the United Nations Refugee Agency, reached out to OpenDP looking to hire a consultant for six months to test the OpenDP libraries with their microdata. We shared the opportunity widely with the OpenDP Community and were pleased that Nitin Kohli from the University of California, Berkeley took on the assignment. Despite the challenges posed by the particular structure of humanitarian data, Nitin found success with a new approach built on existing OpenDP libraries.
Challenges
UNHCR is committed to open data and, since January 2020, has shared datasets anonymized with statistical disclosure control methods on its Microdata Library. This approach showed limitations, in particular for census data: reducing privacy risks to an acceptable level required sampling, which significantly diminishes the utility of the shared data since only a small subset can be published.
UNHCR is currently focused on sharing registration data: comprehensive datasets containing critical information about all displaced people registered with the organization. These datasets hold immense potential for university researchers, non-profits, NGOs, and other organizations to better understand the current state and historical trends of forced displacement, as they offer details on refugee demographics, assistance received, protection needs, skills, work experience, and other critical factors. By enabling access to such data, UNHCR can foster research and analysis that inform impactful policies and programs, ultimately benefiting displaced populations worldwide.
To overcome the limitations of the previous approach and explore innovative privacy technologies, UNHCR tested the application of differential privacy to registration data, with the primary objective of developing a generalized approach for releasing full-size, differentially private synthetic registration datasets across multiple countries. This approach aimed to preserve privacy while improving the utility of the released data, but it proved challenging due to the complex relational structure across multiple tables and the presence of mixed data types.
For a given country, the registration data consists of 12 tables stored in a relational database (shown above). When individuals register together with UNHCR, they are assigned to the same registration group (e.g., a family could be a registration group, or just a single individual); this information is contained in the Registration Group table. When a registration group receives assistance or entitlement cards, this event-level information is logged in the respective tables. Additional information about the individuals themselves (e.g., their age, gender, country of origin, time spent fleeing) is collected in the Individual table, and further information is collected in the child tables of the Individual table. For example, if an individual has worked multiple jobs, then multiple rows in the Work Experience table may be associated with them.
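To make this key structure concrete, here is a toy sketch in Python/pandas. The table and column names are illustrative placeholders, not UNHCR's actual schema:

```python
import pandas as pd

# Toy illustration of the parent/child key structure described above
# (names are made up): registration groups own individuals, and
# individuals can own multiple child rows such as work-experience records.
registration_group = pd.DataFrame({"group_id": [1, 2]})

individual = pd.DataFrame({
    "individual_id": [10, 11, 12],
    "group_id": [1, 1, 2],          # foreign key into registration_group
    "age_group": ["18-25", "26-40", "41-60"],
    "country_of_origin": ["A", "A", "B"],
})

work_experience = pd.DataFrame({
    "individual_id": [10, 10, 12],  # one individual may appear several times
    "occupation": ["teacher", "tailor", "farmer"],
})
```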
The multiplicity of records pertaining to individuals and registration groups posed challenges for synthesization:
- Given the nature of the data, it was not reasonable to merge everything into a single table, synthesize it, and split the output into synthetic tables afterwards. This meant that the synthesization strategy had to work over multiple tables while including primary and foreign keys in its output.
- Complicating matters, there is typically heterogeneity in the number of times each foreign key appears in a table, and this distribution is often right-skewed. This long-tailed behavior meant that synthesizing a separate sub-table for each foreign key was infeasible, as there were usually too few records to synthesize.
- Additionally, the computation time required to synthesize these tables varied from country to country, depending on the number of records in their tables. For example, some countries had thousands of records on individuals, while others had millions. This meant that the synthesization process had to be scalable (and ideally parallelizable) to facilitate computations on larger tables.
Approach
To overcome these challenges, Nitin used OpenDP's SmartNoise MST synthesizer as part of a larger approach to generate differentially private relational synthetic humanitarian microdata. The larger approach, called OSAT (Oversample-and-Trim), uses the occurrences of foreign keys in a table to determine how to include them in the final synthetic output. As such, the accuracy of within-table information is governed by the MST algorithm, while the accuracy of across-table information is governed by both the MST algorithm and OSAT's approach to assigning primary and foreign keys to the tables.
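The full details of OSAT are described in the public guidance linked at the end of this post. Purely as a hypothetical sketch of the oversample-and-trim idea (not Nitin's actual implementation; all names, and the use of a differentially private count distribution, are assumptions here), one could imagine assigning foreign keys to an oversampled synthetic child table by drawing per-parent row counts and trimming the excess:

```python
import numpy as np
import pandas as pd

def assign_foreign_keys(oversampled_child: pd.DataFrame,
                        synthetic_parent_ids: np.ndarray,
                        dp_count_probs: np.ndarray,
                        seed: int = 0) -> pd.DataFrame:
    """Hypothetical oversample-and-trim step: dp_count_probs[k] is a
    differentially private estimate of the probability that a parent
    record has k child rows."""
    rng = np.random.default_rng(seed)
    # Draw how many child rows each synthetic parent should receive.
    counts = rng.choice(len(dp_count_probs), size=len(synthetic_parent_ids),
                        p=dp_count_probs)
    total = int(counts.sum())
    # Trim the oversampled synthetic child rows down to the total demand...
    trimmed = oversampled_child.sample(n=min(total, len(oversampled_child)),
                                       random_state=seed).reset_index(drop=True)
    # ...and attach repeated parent ids as the foreign key column.
    trimmed["individual_id"] = np.repeat(synthetic_parent_ids, counts)[: len(trimmed)]
    return trimmed
```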
To implement the approach, Nitin relied on a technology stack that included Microsoft Azure + Databricks, Python + Spark, and OpenDP and Microsoft's SmartNoise MST synthesizer. He chose to build on top of the OpenDP libraries because they are open source and have a strong, active research and development community, which helps ensure the stability and resiliency of the code over time.
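For the within-table synthesis step, the MST synthesizer is exposed through the smartnoise-synth package. A minimal sketch of generating a synthetic table with it might look like the following; the file name and column list are illustrative assumptions rather than the project's actual configuration:

```python
import pandas as pd
from snsynth import Synthesizer

# Placeholder input and columns; the real pipeline ran over Spark on Databricks.
individuals = pd.read_csv("individual_table.csv")
columns = ["age_group", "gender", "country_of_origin"]

# Privacy parameters matching those reported for the Individual table below.
synth = Synthesizer.create("mst", epsilon=0.125, delta=1e-10, verbose=True)
synthetic = synth.fit_sample(individuals[columns])
synthetic.to_csv("synthetic_individual_table.csv", index=False)
```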
Results and Conclusion
Using this stack, Nitin generated differentially private synthetic tables for each country that included primary and foreign keys. He tested the accuracy of these tables by measuring the error between the one-way and two-way marginals within each table (within-table accuracy) and between the two-way marginals of joined tables (across-table accuracy). Many synthetic tables had high within- and across-table accuracy (on average), with the within-table accuracy typically exceeding the across-table accuracy.
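As an illustration of the kind of within-table check described above, one-way marginals of the real and synthetic tables can be compared with a mean absolute error. The function below is a generic sketch (the project's exact error metrics may differ); the same idea extends to two-way marginals and, after joining on the synthesized keys, to across-table accuracy:

```python
import pandas as pd

def one_way_marginal_mae(real: pd.DataFrame, synth: pd.DataFrame, column: str) -> float:
    """Mean absolute error between real and synthetic one-way marginals
    of a categorical column (illustrative accuracy check)."""
    real_dist = real[column].value_counts(normalize=True)
    synth_dist = synth[column].value_counts(normalize=True)
    # Align the two distributions so categories missing on one side count as 0.
    categories = real_dist.index.union(synth_dist.index)
    real_dist = real_dist.reindex(categories, fill_value=0.0)
    synth_dist = synth_dist.reindex(categories, fill_value=0.0)
    return float((real_dist - synth_dist).abs().mean())
```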
Relative to the statistical disclosure control methods mentioned above, Nitin also found that the new tables provided a stronger privacy guarantee (thanks to differential privacy) and higher levels of accuracy. For example, on an average-size dataset, the MST algorithm (with strong privacy parameters epsilon and delta) not only reduced the mean absolute error of the distribution over each categorical attribute but also enabled more categories to be released, thereby supporting more fine-grained use cases.
Individual table with (ε = 0.125, δ = 1×10⁻¹⁰)-DP at the individual level
Over the course of the project and utilizing different approaches, Nitin was able to use the OpenDP library and techniques from differential privacy to:
- Release full-size synthetic tables with strong privacy properties
- Release richer and more accurate data compared to previously used statistical disclosure control methods
- Approximately preserve distributional characteristics of the original data, thereby facilitating statistical tasks and downstream applications.
After going through internal vetting, the registration datasets for all available countries will be published on the UNHCR Microdata Library, where researchers will be able to request and download the data for their own work. Nitin also wrote public guidance on applying the same approach to similar datasets, so that other organizations can leverage the potential of differential privacy and the OpenDP libraries to safely share their data. The guidance can be freely downloaded at the following link: https://microdata.unhcr.org/index.php/synthetic-data.