Announcing OpenDP Library 0.14

We’re happy to announce v0.14 of the OpenDP Library!

The OpenDP Library is a modular collection of algorithms for building privacy-preserving applications.

This release has a number of features that make common analyses easier and more idiomatic, including identifier truncation, synthetic data generation, and linear regression, as well as enhancements to the framework like odometers and additions to the suite of core differentially private mechanisms.

We want to hear from you! The OpenDP Library is improved by your feedback and contributions, so please let us know what you think via our Slack channel or by emailing us directly at info@opendp.org!

Fully Adaptive Composition

The OpenDP Library now supports fully adaptive composition, where both the mechanisms and privacy loss parameters are chosen adaptively. This is in contrast to adaptive composition, where the privacy parameters are fixed prior to beginning the analysis. As a result, the OpenDP Context API can now be initialized with just your data, privacy unit, and privacy loss: the number of queries no longer needs to be specified up-front. You then specify the privacy loss (or expected utility) for each query as you go, and you can check the accumulated privacy loss at any time.

Hierarchical Queries and Synthetic Data

OpenDP v0.14.0 also adds support for hierarchical queries and synthetic data. This functionality is made available in a new ContingencyTable API, which approximates the counts of records when grouped by all columns in the data. A ContingencyTable can be projected to marginal counts over a subset of grouping columns, or used to generate synthetic data. OpenDP estimates lower-order marginals via Polars, and then post-processes the marginals with mbi (private-pgm) to fit a joint distribution over all columns. A ContingencyTable can be adaptively updated to incorporate additional marginals from a fixed query workload, AIM, or MST.
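To make the projection step concrete: projecting a contingency table onto a subset of columns amounts to summing the cells that agree on the kept columns. The following is a minimal sketch in plain Python, not the ContingencyTable API itself; the `project` helper, the column names, and the counts are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def project(table, columns, keep):
    """Sum the cells of a full contingency table down to the `keep` columns."""
    idx = [columns.index(c) for c in keep]
    marginal = defaultdict(float)
    for cell, count in table.items():
        # Cells that agree on the kept columns collapse into one marginal cell.
        marginal[tuple(cell[i] for i in idx)] += count
    return dict(marginal)


# Hypothetical (already noisy) counts over the columns (sex, employed).
columns = ["sex", "employed"]
table = {
    ("F", True): 40.0,
    ("F", False): 12.0,
    ("M", True): 35.0,
    ("M", False): 13.0,
}
by_sex = project(table, columns, ["sex"])
by_employment = project(table, columns, ["employed"])
```

Because the projection only touches counts that have already been released, it is post-processing and would not consume additional privacy budget.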

Thanks to Maxine Park for implementing AIM and Shlomi Hod for implementing MST.

Identifier Truncation

A DP analysis begins by identifying the unit of privacy. Formerly, when working with microdata, you were required to specify the maximum number of rows that any individual could contribute to a dataset. In practice, this number may not be known, or a few outliers might have outsized contributions, forcing a loose bound and therefore much more noise. Now there’s a better option if you have an identifier column: you can specify that identifier column in your unit of privacy, and then preprocess the data in your query to limit the contributions from each individual. Read more about identifier truncation in the docs.
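The truncation idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the OpenDP or Polars API: the `truncate_per_id` helper, the `user_id` column, and the sample rows are hypothetical, and keeping the first rows per identifier is an arbitrary choice.

```python
from collections import defaultdict


def truncate_per_id(rows, id_key, max_rows):
    """Keep at most `max_rows` rows for each identifier, in input order."""
    seen = defaultdict(int)
    kept = []
    for row in rows:
        if seen[row[id_key]] < max_rows:
            seen[row[id_key]] += 1
            kept.append(row)
    return kept


# Hypothetical microdata: user "a" contributes three rows, user "b" one.
rows = [
    {"user_id": "a", "spend": 10},
    {"user_id": "a", "spend": 20},
    {"user_id": "a", "spend": 30},
    {"user_id": "b", "spend": 5},
]
truncated = truncate_per_id(rows, "user_id", max_rows=2)
```

After truncation, every individual contributes at most `max_rows` rows, so downstream sensitivity can be calibrated to that bound instead of to the largest outlier.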

Subsample and Aggregate

OpenDP now supports group-by and agg transformations in Polars pipelines, under the subsample and aggregate framework. OpenDP typically only allows stable transformations to be used, but under subsample and aggregate, expressions need only be infallible, and thus many more expressions can be used. Read more about subsample and aggregate in the docs.
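As a rough illustration of the subsample and aggregate idea, the sketch below splits the data into disjoint blocks, evaluates an arbitrary statistic on each block, clamps the results, and releases a noisy mean. It is not the OpenDP implementation. It assumes each individual contributes one row, that neighboring datasets differ by replacing one row, and that the bounds are public; all function names are hypothetical, and the floating-point Laplace sampler is not safe for real releases.

```python
import random
import statistics


def sample_laplace(rng, scale):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    # Floating-point sampling like this is NOT safe for real releases.
    if scale == 0:
        return 0.0
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def subsample_and_aggregate(data, statistic, num_blocks, bounds, epsilon, rng):
    """Noisy mean of a black-box statistic evaluated on disjoint blocks."""
    lower, upper = bounds
    # Replacing one row changes exactly one block (requires len(data) >= num_blocks).
    blocks = [data[i::num_blocks] for i in range(num_blocks)]
    # The statistic only needs to return a value; clamping bounds its influence.
    estimates = [min(max(statistic(b), lower), upper) for b in blocks]
    # One changed block moves the mean of the estimates by at most this much.
    sensitivity = (upper - lower) / num_blocks
    return statistics.fmean(estimates) + sample_laplace(rng, sensitivity / epsilon)


release = subsample_and_aggregate(
    list(range(100)), statistics.median, 10, (0.0, 100.0),
    epsilon=1.0, rng=random.Random(0),
)
```

The statistic is treated as a black box here, which is why stability of the expression itself is not needed: the privacy argument rests only on the partitioning and the clamping.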

Polars 0.50

OpenDP 0.14 now supports Polars 0.50. The biggest new feature is a rewritten streaming engine for out-of-core computations. This rewritten streaming engine supports expressions present in common DP data processing pipelines.

Linear Regression

We’ve also added univariate linear regression to the scikit-learn-style API. This was present as an example in the documentation before, but we have standardized the interface and incorporated it into the core library.

Core Mechanisms

Odometers and Privacy Filters

The OpenDP Library now has three fundamental computing abstractions: transformations, measurements, and odometers. Fully adaptive composition is one example of a constructor for an odometer. An odometer can be converted into a measurement by enforcing an upper bound on the privacy loss, thus making a privacy filter. When an odometer or privacy filter is invoked with data, a queryable is spawned that accumulates the privacy loss as an analyst submits queries (in a manner similar to adaptive composition). Privacy filters and odometers can be used as building blocks for differentially private algorithm implementations. An example of this is the AIM algorithm for synthetic data, which internally uses a privacy filter to accumulate privacy loss over adaptively chosen marginals.
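The relationship between an odometer and a privacy filter can be sketched with a toy accountant. This is a conceptual illustration only, assuming pure-DP with losses that simply add up; the `Odometer` and `PrivacyFilter` classes below are hypothetical and are not the OpenDP types.

```python
class Odometer:
    """Tracks cumulative privacy loss; no cap and no preset number of queries."""

    def __init__(self):
        self.spent = 0.0

    def charge(self, epsilon):
        if epsilon < 0:
            raise ValueError("epsilon must be non-negative")
        self.spent += epsilon
        return self.spent


class PrivacyFilter(Odometer):
    """An odometer that refuses any query that would exceed a fixed bound."""

    def __init__(self, bound):
        super().__init__()
        self.bound = bound

    def charge(self, epsilon):
        if self.spent + epsilon > self.bound:
            raise RuntimeError("privacy budget exceeded")
        return super().charge(epsilon)


# Each query picks its own epsilon as the analysis proceeds.
filt = PrivacyFilter(bound=1.0)
filt.charge(0.25)
filt.charge(0.5)
try:
    filt.charge(0.5)  # would bring the total to 1.25 > 1.0
    rejected = False
except RuntimeError:
    rejected = True
```

The odometer only reports how much has been spent so far, while the filter turns that running total into a guarantee by rejecting the query that would cross the bound.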

Canonical Noise

OpenDP now supports an additive noise mechanism that samples from the canonical noise distribution. The canonical noise distribution allows for the construction of uniformly most powerful (UMP) tests for binary data (as in the case of counting queries). Read more in the documentation.

Thanks to Aishwarya Ramasethu, Ruby Ku and Jordan Awan for their contributions.

Thresholded Noise Mechanisms

A “thresholded” noise mechanism is a noise mechanism that is applied over pairs of sensitive keys and numeric values. The mechanism adds noise to the numeric values, and only retains pairs whose noisy value exceeds a given threshold. This is the underlying mechanism for stability histograms. OpenDP now supports both integer and floating-point values, as well as Laplace and Gaussian noise. The distance between adjacent input datasets is now represented as a triple containing the L0, Lp and L-infinity sensitivities. Read more in the documentation.
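The shape of a thresholded noise mechanism can be sketched as follows. This is a simplified illustration with Laplace noise and hypothetical names, not the make_laplace_threshold implementation; it does not show how the scale and threshold are calibrated to a privacy guarantee, and the floating-point sampler is not safe for real releases.

```python
import random


def sample_laplace(rng, scale):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    # Floating-point sampling like this is NOT safe for real releases.
    if scale == 0:
        return 0.0
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_threshold(counts, scale, threshold, rng):
    """Add Laplace noise to each value; release only pairs above the threshold."""
    released = {}
    for key, count in counts.items():
        noisy = count + sample_laplace(rng, scale)
        if noisy > threshold:
            released[key] = noisy
    return released


# Hypothetical counts keyed by a sensitive category.
counts = {"a": 50, "b": 2, "c": 11}
released = laplace_threshold(counts, scale=1.0, threshold=10.0, rng=random.Random(0))
```

Rare keys such as "b" are very unlikely to survive the threshold, which is what keeps the set of released keys from revealing the presence of a single individual.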

Permute and Flip

OpenDP now has a make_noisy_max mechanism that can be parameterized with a privacy measure. Gumbel noise is added to scores when parameterized with zCDP, and exponential noise is effectively added to scores when parameterized with pure-DP. The permute-and-flip mechanism is equivalent to report noisy max with exponential noise, but it is entirely discrete and thus much more computationally efficient than simulating continuous exponential distributions. For this reason, the noisy max mechanism under pure-DP executes the permute-and-flip mechanism.
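For readers unfamiliar with permute-and-flip, the sketch below shows the textbook procedure: visit the candidates in random order and accept each one with a probability that decays exponentially in its gap to the best score. It is a floating-point illustration with hypothetical names, not the exact discrete sampler used by the library, and it uses the general 2 * sensitivity denominator.

```python
import math
import random


def permute_and_flip(scores, epsilon, sensitivity, rng):
    """Return the index of a candidate chosen by permute-and-flip."""
    best = max(scores)
    order = list(range(len(scores)))
    rng.shuffle(order)
    for i in order:
        # Acceptance probability is 1 for a best-scoring candidate,
        # so the loop always returns before it is exhausted.
        accept = math.exp(epsilon * (scores[i] - best) / (2 * sensitivity))
        if rng.random() < accept:
            return i
    raise AssertionError("unreachable: a best-scoring candidate is always accepted")


scores = [1.0, 5.0, 3.0]
choice = permute_and_flip(scores, epsilon=1.0, sensitivity=1.0, rng=random.Random(0))
```

With a very large epsilon the acceptance probability of every non-maximal candidate underflows to zero, so the procedure degenerates to an exact argmax.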

Thanks to Tudor Cebere for contributing an implementation and proof.

Report Noisy Top K

Similarly, OpenDP now also has a make_noisy_top_k mechanism that returns the indices of the top k largest inputs. When parameterized with zCDP, the mechanism executes in one shot, avoiding a linear-time computational overhead in the number of selected indices.
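The one-shot approach can be illustrated as follows: add independent Gumbel noise to every score once, then take the indices of the k largest noisy scores, instead of running a selection mechanism k times in sequence. This is a floating-point sketch with hypothetical names, not the make_noisy_top_k implementation, and it leaves the calibration of `scale` to the privacy parameters unspecified.

```python
import math
import random


def sample_gumbel(rng, scale):
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    # Inverse-CDF sampling of a Gumbel(0, scale) variate.
    return -scale * math.log(-math.log(u))


def noisy_top_k(scores, k, scale, rng):
    """Indices of the k largest scores after adding Gumbel noise once."""
    noisy = [s + sample_gumbel(rng, scale) for s in scores]
    ranked = sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)
    return ranked[:k]


scores = [3.0, 9.0, 1.0, 7.0]
top2 = noisy_top_k(scores, k=2, scale=1.0, rng=random.Random(0))
```

Noise is drawn once per candidate regardless of k, which is where the savings over repeated selection come from.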

Parametrizable Privacy Definitions

By convention, many OpenDP mechanisms can be parametrized by the privacy definition to facilitate building algorithms that can themselves be parametrized. To this end, we’ve added the make_noise and make_noise_threshold constructors that switch between Laplace and Gaussian noise depending on the choice of privacy definition. Other APIs following this convention include make_noisy_max, make_noisy_top_k, make_private_lazyframe, make_private_quantile and make_user_measurement.

The functionality of OpenDP 0.14 is tightly integrated: a single release from a synthetic data mechanism may use identifier truncation, privacy filters, hierarchical queries, the updated thresholded noise mechanisms, the updated private selection mechanisms, and parametrizable privacy definitions, all through one simple API.

Privacy Proofs

One of the ways the OpenDP Project ensures that mechanisms in the library are trustworthy is by writing privacy proofs. In this update, we have vastly expanded coverage of proofs for core mechanisms in the library. All additive noise mechanisms, thresholded noise mechanisms, report noisy top k mechanisms, randomized response mechanisms and compositors now have privacy proofs, and we have expanded and updated the proof documents throughout the OpenDP-Polars integration.

PCA Privacy Fix

The mechanism for eigenvector release (used by PCA) had an incorrectly implemented rejection sampler. This has been fixed in this release.

Thanks to our summer intern Rita Ionides for identifying and fixing this vulnerability.

Porting to 0.14

Renamed compositor APIs:
- make_composition: previously make_basic_composition
- make_adaptive_composition: previously make_sequential_composition
Report noisy max (RNM) Gumbel renaming and privacy measure parametrization:
- make_noisy_max: previously make_report_noisy_max_gumbel
API changes:
- make_private_quantile: parametrized with output_measure
- make_laplace_threshold: input_metric is now L0PInfDistance
- make_gaussian_threshold: input_metric is now L0PInfDistance
Renamed dp.polars.Margin kwargs:
- public_info → invariant
- max_partition_length → max_length
- max_num_partitions → max_groups

All these changes and more are summarized in the CHANGELOG. Check out the new functionality and let us know what you think!