What is Differential Privacy?


Differential privacy is a rigorous mathematical definition of privacy for statistical analysis and machine learning. In the simplest setting, consider an algorithm that analyzes a dataset and releases statistics about it (such as means and variances, cross-tabulations, or the parameters of a machine learning model). Such an algorithm is said to be differentially private if, by looking at the output, one cannot tell whether any individual's data was included in the original dataset or not. In other words, the guarantee of a differentially private algorithm is that its behavior hardly changes when a single individual joins or leaves the dataset: anything the algorithm might output on a database containing some individual's information is almost as likely to have come from a database without that individual's information. Most notably, this guarantee holds for every individual and every dataset. Therefore, regardless of how eccentric any single individual's details are, and regardless of the details of anyone else in the database, the guarantee of differential privacy still holds. This gives a formal assurance that individual-level information about participants in the database is not leaked. Differential privacy achieves this strong guarantee by carefully injecting random noise into the computation of the released statistics, so as to hide the effect of each individual.
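To make the noise-injection idea concrete, below is a minimal sketch of one classic differentially private mechanism, the Laplace mechanism, applied to releasing a mean. The dataset, the value bounds, and the choice of epsilon are illustrative assumptions for this sketch (it also treats the dataset size as public and considers neighboring datasets that differ in one person's record); it is not the only way to achieve differential privacy, and production systems typically rely on vetted libraries rather than hand-rolled noise.

```python
# A minimal sketch of the Laplace mechanism for a differentially private mean.
# Assumptions: values are clamped to a public range [lower, upper], the dataset
# size n is public, and neighboring datasets differ in one individual's record.
import numpy as np

def private_mean(data, lower, upper, epsilon):
    """Release the mean of `data`, clamped to [lower, upper], with epsilon-differential privacy."""
    n = len(data)
    clamped = np.clip(data, lower, upper)
    true_mean = clamped.mean()
    # Changing one individual's value moves the clamped mean by at most
    # (upper - lower) / n; this is the sensitivity of the statistic.
    sensitivity = (upper - lower) / n
    # Laplace noise with scale sensitivity / epsilon masks any one individual's effect.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_mean + noise

# Illustrative use: ages of 1,000 hypothetical survey respondents, released with epsilon = 1.
ages = np.random.randint(18, 90, size=1000)
print(private_mean(ages, lower=18, upper=90, epsilon=1.0))
```

The key design choice is that the noise scale depends only on the sensitivity of the statistic and the privacy parameter epsilon, not on the particular data values, so the same masking guarantee applies to every possible dataset and every individual in it.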

For more background on differential privacy and its applications, we recommend the book chapter by Alexandra Wood, Micah Altman, Kobbi Nissim, and Salil Vadhan, as well as the resources at https://differentialprivacy.org/resources/ and https://privacytools.seas.harvard.edu/courses-educational-materials.