Machine Learning with Differentially Private Labels: Mechanisms and Frameworks

Authors: Xinyu Tang (Princeton University), Milad Nasr (University of Massachusetts Amherst), Saeed Mahloujifar (Princeton University), Virat Shejwalkar (University of Massachusetts Amherst), Liwei Song (Princeton University), Amir Houmansadr (University of Massachusetts Amherst), Prateek Mittal (Princeton University)

Volume: 2022
Issue: 4
Pages: 332–350
DOI: https://doi.org/10.56553/popets-2022-0112


Abstract: Label differential privacy is a relaxation of differential privacy for machine learning scenarios in which the labels are the only sensitive information in the training data. For example, imagine a survey of students in a university class about their vaccination status: some attributes of the students are publicly available, but their vaccination status is sensitive and must remain private. If we want to train a model that predicts whether a student has been vaccinated using only their public information, we can use label-DP. Recent works on label-DP obtain label-DP models by adding noise to the labels in different ways. In this work, we present novel techniques for training models with label-DP guarantees by leveraging unsupervised and semi-supervised learning, enabling us to inject less noise for the same privacy guarantee and therefore achieve a better utility-privacy trade-off. We first introduce a framework that starts with an unsupervised classifier f0 and a dataset D with noisy label set Y, reduces the noise in Y using f0, and then trains a new model f on the less noisy dataset. Our noise-reduction strategy uses the model f0 to remove the noisy labels that are incorrect with high probability; we then use semi-supervised learning to train a model on the remaining labels. We instantiate this framework with multiple ways of obtaining the noisy labels as well as the base classifier. As an alternative way to reduce the noise, we explore unsupervised learning: we add noise only to a majority-voting step that associates each learned cluster with a label (as opposed to adding noise to individual labels); the reduced sensitivity enables us to add less noise. Our experiments show that these techniques can significantly outperform prior work on label-DP.
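The noisy labels that such pipelines start from are typically produced by K-ary randomized response, the standard label-DP mechanism. As a concrete reference point, here is a minimal sketch in Python; the function name and the use of NumPy are our own choices for illustration, not an interface from the paper.

```python
import numpy as np

def randomized_response(labels, num_classes, epsilon, rng=None):
    """Release labels under epsilon-label-DP via K-ary randomized response.

    Each label is kept with probability e^eps / (e^eps + K - 1) and is
    otherwise replaced by a uniformly random *other* class, which yields
    the e^eps likelihood ratio required for epsilon-label-DP.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + num_classes - 1)
    flip = rng.random(labels.shape) >= p_keep
    # Shift by a random offset in [1, K-1] to pick a different class.
    offsets = rng.integers(1, num_classes, size=labels.shape)
    return np.where(flip, (labels + offsets) % num_classes, labels)
```

Note that for small epsilon or many classes, p_keep approaches 1/K and most released labels are wrong, which is exactly the noise the paper's framework tries to reduce.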
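A minimal sketch of the noise-reduction step described above, assuming the base classifier f0 outputs class probabilities: noisy labels that f0 deems unlikely are dropped, and the corresponding examples are treated as unlabeled in a subsequent semi-supervised training run. The threshold hyperparameter and function name are hypothetical, not the paper's exact procedure.

```python
import numpy as np

def denoise_labels(f0_probs, noisy_labels, keep_threshold=0.1):
    """Filter out noisy labels that the base classifier f0 contradicts.

    f0_probs: (n, K) array of class probabilities predicted by f0.
    noisy_labels: (n,) array of epsilon-label-DP labels (e.g. from
        randomized response).
    keep_threshold: hypothetical cutoff; a noisy label is dropped when
        f0 assigns it probability below this value.

    Returns a boolean mask over the n examples: True means the noisy
    label is kept for supervised training, False means the example is
    treated as unlabeled by the semi-supervised learner.
    """
    noisy_labels = np.asarray(noisy_labels)
    conf_in_label = f0_probs[np.arange(len(noisy_labels)), noisy_labels]
    return conf_in_label >= keep_threshold
```

Because the noisy labels already satisfy epsilon-label-DP and f0 does not touch the true labels, this filtering (and any semi-supervised training on its output) is post-processing and preserves the privacy guarantee.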
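The clustering alternative adds noise only to per-cluster majority votes rather than to individual labels. The sketch below is one plausible instantiation under our own assumptions (Laplace report-noisy-max over each cluster's label histogram); the paper's exact voting mechanism may differ.

```python
import numpy as np

def noisy_cluster_labels(cluster_ids, labels, num_classes, epsilon, rng=None):
    """Assign one label per cluster via a noisy majority vote.

    Changing one example's label moves one unit of mass between two
    counts of a single cluster's histogram (L1 sensitivity 2), so adding
    Lap(2/epsilon) noise to the counts before the argmax makes each
    cluster's vote epsilon-label-DP. Each label feeds exactly one
    cluster, so by parallel composition the full release also costs
    a single epsilon.
    """
    rng = np.random.default_rng() if rng is None else rng
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    out = {}
    for c in np.unique(cluster_ids):
        counts = np.bincount(labels[cluster_ids == c], minlength=num_classes)
        noisy = counts + rng.laplace(scale=2.0 / epsilon, size=num_classes)
        out[c] = int(np.argmax(noisy))
    return out
```

The utility gain comes from sensitivity: a cluster of m examples aggregates m votes but a single label change still shifts the histogram by at most 2, so the relative noise shrinks as clusters grow, whereas randomized response perturbs every label independently.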

Keywords: Differential Privacy; Label Differential Privacy

Copyright in PoPETs articles is held by their authors. This article is published under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license.