Sanitization or Deception? Rethinking Privacy Protection in Large Language Models

Authors: Bipin Paudel (Kansas State University), Bishwas Mandal (Kansas State University), George Amariucai (Kansas State University), Shuangqing Wei (Louisiana State University)

Volume: 2026
Issue: 1
Pages: 154–174
DOI: https://doi.org/10.56553/popets-2026-0009


Abstract: Large language models (LLMs) have demonstrated considerable capabilities across many tasks, but their capacity to infer sensitive user information from text raises significant privacy concerns. While recent approaches have explored sanitizing text to hide private features, a deeper challenge remains: distinguishing true privacy preservation from deceptive transformations. In this paper, we investigate whether LLM-based sanitization reduces private feature leakage without misleading an adversary into confidently predicting incorrect labels. Using an LLM as both the sanitizer and the adversary, we measure leakage with two entropy-based metrics: Empirical Average Objective Leakage (E-AOL) and Empirical Average Confidence Boost (E-ACB). These allow us to quantify not only how accurate adversarial predictions are, but also how confident they remain after sanitization. We posit that deception, while reducing adversarial accuracy, also increases the adversary's confidence in incorrect inferences, and hence reduced accuracy alone should not be interpreted as true privacy. We show that while current LLMs can hide private features, their transformations sometimes cause deception. Finally, we evaluate the semantic utility of sanitized outputs using sentence embeddings, LLM-based similarity judgments, and standard metrics such as BLEU and ROUGE. Our findings emphasize the importance of explicitly distinguishing between privacy and deception in LLM-based sanitization and provide a framework for evaluating this distinction under realistic adversarial conditions.
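The full definitions of E-AOL and E-ACB appear in the paper itself; the abstract's central idea, that an adversary's post-sanitization confidence matters as much as its accuracy, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: the adversary posterior, the min-entropy-based confidence shift, the example texts, and the utility checks via sentence embeddings, BLEU, and ROUGE are illustrative assumptions built on standard libraries (sentence-transformers, sacrebleu, rouge-score).

```python
# Illustrative sketch only -- NOT the paper's E-AOL / E-ACB implementation.
# Assumes sentence-transformers, sacrebleu, and rouge-score are installed.
import numpy as np
from sentence_transformers import SentenceTransformer, util
import sacrebleu
from rouge_score import rouge_scorer

def min_entropy_bits(posterior):
    """Min-entropy of an adversary's posterior over private labels (bits)."""
    p = np.asarray(posterior, dtype=float)
    p = p / p.sum()
    return -np.log2(p.max())

def confidence_shift(prior, posterior):
    """Drop in min-entropy from prior to post-sanitization posterior.
    A positive value means the adversary became MORE confident after
    sanitization, even if its top guess is now wrong -- the 'deception'
    regime the abstract warns about."""
    return min_entropy_bits(prior) - min_entropy_bits(posterior)

# Toy example: uniform prior over 4 labels vs. a confidently (possibly wrong) posterior.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.05, 0.85, 0.05, 0.05]
print(f"confidence shift: {confidence_shift(prior, posterior):.2f} bits")

# Semantic utility of a sanitized text relative to the original (hypothetical texts).
original = "Alice, a 34-year-old nurse from Boston, loves hiking."
sanitized = "A healthcare worker from the East Coast enjoys outdoor activities."

embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode([original, sanitized], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

bleu = sacrebleu.sentence_bleu(sanitized, [original]).score
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(original, sanitized)["rougeL"].fmeasure

print(f"embedding cosine: {cosine:.3f}  BLEU: {bleu:.1f}  ROUGE-L F1: {rouge_l:.3f}")
```

Under this reading, a large positive confidence shift combined with an incorrect top prediction would indicate deception rather than genuine privacy, which is why the paper argues that reduced adversarial accuracy alone is insufficient evidence of protection.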

Keywords: textual privacy, privacy vs deception, min-entropy metrics

Copyright in PoPETs articles is held by their authors. This article is published under a Creative Commons Attribution 4.0 license.