PROVGEN: A Privacy-Preserving Approach for Outcome Validation in Genomic Research
Authors: Yuzhou Jiang (Case Western Reserve University), Tianxi Ji (Texas Tech University), Erman Ayday (Case Western Reserve University)
Volume: 2026
Issue: 2
Pages: 642–661
DOI: https://doi.org/10.56553/popets-2026-0064
Abstract: As genomic research has grown increasingly popular in recent years, dataset sharing has remained limited due to privacy concerns. This limitation hinders the reproducibility and validation of research outcomes, both of which are essential for identifying computational errors during the research process. In this paper, we introduce PROVGEN, a privacy-preserving method for sharing genomic datasets that facilitates reproducibility and outcome validation in genome-wide association studies (GWAS). Our approach encodes genomic data into binary space and applies a two-stage process. First, we generate a differentially private version of the dataset using an XOR-based mechanism tailored to biological characteristics. Second, we restore data utility by adjusting the Minor Allele Frequency (MAF) values in the noisy dataset to align with public MAFs using optimal transport. Finally, we convert the processed binary data back into its genomic representation and publish the resulting dataset. We evaluate PROVGEN on three real-world genomic datasets and compare it with local differential privacy and three synthesis-based methods. Our results show that PROVGEN overall outperforms existing approaches in detecting GWAS outcome errors, preserving data fidelity, and resisting membership inference attacks (MIAs). By adopting our method, genomic researchers will be inclined to share differentially private datasets while maintaining high data quality for reproducibility of their findings.
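The binary encoding and XOR-noise stage described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything below is an assumption made for illustration: the two-bit genotype encoding, the function names (encode, decode, xor_noise, perturb_column, maf), and the use of independent Bernoulli bit flips in place of the paper's XOR-based mechanism tailored to biological characteristics. The optimal-transport MAF adjustment is omitted, and epsilon only sets the per-bit flip probability; no claim is made about the overall privacy budget.

```python
import math
import random


def encode(genotype):
    """Map a minor-allele count (0, 1, or 2) to two bits (assumed encoding)."""
    return {0: (0, 0), 1: (0, 1), 2: (1, 1)}[genotype]


def decode(bits):
    """Map two bits back to a minor-allele count by summing them."""
    return bits[0] + bits[1]


def xor_noise(bits, epsilon, rng):
    """XOR each bit with Bernoulli(p) noise, where p = 1 / (1 + e^epsilon).

    Simplified stand-in for the paper's mechanism: bits are flipped
    independently, with no biological structure taken into account.
    """
    p = 1.0 / (1.0 + math.exp(epsilon))
    return tuple(b ^ (1 if rng.random() < p else 0) for b in bits)


def perturb_column(column, epsilon, rng):
    """Encode, perturb, and decode one SNP column of 0/1/2 counts."""
    return [decode(xor_noise(encode(g), epsilon, rng)) for g in column]


def maf(column):
    """Minor allele frequency of one SNP column: allele count / (2 * samples)."""
    return sum(column) / (2 * len(column))


if __name__ == "__main__":
    rng = random.Random(0)                   # fixed seed for repeatability
    column = [0, 1, 2, 0, 0, 1, 0, 2, 1, 0]  # hypothetical genotypes
    noisy = perturb_column(column, epsilon=1.0, rng=rng)
    # The noisy MAF generally drifts from the original one; this drift is
    # what the second stage of the method corrects against public MAFs.
    print(maf(column), maf(noisy))
```

Comparing maf(column) with maf(noisy) for a small epsilon shows why a utility-restoration step is needed: random bit flips shift the allele frequencies that GWAS statistics depend on.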
Keywords: Genomic privacy, differential privacy, genome-wide association studies, reproducibility
Copyright in PoPETs articles is held by their authors. This article is published under a Creative Commons Attribution 4.0 license.