Dataset-to-Dataset Evaluation Before (and Without) Sharing Data
DOI: https://doi.org/10.1609/aies.v8i1.36604

Abstract
Privacy concerns and competitive interests impede data access for machine learning, owing to the inability to privately assess external data's utility. This dynamic disadvantages smaller organizations that lack the resources to aggressively pursue data-sharing agreements. In data-limited scenarios, not all external data is beneficial, and collaborations suffer especially in heavily regulated domains: metrics that assess external data given a source (e.g., by approximating their KL-divergence) require access to samples from both entities pre-collaboration, hence violating privacy. This conundrum disempowers legitimate data-sharing, leading to a false "privacy-utility trade-off". To resolve the privacy and uncertainty tensions simultaneously, we introduce SecureKL, the first secure protocol for dataset-to-dataset evaluation with zero privacy leakage, designed to be applied before data sharing. SecureKL evaluates a source dataset against candidates, computing dataset divergence metrics internally via private computation, all without assuming downstream models. On real-world data, SecureKL achieves high consistency (>90% correlation with non-private counterparts) and successfully identifies beneficial data collaborations in highly heterogeneous domains (ICU mortality prediction across hospitals and income prediction across states). Our results highlight that secure computation maximizes data utilization, outperforming privacy-agnostic utility assessments that leak information.
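For intuition, the abstract's mention of "approximating their KL-divergence" between a source dataset and a candidate can be illustrated with a standard non-private k-nearest-neighbor KL estimator (in the style of Wang, Kulkarni, and Verdú). This is only a sketch of the underlying divergence metric on plaintext samples; the paper's contribution, SecureKL, performs this kind of computation under secure protocols without exposing either party's data, which is not shown here.

```python
import numpy as np

def knn_kl_divergence(x, y, k=3):
    """Estimate KL(P || Q) from samples x ~ P and y ~ Q using k-NN
    distances. Plaintext illustration only -- not the secure protocol."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = x.shape
    m = y.shape[0]
    # Pairwise Euclidean distances within x and from x to y.
    dxx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dxx, np.inf)  # exclude each point's zero self-distance
    dxy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    rho = np.sort(dxx, axis=1)[:, k - 1]  # k-th NN distance within x
    nu = np.sort(dxy, axis=1)[:, k - 1]   # k-th NN distance into y
    # Divergence estimate; ~0 when the datasets match, large when they differ.
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))
```

A source would score each candidate dataset this way and prefer low-divergence partners; SecureKL's point is that the same ranking can be recovered with zero privacy leakage, pre-collaboration.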
Published
2025-10-15
How to Cite
Fuentes, K., Xu, M., & Chen, I. Y. (2025). Dataset-to-Dataset Evaluation Before (and Without) Sharing Data. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(1), 963-977. https://doi.org/10.1609/aies.v8i1.36604