Dataset-to-Dataset Evaluation Before (and Without) Sharing Data
DOI: https://doi.org/10.1609/aies.v8i1.36604

Abstract
Privacy concerns and competitive interests impede data access for machine learning, owing to the inability to privately assess external data's utility. This dynamic disadvantages smaller organizations that lack the resources to aggressively pursue data-sharing agreements. In data-limited scenarios, not all external data is beneficial, and collaborations suffer especially in heavily regulated domains: metrics that assess external data given a source (e.g., by approximating their KL-divergence) require access to samples from both entities pre-collaboration, hence violating privacy. This conundrum disempowers legitimate data-sharing, leading to a false "privacy-utility trade-off". To resolve the privacy and uncertainty tensions simultaneously, we introduce SecureKL, the first secure protocol for dataset-to-dataset evaluation with zero privacy leakage, designed to be applied before data sharing. SecureKL evaluates a source dataset against candidates, computing dataset divergence metrics internally via private computation, all without assuming downstream models. On real-world data, SecureKL achieves high consistency (>90% correlation with non-private counterparts) and successfully identifies beneficial data collaborations in highly heterogeneous domains (ICU mortality prediction across hospitals and income prediction across states). Our results highlight that secure computation maximizes data utilization, outperforming privacy-agnostic utility assessments that leak information.
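For intuition, the abstract's mention of "approximating their KL-divergence" between a source dataset and a candidate can be illustrated with a standard non-private k-nearest-neighbor KL estimator (in the style of Wang, Kulkarni, and Verdú). This is only a sketch of the underlying divergence metric on plaintext samples; the paper's contribution, SecureKL, performs this kind of computation under secure protocols without exposing either party's data, which is not shown here.

```python
import numpy as np

def knn_kl_divergence(x, y, k=3):
    """Estimate KL(P || Q) from samples x ~ P and y ~ Q using k-NN
    distances. Plaintext illustration only -- not the secure protocol."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = x.shape
    m = y.shape[0]
    # Pairwise Euclidean distances within x and from x to y.
    dxx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dxx, np.inf)  # exclude each point's zero self-distance
    dxy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    rho = np.sort(dxx, axis=1)[:, k - 1]  # k-th NN distance within x
    nu = np.sort(dxy, axis=1)[:, k - 1]   # k-th NN distance into y
    # Divergence estimate; ~0 when the datasets match, large when they differ.
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))
```

A source would score each candidate dataset this way and prefer low-divergence partners; SecureKL's point is that the same ranking can be recovered with zero privacy leakage, pre-collaboration.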
Published
2025-10-15
How to Cite
Fuentes, K., Xu, M., & Chen, I. Y. (2025). Dataset-to-Dataset Evaluation Before (and Without) Sharing Data. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(1), 963-977. https://doi.org/10.1609/aies.v8i1.36604