A Machine Learning Based System for Semi-Automatically Redacting Documents


  • Chad Cumby Accenture Technology Labs
  • Rayid Ghani Accenture Technology Labs




Redacting text documents has traditionally been a mostly manual activity, making it expensive and prone to disclosure risks. This paper describes a semi-automated system to ensure a specified level of privacy in text data sets. Recent work has attempted to quantify the likelihood of privacy breaches for text data. We build on these notions to provide a means of obstructing such breaches by framing redaction as a multi-class classification problem. Our system gives users fine-grained control over the level of privacy needed to obstruct sensitive concepts present in the data. Additionally, our system is designed to respect a user-defined utility metric on the data (such as disclosure of a particular concept), which our methods try to maximize while anonymizing. We describe our redaction framework and algorithms, as well as a prototype tool built into Microsoft Word that allows enterprise users to redact documents before sharing them internally and to obscure client-specific information. In addition, we present an experimental evaluation on publicly available data sets that shows the effectiveness of our approach against both automated attackers and human subjects. The results show that we are able to preserve the utility of a text corpus while reducing the disclosure risk of the sensitive concept.




How to Cite

Cumby, C., & Ghani, R. (2011). A Machine Learning Based System for Semi-Automatically Redacting Documents. Proceedings of the AAAI Conference on Artificial Intelligence, 25(2), 1628-1635. https://doi.org/10.1609/aaai.v25i2.18851