RR-PU: A Synergistic Two-Stage Positive and Unlabeled Learning Framework for Robust Tax Evasion Detection

Authors

  • Shuzhi Cao, School of Computer Science and Technology, Xi'an Jiaotong University, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China
  • Jianfei Ruan, School of Computer Science and Technology, Xi'an Jiaotong University, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China
  • Bo Dong, Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China; School of Distance Education, Xi'an Jiaotong University, China
  • Bin Shi, School of Computer Science and Technology, Xi'an Jiaotong University, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China
  • Qinghua Zheng, School of Computer Science and Technology, Xi'an Jiaotong University, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China

DOI:

https://doi.org/10.1609/aaai.v38i8.28665

Keywords:

DMKM: Anomaly/Outlier Detection, ML: Semi-Supervised Learning

Abstract

Tax evasion, an unlawful practice in which taxpayers deliberately conceal information to avoid paying their tax liabilities, poses significant challenges for tax authorities. Effective tax evasion detection is critical for helping tax authorities mitigate tax revenue loss. Recently, machine-learning-based methods, particularly those employing positive and unlabeled (PU) learning, have been adopted for tax evasion detection with notable success. However, these methods exhibit two major practical limitations. First, their success relies heavily on the strong assumption that the label frequency (the fraction of identified taxpayers among tax evaders) is known in advance. Second, although some methods drop this assumption and estimate the label frequency with approaches such as Mixture Proportion Estimation (MPE), they then construct a classifier on the error-prone estimate. This two-stage approach may not be optimal, as it neglects the error that accumulates in classifier training from the estimation bias of the first stage. To address these limitations, we propose a novel PU learning-based tax evasion detection framework called RR-PU, which corrects this bias in a two-stage synergistic manner. Specifically, RR-PU refines the label frequency initialization by using a regrouping technique to strengthen the MPE-based estimate. Subsequently, we integrate a trainable slack variable to fine-tune the initial label frequency, optimizing this variable and the classifier concurrently to eliminate latent bias from the initial stage. Experimental results on three real-world tax datasets demonstrate that RR-PU outperforms state-of-the-art methods in tax evasion detection tasks.
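As a rough illustration of why the label frequency matters, the sketch below computes it on toy data and shows how a biased estimate changes which taxpayers are flagged. It is not the RR-PU method: it assumes the common "selected completely at random" setting, which the abstract does not state, under which p(y=1|x) = p(s=1|x) / c, where s marks identified evaders and c is the label frequency. The function names, the toy arrays, and the 0.5 threshold are all hypothetical.

```python
import numpy as np


def label_frequency(y_true, s_labeled):
    """Fraction of true positives (evaders) that carry a label."""
    y_true = np.asarray(y_true, dtype=bool)
    s_labeled = np.asarray(s_labeled, dtype=bool)
    return s_labeled[y_true].mean()


def positive_posterior(p_labeled, c):
    """Map p(s=1|x) to p(y=1|x) = p(s=1|x) / c, clipped to [0, 1].

    Assumes labels are selected completely at random among positives.
    """
    return np.clip(np.asarray(p_labeled, dtype=float) / c, 0.0, 1.0)


# Toy ground truth: 4 evaders, of which 2 have been identified (labeled).
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
s = np.array([1, 1, 0, 0, 0, 0, 0, 0])
c_true = label_frequency(y, s)

# Hypothetical outputs of a classifier trained to predict "labeled or not".
p_labeled = np.array([0.45, 0.30, 0.20, 0.05])

# Rescale with the true label frequency and with an overestimated one.
with_true_c = positive_posterior(p_labeled, c_true)
with_biased_c = positive_posterior(p_labeled, 0.8)

# Count taxpayers flagged as likely evaders at a 0.5 threshold.
flagged_true = int((with_true_c >= 0.5).sum())
flagged_biased = int((with_biased_c >= 0.5).sum())
print(c_true, flagged_true, flagged_biased)
```

In this toy case, overestimating c shrinks every posterior and fewer taxpayers cross the threshold, which is the kind of first-stage error that carries over into the classifier when the estimate is treated as fixed.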

Published

2024-03-24

How to Cite

Cao, S., Ruan, J., Dong, B., Shi, B., & Zheng, Q. (2024). RR-PU: A Synergistic Two-Stage Positive and Unlabeled Learning Framework for Robust Tax Evasion Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 38(8), 8246-8254. https://doi.org/10.1609/aaai.v38i8.28665

Section

AAAI Technical Track on Data Mining & Knowledge Management