Counterfactual Learning with General Data-Generating Policies

Authors

  • Yusuke Narita, Yale University
  • Kyohei Okumura, Northwestern University
  • Akihiro Shimizu, Mercari, Inc.
  • Kohei Yata, University of Wisconsin-Madison

DOI:

https://doi.org/10.1609/aaai.v37i8.26113

Keywords:

ML: Reinforcement Learning Theory, APP: Business/Marketing/Advertising/E-Commerce, ML: Causal Learning, ML: Online Learning & Bandits, ML: Reinforcement Learning Algorithms

Abstract

Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full-support and deficient-support logging policies in contextual-bandit settings. This class includes deterministic bandit algorithms (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We validate our method with experiments on partly and entirely deterministic logging policies. Finally, we apply it to evaluate coupon targeting policies by a major online platform and show how to improve the existing policy.
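To make the problem setting concrete, below is a minimal sketch (not the authors' method) of standard inverse propensity weighting (IPW) OPE in a contextual-bandit setting; all function and variable names are illustrative assumptions. Plain IPW requires the logging policy to have full support over the actions the target policy would take, which is exactly the restriction the paper relaxes for deficient-support (e.g., deterministic) logging policies.

```python
# Illustrative sketch only: baseline IPW off-policy evaluation.
# Not the paper's estimator; it assumes a full-support logging policy.
import numpy as np

def ipw_value_estimate(rewards, logging_probs, target_probs):
    """Estimate the value of a target policy from logged bandit data.

    rewards       : observed rewards for the logged actions
    logging_probs : P(logged action | context) under the logging policy
    target_probs  : P(logged action | context) under the target policy
    """
    weights = target_probs / logging_probs      # importance weights
    return float(np.mean(weights * rewards))    # sample-average value estimate

# Hypothetical usage with synthetic logged data (two actions)
rng = np.random.default_rng(0)
n = 10_000
actions = rng.integers(0, 2, size=n)            # uniform logging policy
logging_probs = np.full(n, 0.5)
rewards = rng.binomial(1, 0.3 + 0.4 * actions)  # action 1 is better on average
target_probs = np.where(actions == 1, 0.9, 0.1) # target policy favors action 1
print(ipw_value_estimate(rewards, logging_probs, target_probs))
```

If the logging policy were deterministic, some entries of `logging_probs` would be zero for actions the target policy takes, and this baseline estimator would be undefined; the paper's contribution is an OPE method that remains consistent in such deficient-support cases.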

Published

2023-06-26

How to Cite

Narita, Y., Okumura, K., Shimizu, A., & Yata, K. (2023). Counterfactual Learning with General Data-Generating Policies. Proceedings of the AAAI Conference on Artificial Intelligence, 37(8), 9286-9293. https://doi.org/10.1609/aaai.v37i8.26113

Section

AAAI Technical Track on Machine Learning III