UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution

Authors

  • Gengrui Zhang, Kuaishou Technology
  • Yao Wang, Kuaishou Technology
  • Xiaoshuang Chen, Kuaishou Technology
  • Hongyi Qian, Beihang University
  • Kaiqiao Zhan, Kuaishou Technology
  • Ben Wang, Kuaishou Technology

DOI:

https://doi.org/10.1609/aaai.v38i8.28783

Keywords:

DMKM: Recommender Systems

Abstract

In recent years, there has been growing interest in using reinforcement learning (RL) to optimize long-term rewards in recommender systems. Because industrial recommender systems are typically designed as multi-stage systems, single-agent RL methods face challenges when optimizing multiple stages simultaneously: different stages have different observation spaces and thus cannot be modeled by a single agent. To address this issue, we propose a novel UNidirectional-EXecution-based multi-agent Reinforcement Learning (UNEX-RL) framework to reinforce long-term rewards in multi-stage recommender systems. We show that unidirectional execution is a key feature of multi-stage recommender systems and that it brings new challenges to the application of multi-agent reinforcement learning (MARL), namely observation dependency and the cascading effect. To tackle these challenges, we provide a cascading information chain (CIC) method that separates independent observations from action-dependent observations, and we use CIC to train UNEX-RL effectively. We also discuss practical variance reduction techniques for UNEX-RL. Finally, we show the effectiveness of UNEX-RL on both public datasets and an online recommender system with over 100 million users. Specifically, UNEX-RL achieves a 0.558% increase in users' usage time compared with single-agent RL algorithms in online A/B experiments, highlighting its effectiveness in industrial recommender systems.
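
The unidirectional execution and observation dependency described in the abstract can be illustrated with a toy pipeline. The sketch below is not the UNEX-RL algorithm or the CIC method; it is a minimal illustration, assuming each stage simply keeps its top-k candidates under a fixed scoring function. The names `run_pipeline`, `stage_sizes`, and `score_fns` are hypothetical and do not come from the paper.

```python
def run_pipeline(candidates, stage_sizes, score_fns):
    """Pass candidates through the stages in one direction only.

    Each stage observes only what the previous stage passed on, so its
    observation depends on the upstream action (observation dependency),
    and an item dropped early can never be recovered later (cascading effect).
    """
    observations = []
    current = list(candidates)
    for k, score in zip(stage_sizes, score_fns):
        observations.append(list(current))  # what this stage observes
        # This stage's "action": keep its k highest-scoring candidates.
        current = sorted(current, key=score, reverse=True)[:k]
    return current, observations


# Hypothetical three-stage setup (e.g. retrieval, ranking, re-ranking).
items = list(range(10))
stage_sizes = [6, 3, 1]
score_fns = [
    lambda x: -abs(x - 7),  # stage 1 prefers items near 7
    lambda x: x % 5,        # stage 2 uses a different criterion
    lambda x: x,            # stage 3 prefers the largest id
]

final, obs = run_pipeline(items, stage_sizes, score_fns)
for i, seen in enumerate(obs, start=1):
    print(f"stage {i} observes {len(seen)} candidates: {seen}")
print("final recommendation:", final)
```

In this toy setup, changing only the first stage's scoring function changes what every later stage observes, which is the kind of dependency a per-stage agent would need to account for.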

Published

2024-03-24

How to Cite

Zhang, G., Wang, Y., Chen, X., Qian, H., Zhan, K., & Wang, B. (2024). UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution. Proceedings of the AAAI Conference on Artificial Intelligence, 38(8), 9305-9313. https://doi.org/10.1609/aaai.v38i8.28783

Issue

Vol. 38 No. 8 (2024)

Section

AAAI Technical Track on Data Mining & Knowledge Management