NondBREM: Nondeterministic Offline Reinforcement Learning for Large-Scale Order Dispatching

Hongbo Zhang; Guang Wang; Xu Wang; Zhengyang Zhou; Chen Zhang; Zheng Dong; Yang Wang

doi:10.1609/aaai.v38i1.27794

Authors

Hongbo Zhang, University of Science and Technology of China
Guang Wang, Florida State University
Xu Wang, University of Science and Technology of China
Zhengyang Zhou, University of Science and Technology of China
Chen Zhang, University of Science and Technology of China
Zheng Dong, Wayne State University
Yang Wang, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v38i1.27794

Keywords:

APP: Transportation, APP: Mobility, Driving & Flight, DMKM: Mining of Spatial, Temporal or Spatio-Temporal Data

Abstract

One of the most important tasks in ride-hailing is order dispatching, i.e., assigning unserved orders to available drivers. Order dispatching has improved significantly in recent years thanks to advances in reinforcement learning, which has proven effective for sequential decision-making problems of this kind. However, most existing reinforcement learning methods require agents to learn the optimal policy by interacting with the environment online, which is challenging or impractical for real-world deployment due to high costs or safety concerns. For example, because supply and demand are spatiotemporally unbalanced, online reinforcement learning-based order dispatching may significantly harm the ride-hailing platform's revenue and the passenger experience during the policy learning period. Hence, in this work, we develop an offline deep reinforcement learning framework called NondBREM for large-scale order dispatching, which learns a policy solely from accumulated logged data and thereby avoids costly and unsafe interactions with the environment. In NondBREM, a Nondeterministic Batch-Constrained Q-learning (NondBCQ) module is developed to reduce extrapolation error, and a Random Ensemble Mixture (REM) module that integrates multiple value networks with multi-head networks is used to improve model generalization and robustness. Extensive experiments on large-scale real-world ride-hailing datasets show the superiority of our design.
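
The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the nondeterministic variant (NondBCQ) works, so the code below assumes the standard discrete batch-constrained filter (keep only actions whose estimated behavior-policy probability is at least a fraction `threshold` of the most likely action's) and the standard REM mixing step (a random convex combination of the Q-value heads). The function names, the toy numbers, and the value `threshold=0.3` are all hypothetical.

```python
import random


def rem_q_values(head_q_values, rng):
    """Mix K per-head Q-value vectors with random convex weights (REM-style)."""
    raw = [rng.random() for _ in head_q_values]
    total = sum(raw)
    weights = [r / total for r in raw]  # non-negative weights that sum to 1
    n_actions = len(head_q_values[0])
    return [
        sum(w * head[a] for w, head in zip(weights, head_q_values))
        for a in range(n_actions)
    ]


def batch_constrained_action(q_values, behavior_probs, threshold=0.3):
    """Pick the highest-Q action among those the logged data supports.

    An action is kept only if its estimated probability under the logged
    (behavior) policy is at least `threshold` times that of the most
    likely action. This is the assumed standard discrete BCQ filter.
    """
    max_prob = max(behavior_probs)
    allowed = [
        a for a, p in enumerate(behavior_probs) if p / max_prob >= threshold
    ]
    return max(allowed, key=lambda a: q_values[a])


# Toy example: two Q-heads over three candidate actions (hypothetical values).
rng = random.Random(0)
heads = [[1.0, 5.0, 2.0], [1.5, 6.0, 2.5]]
behavior = [0.6, 0.05, 0.35]  # action 1 is rarely seen in the logged data

mixed = rem_q_values(heads, rng)
unconstrained = max(range(len(mixed)), key=lambda a: mixed[a])
action = batch_constrained_action(mixed, behavior)
```

In this toy case the unconstrained argmax is action 1, whose high Q-value rests on almost no logged evidence. The filter excludes it and selects action 2 instead, which is the kind of extrapolation error a batch-constrained method is meant to avoid.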
