MetaTrader: Learning to Generalize RL Trading Policies Beyond Offline Data

Authors

  • Haochen Yuan, Shanghai Jiao Tong University
  • Minting Pan, Shanghai Jiao Tong University
  • Yunbo Wang, Shanghai Jiao Tong University
  • Siyu Gao, China International Capital Corporation Limited
  • Xiaokang Yang, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v40i33.40027

Abstract

Reinforcement learning (RL) has shown significant promise in sequential portfolio optimization. A typical solution involves optimizing cumulative returns using historical offline data. However, it may produce less generalizable policies that merely "memorize" optimal buying and selling actions from the offline data while neglecting the non-stationary nature of the financial market. We frame portfolio optimization of stock data as a specific type of offline RL problem. Our method, MetaTrader, presents two key contributions. First, it introduces a novel bilevel RL algorithm that operates on both the original stock data and its transformations. The core idea is that a robust policy should generalize effectively to out-of-distribution data. Second, we propose a new temporal difference (TD) method that leverages a transformation-based conservative TD target to address value overestimation under limited offline data. Empirical results on two publicly available datasets demonstrate that MetaTrader outperforms existing methods, including both traditional stock prediction models and RL-based trading approaches.
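
The abstract does not spell out how the conservative TD target is computed. As a rough illustration only, one common way to make a TD target conservative is to bootstrap from the most pessimistic next-state value across several views of the same state. The sketch below assumes that the views are the original data and its transformations and that the minimum is the aggregation rule; the function name, its parameters, and the min rule are all assumptions for illustration and may differ from what MetaTrader actually does.

```python
def conservative_td_target(reward, next_q_views, gamma=0.99, done=False):
    """Hypothetical conservative TD target (illustrative, not the paper's definition).

    reward:       immediate reward for the transition
    next_q_views: next-state value estimates, one per view of the data
                  (assumed here: the original series plus its transformations)
    gamma:        discount factor
    done:         True if the transition ends the episode
    """
    if not next_q_views:
        raise ValueError("need at least one next-state value estimate")
    # Assumption: pessimism is implemented by taking the smallest estimate,
    # which limits overestimation when offline data is scarce.
    pessimistic_next_q = min(next_q_views)
    # Standard one-step TD target; no bootstrapping past a terminal state.
    bootstrap = 0.0 if done else gamma * pessimistic_next_q
    return reward + bootstrap


# Example: three views disagree on the next-state value; the lowest one is used.
target = conservative_td_target(1.0, [2.0, 1.5, 3.0], gamma=0.5)
```

Taking the minimum is the same idea as clipped double Q-learning, applied here across data views rather than across critics; a softer aggregation (for example a mean minus a spread penalty) would be an equally plausible reading of "conservative".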

Published

2026-03-14

How to Cite

Yuan, H., Pan, M., Wang, Y., Gao, S., & Yang, X. (2026). MetaTrader: Learning to Generalize RL Trading Policies Beyond Offline Data. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28023–28031. https://doi.org/10.1609/aaai.v40i33.40027

Section

AAAI Technical Track on Machine Learning X