DiffOP: Reinforcement Learning of Optimization-Based Control Policies via Implicit Policy Gradients

Authors

  • Yuexin Bian, University of California, San Diego
  • Jie Feng, University of California, San Diego
  • Yuanyuan Shi, University of California, San Diego

DOI

https://doi.org/10.1609/aaai.v40i24.39055

Abstract

Real-world control systems require policies that are not only high-performing but also interpretable and robust. A promising direction toward this goal is model-based control, which learns system dynamics and cost functions from historical data and then uses these models to inform decision-making. Building on this paradigm, we introduce DiffOP, a novel framework in which the control policy is defined implicitly as the solution of an optimization-based control problem. Without relying on value function approximation, DiffOP jointly learns the cost and dynamics models and directly optimizes the actual control costs using policy gradients. To enable this, we derive analytical policy gradients by applying implicit differentiation to the underlying optimization problem and integrating the result with the standard policy gradient framework. Under standard regularity conditions, we establish that DiffOP converges to an ε-stationary point within O(1/ε) iterations. We demonstrate the effectiveness of DiffOP through experiments on nonlinear control tasks and constrained power system voltage control.
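
A brief sketch of the implicit-gradient idea (illustrative only, not taken verbatim from the paper; the symbols C_θ, u*, and x below are assumptions): suppose the policy returns u*(x; θ) = argmin_u C_θ(x, u) for an unconstrained inner problem. Differentiating the stationarity condition ∇_u C_θ(x, u*) = 0 via the implicit function theorem gives

\[
\frac{\partial u^*}{\partial \theta} = -\left[\nabla^2_{uu} C_\theta(x, u^*)\right]^{-1} \nabla^2_{u\theta} C_\theta(x, u^*),
\]

which can then be chained into a standard policy-gradient estimate of the closed-loop cost; constrained problems would instead differentiate the KKT conditions of the inner optimization.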

Published

2026-03-14

How to Cite

Bian, Y., Feng, J., & Shi, Y. (2026). DiffOP: Reinforcement Learning of Optimization-Based Control Policies via Implicit Policy Gradients. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24), 19737-19745. https://doi.org/10.1609/aaai.v40i24.39055

Issue

Vol. 40 No. 24 (2026)

Section

AAAI Technical Track on Machine Learning I