Gaussian Process Bandits with Aggregated Feedback

Mengyan Zhang; Russell Tsuchida; Cheng Soon Ong

doi:10.1609/aaai.v36i8.20892

Authors

Mengyan Zhang, The Australian National University; Data61, CSIRO
Russell Tsuchida, Data61, CSIRO
Cheng Soon Ong, Data61, CSIRO; The Australian National University

DOI:

https://doi.org/10.1609/aaai.v36i8.20892

Keywords:

Machine Learning (ML)

Abstract

We consider the continuum-armed bandit problem in a novel setting: recommending the best arms within a fixed budget under aggregated feedback. This is motivated by applications where precise rewards are impossible or expensive to obtain, while an aggregated reward or feedback, such as the average over a subset, is available. We constrain the set of reward functions by assuming that they are drawn from a Gaussian Process and propose the Gaussian Process Optimistic Optimisation (GPOO) algorithm. We adaptively construct a tree whose nodes are subsets of the arm space, where the feedback is the aggregated reward of a node's representatives. We propose a new notion of simple regret with respect to aggregated feedback on the recommended arms. We provide a theoretical analysis of the proposed algorithm and recover single-point feedback as a special case. We illustrate GPOO and compare it with related algorithms on simulated data.
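To make the idea of aggregated feedback on tree nodes concrete, here is a minimal sketch of a simplified, hypothetical optimistic tree search. It is not the authors' GPOO; it rests on several assumptions of our own:

- The arm space [0, 1] is discretised into 64 grid points.
- A node is a contiguous block of grid indices, and every grid point in a node counts as one of its representatives.
- Feedback is the mean reward over those points plus Gaussian noise.
- The prior is a squared-exponential kernel with lengthscale 0.1.
- Nodes are scored by a UCB rule (posterior mean plus `beta` times posterior standard deviation) and split in two after evaluation.

The names `optimistic_tree_search_sketch`, `averaging_vector` and `posterior` are illustrative only. The key fact the sketch relies on is that an average of Gaussian Process values is a linear functional, so it stays Gaussian and can be conditioned on exactly.

```python
import numpy as np

# Assumption: the arm space [0, 1] is discretised into 64 grid points, and a
# tree node is a contiguous block of grid indices [start, stop).
GRID = np.linspace(0.0, 1.0, 64)


def rbf_kernel(a, b, lengthscale=0.1):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def averaging_vector(start, stop):
    """Weights w such that w @ f is the mean of f over grid indices [start, stop).

    Assumption: every grid point in a node is one of its representatives.
    """
    w = np.zeros(len(GRID))
    w[start:stop] = 1.0 / (stop - start)
    return w


def posterior(K, A, y, noise):
    """Condition f ~ N(0, K) on aggregated observations y = A @ f + N(0, noise^2 I)."""
    S = A @ K @ A.T + noise ** 2 * np.eye(len(y))
    G = np.linalg.solve(S, A @ K).T  # equals K A^T S^{-1}, since K and S are symmetric
    mu = G @ y
    cov = K - G @ A @ K
    return mu, cov


def optimistic_tree_search_sketch(f_true, budget=12, noise=0.01, beta=2.0, seed=0):
    """Hypothetical UCB-style tree search with averaged feedback (not the authors' GPOO).

    Returns the arm interval [lo, hi] covered by the recommended leaf node.
    """
    rng = np.random.default_rng(seed)
    K = rbf_kernel(GRID, GRID)
    leaves = [(0, len(GRID))]  # start from the root node covering all arms
    rows, ys = [], []
    for _ in range(budget):
        if rows:
            mu, cov = posterior(K, np.array(rows), np.array(ys), noise)
        else:
            mu, cov = np.zeros(len(GRID)), K  # prior before any feedback

        def ucb(node):
            # Optimistic score of the node's average reward: mean + beta * std.
            w = averaging_vector(*node)
            return w @ mu + beta * np.sqrt(max(w @ cov @ w, 0.0))

        node = max(leaves, key=ucb)
        w = averaging_vector(*node)
        rows.append(w)
        # Aggregated feedback: the noisy average reward over the node's representatives.
        ys.append(w @ f_true + noise * rng.standard_normal())
        start, stop = node
        if stop - start > 1:  # split the evaluated node into two children
            mid = (start + stop) // 2
            leaves.remove(node)
            leaves += [(start, mid), (mid, stop)]

    # Recommend the leaf whose posterior mean of the average reward is highest.
    mu, _ = posterior(K, np.array(rows), np.array(ys), noise)
    start, stop = max(leaves, key=lambda n: averaging_vector(*n) @ mu)
    return float(GRID[start]), float(GRID[stop - 1])


if __name__ == "__main__":
    # Hypothetical reward function with a single peak near x = 0.7.
    f = np.exp(-((GRID - 0.7) / 0.05) ** 2)
    print(optimistic_tree_search_sketch(f))
```

In this sketch, only the evaluated node is split, and its unevaluated children are scored through the posterior alone. Setting a node to a single grid point reduces the feedback to ordinary single-point observations. The paper's GPOO may select, split and recommend nodes differently, and its simple-regret notion is not reproduced here.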
