Linear Contextual Bandits with Hybrid Payoffs: Revisited
Nirjhar Das and Gaurav Sinha
Accepted at ECML-PKDD 2024
paper /
abstract /
bibtex
We study the Linear Contextual Bandit ($\texttt{LinearCB}$) problem in the hybrid reward setting. In this setting, every arm's reward model contains arm-specific parameters in addition to parameters shared across the reward models of all the arms. We can easily reduce this setting to two closely related settings: (a) Shared - no arm-specific parameters, and (b) Disjoint - only arm-specific parameters, enabling the application of two popular state-of-the-art algorithms - $\texttt{LinUCB}$ and $\texttt{DisLinUCB}$ (proposed as Algorithm $1$ in Li et al. 2010). When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both $\texttt{LinUCB}$ and $\texttt{DisLinUCB}$ that significantly improve upon the known regret guarantees of these algorithms. Our novel analysis critically exploits the structure of the hybrid rewards and the diversity of the arm features. Along with proving these new guarantees, we introduce a new algorithm, $\texttt{HyLinUCB}$, that crucially modifies $\texttt{LinUCB}$ (using a new exploration coefficient) to account for sparsity in the hybrid setting. Under the same diversity assumptions, we prove that at the end of $T$ rounds, $\texttt{HyLinUCB}$ also incurs only $\tilde{O}(\sqrt{T})$ regret. We perform extensive experiments on synthetic and real-world datasets demonstrating strong empirical performance of $\texttt{HyLinUCB}$. When the number of arm-specific parameters is much larger than the number of shared parameters, we observe that $\texttt{DisLinUCB}$ incurs the lowest regret. In this case, the regret of $\texttt{HyLinUCB}$ is the second best and remains extremely competitive with $\texttt{DisLinUCB}$. In all other situations, including our real-world dataset, $\texttt{HyLinUCB}$ has significantly lower regret than $\texttt{LinUCB}$, $\texttt{DisLinUCB}$ and other state-of-the-art baselines we considered. We also empirically observe that the regret of $\texttt{HyLinUCB}$ grows much slower with the number of arms $K$ compared to the baselines, making it suitable even for very large action spaces.
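For concreteness, the hybrid payoff model studied here (in the spirit of Li et al. 2010; the symbols below are illustrative and may differ from the paper's exact notation) posits $\mathbb{E}[r_{t,a} \mid z_{t,a}, x_{t,a}] = z_{t,a}^\top \beta^* + x_{t,a}^\top \theta_a^*$, where $\beta^*$ is shared across all arms and $\theta_a^*$ is specific to arm $a$. Setting $\beta^* = 0$ recovers the Disjoint setting, while setting every $\theta_a^* = 0$ recovers the Shared setting.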
@InProceedings{das2024linear,
author="Das, Nirjhar, Sinha, Gaurav",
title="Linear Contextual Bandits with Hybrid Payoffs: Revisited",
booktitle="European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases",
year="2024"
}
Generalized Linear Contextual Bandits with Limited Adaptivity
Ayush Sawarni, Nirjhar Das, Siddharth Barman, and Gaurav Sinha
Spotlight at NeurIPS 2024
paper /
abstract /
bibtex
We study the generalized linear contextual bandit problem within the constraints of limited adaptivity. In this paper, we present two algorithms, $\texttt{B-GLinCB}$ and $\texttt{RS-GLinCB}$, that address, respectively, two prevalent limited adaptivity models: batch learning with stochastic contexts and rare policy switches with adversarial contexts. For both of these models, we establish essentially tight regret bounds. Notably, in the obtained bounds, we manage to eliminate a dependence on a key parameter $\kappa$, which captures the non-linearity of the underlying reward model. For our batch learning algorithm $\texttt{B-GLinCB}$, with $\Omega\left( \log{\log T} \right)$ batches, the regret scales as $\tilde{O}(\sqrt{T})$. Further, we establish that our rarely switching algorithm $\texttt{RS-GLinCB}$ updates its policy at most $\tilde{O}(\log^2 T)$ times and achieves a regret of $\tilde{O}(\sqrt{T})$. Our approach for removing the dependence on $\kappa$ for generalized linear contextual bandits might be of independent interest.
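For reference, the non-linearity parameter $\kappa$ in this line of work is commonly defined as $\kappa := \sup_{x,\theta} 1/\dot{\mu}(x^\top \theta)$, where $\dot{\mu}$ is the derivative of the link function (e.g., the sigmoid for logistic rewards) and the supremum is over the admissible contexts and parameters; for near-deterministic logistic rewards $\kappa$ can be exponentially large, which is why eliminating it from the regret bounds matters. The paper's exact definition may differ slightly.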
@article{sawarni2024optimal,
author="Sawarni, Ayush, Das, Nirjhar, Barman, Siddharth, and Sinha, Gaurav",
title="Generalized Linear Contextual Bandits with Limited Adaptivity",
journal="arXiv preprint arXiv:2404.06831",
year="2024",
}
Active Preference Optimization for Sample Efficient RLHF
Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano and Sayak Ray Chowdhury
Accepted at ICML 2024 Workshop on Theoretical Foundations of Foundation Models
paper /
abstract /
bibtex
Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models (LLMs) with human preferences. Although aligned generative models have shown remarkable abilities in numerous tasks, their reliance on high-quality human preference data creates a costly bottleneck in the practical application of RLHF. One primary reason is that current methods collect human feedback by picking prompt-generation pairs uniformly at random from a dataset of prompt-generations, resulting in sub-optimal alignment under a constrained budget and highlighting the need for adaptive strategies for efficient alignment. Recent works [Mehta et al. 2023, Muldrew et al. 2024] have tried to address this problem by designing various heuristics based on generation uncertainty. However, the assumptions in [Mehta et al. 2023] are restrictive, while [Muldrew et al. 2024] provides no rigorous theoretical guarantee. To address these issues, we reformulate RLHF within the contextual preference bandit framework, treating prompts as contexts, and develop an active-learning algorithm, $\textit{Active Preference Optimization}$ ($\texttt{APO}$), which significantly enhances model alignment by querying preference data from the most important samples, thus achieving superior performance at a small sample budget. We analyze the theoretical performance guarantees of $\texttt{APO}$ under the Bradley-Terry-Luce (BTL) preference model, showing that the sub-optimality gap of the policy learned via $\texttt{APO}$ scales as $O(1/\sqrt{T})$ for a sample budget of $T$. We also show that collecting preference data by choosing prompts uniformly at random leads to a policy that suffers a constant sub-optimality gap. We perform detailed experimental evaluations on practical preference datasets to validate $\texttt{APO}$'s efficacy over the existing methods, establishing it as a sample-efficient and practical solution for alignment in a cost-effective and scalable manner.
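For completeness, the Bradley-Terry-Luce model referenced above (stated here with illustrative notation that may differ from the paper's) assumes that, given a prompt $x$ and two generations $y_1, y_2$, a labeler prefers $y_1$ with probability $P(y_1 \succ y_2 \mid x) = \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big)$, where $r^*$ is a latent reward function and $\sigma$ is the sigmoid.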
@article{das2024active,
author="Das, Nirjhar,
Chakroborty, Souradip,
Pacchiano, Aldo,
Chowdhury, Sayak Ray",
title="Active Preference Optimization for Sample Efficient RLHF",
journal="arXiv preprint arXiv:2402.10500",
year="2024"
}
Inverse Reinforcement Learning With Constraint Recovery
Nirjhar Das and Arpan Chattopadhyay
Best Paper Award at 10th International Conference on Pattern Recognition and Machine Intelligence (PReMI) 2023 
paper /
slides /
abstract /
bibtex
In this work, we propose a novel inverse reinforcement learning (IRL) algorithm for constrained Markov decision process (CMDP) problems. In standard IRL problems, the inverse learner or agent seeks to recover the reward function of the MDP, given a set of trajectory demonstrations for the optimal policy. In this work, we seek to infer not only the reward function of the CMDP, but also the constraints. Using the principle of maximum entropy, we show that the IRL with constraint recovery (IRL-CR) problem can be cast as a constrained non-convex optimization problem. We reduce it to an alternating constrained optimization problem whose sub-problems are convex, and we use the exponentiated gradient descent algorithm to solve them. Finally, we demonstrate the efficacy of our algorithm in a grid world environment.
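The exponentiated gradient step mentioned above is the standard multiplicative-weights update over the probability simplex. The following is a minimal, illustrative Python sketch of that generic update, not the paper's actual sub-problem or implementation:

import numpy as np

def exponentiated_gradient_step(w, grad, eta):
    # One exponentiated gradient (multiplicative-weights) update:
    # multiply by exp(-eta * gradient), then renormalize onto the simplex.
    w_new = w * np.exp(-eta * grad)
    return w_new / w_new.sum()

# Toy usage: minimize f(w) = ||w - target||^2 over the probability simplex.
target = np.array([0.7, 0.2, 0.1])
w = np.ones(3) / 3
for _ in range(200):
    w = exponentiated_gradient_step(w, grad=2 * (w - target), eta=0.5)
print(np.round(w, 3))  # converges towards the target distribution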
@InProceedings{das2023inverse,
author="Das, Nirjhar
and Chattopadhyay, Arpan",
editor="Maji, Pradipta
and Huang, Tingwen
and Pal, Nikhil R.
and Chaudhury, Santanu
and De, Rajat K.",
title="Inverse Reinforcement Learning with Constraint Recovery",
booktitle="Pattern Recognition and Machine Intelligence",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="179--188"
}
A View Independent Classification Framework for Yoga Postures
Mustafa Chasmai, Nirjhar Das, Aman Bhardwaj and Rahul Garg
Springer Nature Computer Science (SNCS), Vol. 3, 2022 
project /
paper /
abstract /
bibtex
Yoga is a globally acclaimed and widely recommended practice for healthy living. Maintaining correct posture while performing a Yogasana is of utmost importance. In this work, we employ transfer learning from human pose estimation models to extract 136 key-points spread all over the body and use them to train a random forest classifier for classification of the Yogasanas. The results are evaluated on an in-house collected extensive yoga video database of 51 subjects recorded from four different camera angles. We use a three-step scheme for evaluating the generalizability of a Yoga classifier by testing it on (1) unseen frames, (2) unseen subjects, and (3) unseen camera angles. We argue that for most applications, validation accuracies on unseen subjects and unseen camera angles would be most important. We empirically analyze, over three public datasets, the advantages of transfer learning and the possibility of target leakage. We further demonstrate that the classification accuracies critically depend on the cross-validation method employed and can often be misleading. To promote further research, we have made the key-points dataset and code publicly available.
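As a rough illustration of the classification stage described above (the feature layout, data, and hyperparameters below are assumptions for the sketch, not the paper's exact pipeline):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: in the paper, each row would hold a frame's pose key-points
# (e.g., 136 key-points x 2 coordinates = 272 features) extracted by a
# pre-trained human pose estimation model, and y the corresponding asana label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 272))
y = rng.integers(0, 10, size=1000)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:5]))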
@article{chasmai2022view,
title={A View Independent Classification Framework for Yoga Postures},
author={Chasmai, Mustafa and Das, Nirjhar and Bhardwaj, Aman and Garg, Rahul},
journal={Springer Nature Computer Science},
url = {https://doi.org/10.1007/s42979-022-01376-7},
year={2022}
}
Gene expression based inference of cancer drug sensitivity
Smriti Chawla, Anja Rockstroh, Melanie Lehman, Ellca Ratther, Atishay Jain, Anuneet Anand, Apoorva Gupta, Namrata Bhattacharya, Sarita Poonia, Priyadarshini Rai, Nirjhar Das, Angshul Majumdar, Jayadeva, Gaurav Ahuja, Brett G. Hollier, Colleen C. Nelson and Debarka Sengupta
Nature Communications, Vol. 13, 2022 
paper /
abstract /
bibtex
Inter- and intra-tumoral heterogeneity are major stumbling blocks in the treatment of cancer and are responsible for imparting differential drug responses in cancer patients. Recently, the availability of high-throughput screening datasets has paved the way for machine learning based personalized therapy recommendations using the molecular profiles of cancer specimens. In this study, we introduce Precily, a predictive modeling approach to infer treatment response in cancers using gene expression data. In this context, we demonstrate the benefits of considering pathway activity estimates in tandem with drug descriptors as features. We apply Precily to single-cell and bulk RNA sequencing data associated with hundreds of cancer cell lines. We then assess the predictability of treatment outcomes using our in-house prostate cancer cell line and xenograft datasets exposed to differential treatment conditions. Further, we demonstrate the applicability of our approach on patient drug response data from The Cancer Genome Atlas and an independent clinical study describing the treatment journey of three melanoma patients. Our findings highlight the importance of chemo-transcriptomics approaches in cancer treatment selection.
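A schematic of the feature construction described above, with a placeholder regression model (Precily's actual architecture, preprocessing, and data are described in the paper):

import numpy as np
from sklearn.ensemble import RandomForestRegressor  # placeholder model, not Precily's

# Illustrative only: each row pairs one cancer sample with one drug.
# pathway_activity : per-sample pathway activity scores
# drug_descriptors : numeric descriptors of the drug's chemical structure
# response         : measured drug sensitivity for that sample-drug pair
rng = np.random.default_rng(0)
pathway_activity = rng.normal(size=(500, 50))
drug_descriptors = rng.normal(size=(500, 20))
response = rng.normal(size=500)

X = np.concatenate([pathway_activity, drug_descriptors], axis=1)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, response)
print(model.predict(X[:3]))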
@article{Chawla2022,
author={Chawla, Smriti and Rockstroh, Anja and Lehman, Melanie and Ratther, Ellca and Jain, Atishay and Anand, Anuneet and Gupta, Apoorva and Bhattacharya, Namrata and Poonia, Sarita and Rai, Priyadarshini and Das, Nirjhar and Majumdar, Angshul and {Jayadeva} and Ahuja, Gaurav and Hollier, Brett G. and Nelson, Colleen C. and Sengupta, Debarka},
title={Gene expression based inference of cancer drug sensitivity},
journal={Nature Communications},
year={2022},
month={Sep},
day={27},
volume={13},
number={1},
pages={5680},
issn={2041-1723},
doi={10.1038/s41467-022-33291-z},
url={https://doi.org/10.1038/s41467-022-33291-z}
}