


Publications & Preprints

Active Preference Optimization for Sample Efficient RLHF

Nirjhar Das, Souradip Chakroborty, Aldo Pacchiano and Sayak Ray Chowdhury
Accepted at ECML-PKDD 2025 & ICML 2024 TF2M Workshop
paper / abstract / bibtex

Large Language Models (LLMs) aligned using Reinforcement Learning from Human Feedback (RLHF) have shown remarkable generation abilities in numerous tasks. However, collecting high-quality human preferences creates costly bottlenecks in practical deployments, and hence, training data are often budgeted. In these scenarios, it is crucial to collect training data (e.g., contexts, a pair of generations for each context, and a preference indicating which generation is better) carefully, yet most of the existing methods sample contexts uniformly at random from a given collection. Given this, under the Bradley-Terry-Luce preference model and with a small budget of training data, we show that uniform sampling of contexts could lead to a policy (i.e., an aligned model) that suffers a constant sub-optimality gap from the optimal policy. This highlights the need for an adaptive context sampling strategy for effective alignment under a small sample budget. To address this, we reformulate RLHF within the contextual preference bandit framework, treating generations as actions, and give a nearly complete characterization of the sub-optimality gap in terms of both lower and upper bounds. First, when the action set is a $d$-dimensional hypercube and the number of samples is $T$, we show an $\Omega(d/\sqrt{T})$ lower bound. Next, we propose an algorithm, \emph{Active Preference Optimization} (\texttt{APO}), that iteratively collects preferences for the most uncertain contexts. We show that the sub-optimality gap of the policy learned via \texttt{APO} matches the lower bound up to a log factor and a non-linearity constant. Finally, we perform experiments on practical datasets to validate \texttt{APO}'s efficacy over existing methods, establishing it as a sample-efficient and cost-effective solution for LLM alignment.
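
For intuition, the sketch below gives a minimal, hypothetical numpy version of the uncertainty-driven context selection that APO is built around; the featurization (a difference of response features under the Bradley-Terry-Luce model), the variable names, and the rank-one design update are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def select_and_update(Phi, V):
    """Pick the context whose response-pair feature difference is most
    uncertain under the current design matrix V, in the spirit of APO.

    Phi : (n_contexts, d) array of feature differences phi(x, a1) - phi(x, a2)
          (an illustrative encoding; the paper treats a general contextual
          preference bandit, not this exact featurization).
    V   : (d, d) regularized design matrix accumulated so far.
    """
    V_inv = np.linalg.inv(V)
    # Uncertainty of each candidate pair: ||phi||_{V^{-1}}^2
    unc = np.einsum("nd,dk,nk->n", Phi, V_inv, Phi)
    i = int(np.argmax(unc))           # most uncertain context
    V = V + np.outer(Phi[i], Phi[i])  # rank-one design update after querying it
    return i, V

# Toy usage with random features and a small context pool.
rng = np.random.default_rng(0)
d, n = 5, 100
Phi = rng.normal(size=(n, d))
V = np.eye(d)  # lambda * I regularization
for _ in range(10):
    idx, V = select_and_update(Phi, V)
    # Here one would query a (human or simulated) preference for context idx
    # and refit the Bradley-Terry (logistic) MLE on the pairs collected so far.
```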

@inproceedings{das2024active,
title={Active Preference Optimization for Sample Efficient RLHF},
author={Das, Nirjhar and Chakroborty, Souradip and Pacchiano, Aldo and Chowdhury, Sayak Ray},
booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
year={2025},
organization={Springer}
}

Generalized Linear Bandits with Limited Adaptivity

Ayush Sawarni, Nirjhar Das, Siddharth Barman, and Gaurav Sinha
Spotlight at NeurIPS 2024
paper / abstract / bibtex

We study the generalized linear contextual bandit problem within the requirements of limited adaptivity. In this paper, we present two algorithms, $\texttt{B-GLinCB}$ and $\texttt{RS-GLinCB}$, that address, respectively, two prevalent limited adaptivity models: batch learning with stochastic contexts and rare policy switches with adversarial contexts. For both these models, we establish essentially tight regret bounds. Notably, in the obtained bounds, we manage to eliminate a dependence on a key parameter $\kappa$, which captures the non-linearity of the underlying reward model. For our batch learning algorithm $\texttt{B-GLinCB}$, with $\Omega\left( \log{\log T} \right)$ batches, the regret scales as $\tilde{O}(\sqrt{T})$. Further, we establish that our rarely switching algorithm $\texttt{RS-GLinCB}$ updates its policy at most $\tilde{O}(\log^2 T)$ times and achieves a regret of $\tilde{O}(\sqrt{T})$. Our approach for removing the dependence on $\kappa$ for generalized linear contextual bandits might be of independent interest.
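
As a rough illustration of the limited-adaptivity idea, here is a small sketch of a determinant-doubling switching test, a standard device in rarely-switching bandit algorithms; it is shown only as an assumed stand-in and may differ from the exact criterion used by $\texttt{RS-GLinCB}$.

```python
import numpy as np

class RareSwitchTrigger:
    """Refit the policy only when the design matrix has gained enough new
    information, via a determinant-doubling test. This is a generic
    rarely-switching device used here for illustration; the actual
    switching rule of RS-GLinCB may differ."""

    def __init__(self, d, reg=1.0, factor=2.0):
        self.V = reg * np.eye(d)
        self.logdet_at_last_switch = np.linalg.slogdet(self.V)[1]
        self.log_factor = np.log(factor)

    def observe(self, x):
        self.V += np.outer(x, x)

    def should_switch(self):
        logdet = np.linalg.slogdet(self.V)[1]
        if logdet - self.logdet_at_last_switch > self.log_factor:
            self.logdet_at_last_switch = logdet
            return True   # refit the GLM estimate / policy now
        return False

# Toy run: count how many policy updates happen over T rounds.
rng = np.random.default_rng(1)
trigger, switches = RareSwitchTrigger(d=10), 0
for t in range(10_000):
    trigger.observe(rng.normal(size=10))
    switches += trigger.should_switch()
print("policy switches:", switches)   # scales roughly with d*log(T), not T
```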

@inproceedings{sawarni2024Generalized,
author = {Sawarni, Ayush and Das, Nirjhar and Barman, Siddharth and Sinha, Gaurav},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages = {8329--8369},
publisher = {Curran Associates, Inc.},
title = {Generalized Linear Bandits with Limited Adaptivity},
url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/0faa0019b0a8fcab8e6476bc43078e2e-Paper-Conference.pdf},
volume = {37},
year = {2024}
}

Linear Contextual Bandits with Hybrid Payoffs: Revisited

Nirjhar Das and Gaurav Sinha
Accepted at ECML-PKDD 2024
paper / abstract / bibtex

We study the Linear Contextual Bandit ($\texttt{LinearCB}$) problem in the hybrid reward setting. In this setting, every arm's reward model contains arm-specific parameters in addition to parameters shared across the reward models of all the arms. We can easily reduce this setting to two closely related settings: (a) Shared, with no arm-specific parameters, and (b) Disjoint, with only arm-specific parameters, enabling the application of two popular state-of-the-art algorithms, $\texttt{LinUCB}$ and $\texttt{DisLinUCB}$ (proposed as Algorithm $1$ in Li et al. 2010). When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both $\texttt{LinUCB}$ and $\texttt{DisLinUCB}$ that significantly improve upon the known regret guarantees of these algorithms. Our novel analysis critically exploits the structure of the hybrid rewards and the diversity of the arm features. Along with proving these new guarantees, we introduce a new algorithm, $\texttt{HyLinUCB}$, that crucially modifies $\texttt{LinUCB}$ (using a new exploration coefficient) to account for sparsity in the hybrid setting. Under the same diversity assumptions, we prove that at the end of $T$ rounds, $\texttt{HyLinUCB}$ also incurs only $\tilde{O}(\sqrt{T})$ regret. We perform extensive experiments on synthetic and real-world datasets demonstrating the strong empirical performance of $\texttt{HyLinUCB}$. When the number of arm-specific parameters is much larger than the number of shared parameters, we observe that $\texttt{DisLinUCB}$ incurs the lowest regret. In this case, the regret of $\texttt{HyLinUCB}$ is the second best and is extremely competitive with $\texttt{DisLinUCB}$. In all other situations, including our real-world dataset, $\texttt{HyLinUCB}$ has significantly lower regret than $\texttt{LinUCB}$, $\texttt{DisLinUCB}$ and other state-of-the-art baselines we considered. We also empirically observe that the regret of $\texttt{HyLinUCB}$ grows much slower with the number of arms $K$, compared to baselines, making it suitable even for very large action spaces.
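
The sketch below illustrates the hybrid-payoff structure by embedding shared and arm-specific features into one long vector and scoring arms with a generic LinUCB-style index; the specific exploration coefficient that defines $\texttt{HyLinUCB}$ is not reproduced here, and all names and dimensions are illustrative assumptions.

```python
import numpy as np

def hybrid_features(z_shared, x_arm, arm, n_arms):
    """Embed a hybrid-payoff arm into one long vector:
    [ shared block | zeros ... | arm-specific block | ... zeros ].
    This is the standard reduction of the hybrid reward model to a single
    linear model; HyLinUCB's exploration bonus is not reproduced."""
    d_shared, d_arm = len(z_shared), len(x_arm)
    phi = np.zeros(d_shared + n_arms * d_arm)
    phi[:d_shared] = z_shared
    start = d_shared + arm * d_arm
    phi[start:start + d_arm] = x_arm
    return phi

def linucb_score(phi, V_inv, theta_hat, alpha=1.0):
    # Optimistic index: estimated reward plus alpha * ||phi||_{V^{-1}}.
    return phi @ theta_hat + alpha * np.sqrt(phi @ V_inv @ phi)

# Toy usage: score 3 arms that share one context vector.
rng = np.random.default_rng(0)
d_shared, d_arm, n_arms = 4, 3, 3
dim = d_shared + n_arms * d_arm
V_inv, theta_hat = np.eye(dim), rng.normal(size=dim)
z = rng.normal(size=d_shared)
scores = [linucb_score(hybrid_features(z, rng.normal(size=d_arm), a, n_arms),
                       V_inv, theta_hat) for a in range(n_arms)]
print(int(np.argmax(scores)))   # arm chosen this round
```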

@inproceedings{das2024linear,
title={Linear Contextual Bandits with Hybrid Payoff: Revisited},
author={Das, Nirjhar and Sinha, Gaurav},
booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
pages={441--455},
year={2024},
organization={Springer}
}

Inverse Reinforcement Learning With Constraint Recovery

Nirjhar Das and Arpan Chattopadhyay
Best Paper Award at the 10th International Conference on Pattern Recognition and Machine Intelligence (PReMI) 2023
paper / slides / abstract / bibtex

In this work, we propose a novel inverse reinforcement learning (IRL) algorithm for constrained Markov decision process (CMDP) problems. In standard IRL problems, the inverse learner or agent seeks to recover the reward function of the MDP, given a set of trajectory demonstrations for the optimal policy. In this work, we seek to infer not only the reward function of the CMDP, but also the constraints. Using the principle of maximum entropy, we show that the IRL with constraint recovery (IRL-CR) problem can be cast as a constrained non-convex optimization problem. We reduce it to an alternating constrained optimization problem whose sub-problems are convex, and use the exponentiated gradient descent algorithm to solve it. Finally, we demonstrate the efficacy of our algorithm in a grid-world environment.
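
For concreteness, here is a minimal sketch of a single exponentiated gradient step on the probability simplex, the generic building block used to solve such convex sub-problems; the IRL-CR objective and constraints themselves are not reproduced, and the toy problem below is purely illustrative.

```python
import numpy as np

def exponentiated_gradient_step(w, grad, eta=0.1):
    """One exponentiated gradient (entropic mirror descent) step that keeps
    w on the probability simplex. Shown as a generic building block; the
    actual IRL-CR sub-problem is not reproduced here."""
    w_new = w * np.exp(-eta * grad)
    return w_new / w_new.sum()

# Toy usage: minimize <c, w> over the simplex (gradient is the constant c).
c = np.array([3.0, 1.0, 2.0])
w = np.ones_like(c) / len(c)
for _ in range(200):
    w = exponentiated_gradient_step(w, c)
print(np.round(w, 3))   # mass concentrates on the smallest-cost coordinate
```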

@InProceedings{das2023inverse,
author="Das, Nirjhar and Chattopadhyay, Arpan",
editor="Maji, Pradipta and Huang, Tingwen and Pal, Nikhil R. and Chaudhury, Santanu and De, Rajat K.",
title="Inverse Reinforcement Learning with Constraint Recovery",
booktitle="Pattern Recognition and Machine Intelligence",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="179--188"
}

A View Independent Classification Framework for Yoga Postures

Mustafa Chasmai, Nirjhar Das, Aman Bhardwaj and Rahul Garg
Springer Nature Computer Science (SNCS), Vol. 3, 2022 
project / paper / abstract / bibtex

Yoga is a globally acclaimed and widely recommended practice for healthy living. Maintaining correct posture while performing a Yogasana is of utmost importance. In this work, we employ transfer learning from human pose estimation models to extract 136 key-points spread all over the body and train a random forest classifier, which is used for classification of the Yogasanas. The results are evaluated on an in-house collected extensive yoga video database of 51 subjects recorded from four different camera angles. We use a three-step scheme for evaluating the generalizability of a Yoga classifier by testing it on (1) unseen frames, (2) unseen subjects, and (3) unseen camera angles. We argue that, for most applications, validation accuracies on unseen subjects and unseen camera angles are the most important. Over three public datasets, we empirically analyze the advantage of transfer learning and the possibility of target leakage. We further demonstrate that the classification accuracies critically depend on the cross-validation method employed and can often be misleading. To promote further research, we have made the key-points dataset and code publicly available.
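
The snippet below sketches the unseen-subject part of the evaluation scheme with a grouped cross-validation split; the synthetic features, dimensions, and classifier settings are placeholders for illustration, not the paper's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic stand-in: each row plays the role of a per-frame key-point feature
# vector (dimension chosen arbitrarily), each label an asana class, and each
# group the subject id, so that validation folds contain only unseen subjects.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 272))
y = rng.integers(0, 10, size=1000)
subjects = rng.integers(0, 51, size=1000)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# Frame-level CV is optimistic because frames of one subject leak across folds;
# subject-level CV below is closer to deployment on unseen people.
scores = cross_val_score(clf, X, y, groups=subjects, cv=GroupKFold(n_splits=5))
print("unseen-subject accuracy:", scores.mean())
```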

@article{chasmai2022view,
title={A View Independent Classification Framework for Yoga Postures},
author={Chasmai, Mustafa and Das, Nirjhar and Bhardwaj, Aman and Garg, Rahul},
journal={Springer Nature Computer Science},
url = {https://doi.org/10.1007/s42979-022-01376-7},
year={2022}
}

Gene expression based inference of cancer drug sensitivity

Smriti Chawla, Anja Rockstroh, Melanie Lehman, Ellca Ratther, Atishay Jain, Anuneet Anand, Apoorva Gupta, Namrata Bhattacharya, Sarita Poonia, Priyadarshini Rai, Nirjhar Das, Angshul Majumdar, Jayadeva, Gaurav Ahuja, Brett G. Hollier, Colleen C. Nelson and Debarka Sengupta
Nature Communications, Vol. 13, 2022 
paper / abstract / bibtex

Inter- and intra-tumoral heterogeneity are major stumbling blocks in the treatment of cancer and are responsible for imparting differential drug responses in cancer patients. Recently, the availability of high-throughput screening datasets has paved the way for machine learning based personalized therapy recommendations using the molecular profiles of cancer specimens. In this study, we introduce Precily, a predictive modeling approach to infer treatment response in cancers using gene expression data. In this context, we demonstrate the benefits of considering pathway activity estimates in tandem with drug descriptors as features. We apply Precily on single-cell and bulk RNA sequencing data associated with hundreds of cancer cell lines. We then assess the predictability of treatment outcomes using our in-house prostate cancer cell line and xenografts datasets exposed to differential treatment conditions. Further, we demonstrate the applicability of our approach on patient drug response data from The Cancer Genome Atlas and an independent clinical study describing the treatment journey of three melanoma patients. Our findings highlight the importance of chemo-transcriptomics approaches in cancer treatment selection.
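
As a schematic of the feature pairing behind this approach, the sketch below concatenates pathway-activity scores with drug descriptors and fits a generic regressor; Precily's actual model, features, and data are not reproduced, and every name and dimension here is a placeholder.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative stand-in for the feature design: each training example pairs the
# pathway-activity profile of a cell line with the descriptor vector of a drug,
# and the target is the measured response (e.g. AUC). Dimensions are arbitrary.
rng = np.random.default_rng(0)
n_pairs, n_pathways, n_descriptors = 500, 50, 20
pathway_activity = rng.normal(size=(n_pairs, n_pathways))
drug_descriptors = rng.normal(size=(n_pairs, n_descriptors))
response = rng.normal(size=n_pairs)

X = np.hstack([pathway_activity, drug_descriptors])  # cell-line x drug features
model = GradientBoostingRegressor().fit(X, response)
print(model.predict(X[:3]))  # predicted responses for the first three pairs
```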

@article{Chawla2022,
author={Chawla, Smriti and Rockstroh, Anja and Lehman, Melanie and Ratther, Ellca and Jain, Atishay and Anand, Anuneet and Gupta, Apoorva and Bhattacharya, Namrata and Poonia, Sarita and Rai, Priyadarshini and Das, Nirjhar and Majumdar, Angshul and {Jayadeva} and Ahuja, Gaurav and Hollier, Brett G. and Nelson, Colleen C. and Sengupta, Debarka},
title={Gene expression based inference of cancer drug sensitivity},
journal={Nature Communications},
year={2022},
month={Sep},
day={27},
volume={13},
number={1},
pages={5680},
issn={2041-1723},
doi={10.1038/s41467-022-33291-z},
url={https://doi.org/10.1038/s41467-022-33291-z}
}