<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Aldo Pacchiano</title>
    <link>/authors/admin/</link>
      <atom:link href="/authors/admin/index.xml" rel="self" type="application/rss+xml" />
    <description>Aldo Pacchiano</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Wed, 21 Jan 2026 00:00:00 +0000</lastBuildDate>
    <image>
      <url>/img/icon.png</url>
      <title>Aldo Pacchiano</title>
      <link>/authors/admin/</link>
    </image>
    
    <item>
      <title>Autonomous Discovery from Data</title>
      <link>/post/autonomousdiscovery/</link>
      <pubDate>Wed, 21 Jan 2026 00:00:00 +0000</pubDate>
      <guid>/post/autonomousdiscovery/</guid>
      <description>&lt;h2 id=&#34;overview&#34;&gt;Overview&lt;/h2&gt;
&lt;p&gt;Sequential decision-making algorithms interact with an environment with the objective of learning a policy that improves with data. Examples include bandit problems, where one wants to compete against the best arm in hindsight; reinforcement learning domains, where finding an optimal policy is paramount; and active learning settings, where the goal is to produce an accurate model of the world in as few online interactions as possible. These models are applied in scenarios such as robotics and experiment design in scientific domains. Designing sequential decision-making algorithms typically involves a careful and painstaking modeling process that requires deep domain knowledge as well as mathematical proficiency in the techniques of sequential decision-making. In these works we explore an alternative way of producing algorithms for such domains: we use the in-context learning abilities of transformer models to encode history-dependent policies (i.e., algorithms) that are learned from data by minimizing an appropriate loss. In our work &lt;a href=&#34;https://arxiv.org/abs/2306.14892&#34;&gt;&lt;strong&gt;Supervised Pretraining Can Learn In-Context Reinforcement Learning&lt;/strong&gt;&lt;/a&gt; we design the &lt;strong&gt;Decision Pretrained Transformer (DPT)&lt;/strong&gt;, a supervised learning strategy over offline data that can be used to recover the Thompson Sampling algorithm. We expand on this line of work in our paper &lt;a href=&#34;https://arxiv.org/abs/2506.01876&#34;&gt;&lt;strong&gt;Learning to Explore: An In-Context Learning Approach for Pure Exploration&lt;/strong&gt;&lt;/a&gt;, where we introduce the &lt;strong&gt;In-Context Pure Exploration (ICPE)&lt;/strong&gt; algorithm; here, learning is done via online reinforcement learning.
ICPE is able to recover a Bayes-optimal discovery strategy for sequential hypothesis testing adapted to specific problem families. Learning how to autonomously discover is an exciting area of research that our research group is proudly pioneering.&lt;/p&gt;
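&lt;p&gt;For concreteness, the strategy that DPT is shown to recover can be sketched directly. Below is a minimal, hypothetical Beta-Bernoulli instantiation of Thompson Sampling; every name and parameter is illustrative rather than taken from the papers.&lt;/p&gt;

```python
import random

def thompson_sampling(arm_means, horizon, rng):
    """Beta-Bernoulli Thompson Sampling on a simulated bandit instance."""
    k = len(arm_means)
    succ = [1] * k  # Beta(1, 1) prior pseudo-counts per arm
    fail = [1] * k
    pulls = [0] * k
    for _ in range(horizon):
        # Sample a mean from each arm's posterior and play the argmax.
        sampled = [rng.betavariate(succ[i], fail[i]) for i in range(k)]
        a = max(range(k), key=lambda i: sampled[i])
        reward = 1 if arm_means[a] > rng.random() else 0
        pulls[a] += 1
        if reward:
            succ[a] += 1
        else:
            fail[a] += 1
    return pulls

counts = thompson_sampling([0.2, 0.8], horizon=500, rng=random.Random(0))
```

&lt;p&gt;In a seeded run the higher-mean arm receives the large majority of pulls; this posterior-sampling behavior is the kind of history-dependent policy that supervised pretraining is shown to recover in context.&lt;/p&gt;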
&lt;h2 id=&#34;papers&#34;&gt;Papers&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2306.14892&#34;&gt;&lt;strong&gt;Supervised Pretraining Can Learn In-Context Reinforcement Learning&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2506.01876&#34;&gt;&lt;strong&gt;Learning to Explore: An In-Context Learning Approach for Pure Exploration&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>The Dissimilarity Dimension</title>
      <link>/post/dissimilarity/</link>
      <pubDate>Fri, 27 Sep 2024 00:00:00 +0000</pubDate>
      <guid>/post/dissimilarity/</guid>
      <description>&lt;h2 id=&#34;overview&#34;&gt;Overview&lt;/h2&gt;
&lt;p&gt;For some time the Eluder dimension has been used to bound the regret of optimistic algorithms in function approximation regimes. We introduce the dissimilarity dimension, a combinatorial statistical dimension that yields tighter bounds and can be used to characterize the statistical complexity of optimistic algorithms in structured bandit problems. Its definition is based on maximal sequences that achieve a good historical fit while also satisfying what we call a large self-evaluation property, which is connected to optimism. Using a graph-theoretic argument, we prove that optimistic least squares achieves a regret bound depending on this statistical dimension that is sharper than the typical Eluder dimension bound.&lt;/p&gt;
&lt;h2 id=&#34;paper&#34;&gt;Paper&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2306.06184&#34;&gt;&lt;strong&gt;A Unified Model and Dimension for Interactive Estimation&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Neural Optimism for Genetic Perturbation Experiments</title>
      <link>/post/neuralperturbation/</link>
      <pubDate>Wed, 10 Aug 2022 00:00:00 +0000</pubDate>
      <guid>/post/neuralperturbation/</guid>
      <description>&lt;h2 id=&#34;overview&#34;&gt;Overview&lt;/h2&gt;
&lt;p&gt;The problem of how to genetically modify cells in order to maximize a certain cellular phenotype has taken center stage in drug development over the last few years (with, for example, genetically edited CAR-T, CAR-NK, and CAR-NKT cells entering cancer clinical trials). Exhausting the search space of all possible genetic edits (perturbations), or combinations thereof, is infeasible due to cost and experimental limitations. This work provides a theoretically sound framework for iteratively exploring the space of perturbations in pooled batches in order to maximize a target phenotype under an experimental budget. Inspired by this application domain, we study the problem of batch query bandit optimization and introduce the Optimistic Arm Elimination (OAE) principle, designed to find an almost optimal arm under different functional relationships between the queries (arms) and the outputs (rewards). We study several tractable mechanisms for optimistic gene perturbation discovery that use neural network function approximation, analyze the convergence properties of OAE by relating it to the Eluder dimension of the algorithm&amp;rsquo;s function class, and validate that OAE outperforms other strategies in finding optimal actions in experiments on simulated problems, on public datasets well studied in bandit contexts, and on genetic perturbation datasets where the regression model is a deep neural network. OAE also outperforms the benchmark algorithms on 3 of the 4 datasets in the GeneDisco experimental planning challenge. Our algorithms can be applied to problems beyond the setting of genetic perturbation experiments.&lt;/p&gt;
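&lt;p&gt;The OAE principle can be illustrated in its simplest form. The sketch below is a hypothetical finite-arm instantiation with scalar mean estimates and confidence-bound elimination; the actual framework operates over general function classes, including deep neural networks.&lt;/p&gt;

```python
import math
import random

def optimistic_arm_elimination(arm_means, batch_size, rounds, rng):
    # Hypothetical finite-arm instantiation of the OAE principle:
    # keep a survivor set, query all survivors in pooled batches, and
    # eliminate any arm whose optimistic value (UCB) falls below the
    # best pessimistic value (LCB) among the survivors.
    k = len(arm_means)
    pulls = [0] * k
    total = [0.0] * k
    alive = set(range(k))
    for _ in range(rounds):
        for a in alive:
            for _ in range(batch_size):
                reward = 1.0 if arm_means[a] > rng.random() else 0.0
                pulls[a] += 1
                total[a] += reward
        bonus = {a: math.sqrt(2.0 * math.log(rounds * k) / pulls[a]) for a in alive}
        ucb = {a: total[a] / pulls[a] + bonus[a] for a in alive}
        lcb = {a: total[a] / pulls[a] - bonus[a] for a in alive}
        best_lcb = max(lcb.values())
        alive = {a for a in alive if ucb[a] >= best_lcb}
    return max(alive, key=lambda a: total[a] / pulls[a])

best = optimistic_arm_elimination([0.2, 0.5, 0.9], batch_size=5, rounds=40,
                                  rng=random.Random(0))
```

&lt;p&gt;In the full setting, the scalar mean estimate and additive bonus are replaced by an optimistic fit over a regression function class, with convergence controlled by that class&amp;rsquo;s Eluder dimension.&lt;/p&gt;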
&lt;h2 id=&#34;paper&#34;&gt;Paper&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2207.12805&#34;&gt;&lt;strong&gt;Neural Optimism for Genetic Perturbation Experiments&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Model Selection for Contextual Bandits and Reinforcement Learning</title>
      <link>/post/modelselection/</link>
      <pubDate>Mon, 27 Jun 2022 00:00:00 +0000</pubDate>
      <guid>/post/modelselection/</guid>
      <description>&lt;h2 id=&#34;overview&#34;&gt;Overview&lt;/h2&gt;
&lt;p&gt;In many domains, ranging from internet commerce to robotics and computational biology, algorithms have been developed that make decisions with the objective of maximizing a reward while learning how to make better decisions in the future. In pursuit of this objective, a vast literature on Bandit and Reinforcement Learning algorithms has arisen. Although in most practical applications the precise nature of the problem faced by the learner is not known in advance, most of this work has focused on designing algorithms with provable regret guarantees under specific modeling assumptions. Far less work has addressed model selection, where the objective is to design algorithms that can select, in an online fashion, the most suitable algorithm among a set of candidates for a specific problem instance.&lt;/p&gt;
&lt;h2 id=&#34;papers&#34;&gt;Papers&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2003.01704&#34;&gt;&lt;strong&gt;Model Selection in Contextual Stochastic Bandit Problems&lt;/strong&gt;&lt;/a&gt; - In this work we introduce the Stochastic CORRAL algorithm for model selection in stochastic contextual bandit problems and RL. The Stochastic CORRAL algorithm makes use of a master algorithm based on the &lt;a href=&#34;https://arxiv.org/abs/1612.06246&#34;&gt;CORRAL&lt;/a&gt; master. We also prove initial lower bounds showing the impossibility of deriving model selection results in a variety of settings.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2011.09750&#34;&gt;&lt;strong&gt;Online Model Selection for Reinforcement Learning with Function Approximation&lt;/strong&gt;&lt;/a&gt; - We introduce the Explore-Commit-Eliminate (&lt;strong&gt;ECE&lt;/strong&gt;) algorithm for model selection. This algorithm is suitable for model selection between stochastic bandit and reinforcement learning algorithms. &lt;strong&gt;ECE&lt;/strong&gt; is based on a simple misspecification test coupled with a play schedule reminiscent of epsilon greedy algorithms.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2012.13045&#34;&gt;&lt;strong&gt;Regret Bound Balancing and Elimination for Model Selection in Bandits and RL&lt;/strong&gt;&lt;/a&gt; - This work proposes the Regret Bound Balancing and Elimination Algorithm (&lt;strong&gt;RBBE&lt;/strong&gt;). This algorithm can be used for model selection among stochastic bandit and Reinforcement Learning algorithms. Similar to the &lt;strong&gt;ECE&lt;/strong&gt; algorithm, &lt;strong&gt;RBBE&lt;/strong&gt; is based on a simple misspecification test, in this case coupled with an algorithm play schedule based on &lt;strong&gt;balancing&lt;/strong&gt; the base learners&amp;rsquo; regret bounds. Additionally, we show this approach can be extended to the setting of adversarially generated contexts in stochastic contextual linear bandits.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://proceedings.mlr.press/v139/cutkosky21a/cutkosky21a.pdf&#34;&gt;&lt;strong&gt;Dynamic Balancing for Model Selection in Bandits and RL&lt;/strong&gt;&lt;/a&gt; - This work extends and refines the &lt;strong&gt;RBBE&lt;/strong&gt; approach, shaving polynomial factors in the number of models from the resulting model selection regret upper bounds.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-164.pdf&#34;&gt;&lt;strong&gt;Model Selection for Contextual Bandits and Reinforcement Learning&lt;/strong&gt;&lt;/a&gt; - PhD Thesis. A compendium of the results introduced above.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2206.14912&#34;&gt;&lt;strong&gt;Best of Both Worlds Model Selection&lt;/strong&gt;&lt;/a&gt; - This work extends the regret balancing technology to adversarial bandits. It also explores whether it is possible to obtain model selection and best-of-both-worlds guarantees between adversarial and stochastic bandit scenarios.&lt;/li&gt;
&lt;/ul&gt;
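&lt;p&gt;The balancing-plus-elimination idea shared by &lt;strong&gt;RBBE&lt;/strong&gt; and its refinements can be sketched in a few lines. Everything below is an illustrative toy (deterministic base learners and an ad hoc confidence slack), not the papers&amp;rsquo; actual tests.&lt;/p&gt;

```python
import math

def regret_balancing(base_step, claimed_bounds, horizon):
    # Hypothetical sketch of the balancing principle: always play the live
    # base learner whose claimed cumulative regret bound is currently
    # smallest, and eliminate any learner whose empirical average reward
    # contradicts its own claimed bound (a misspecification test).
    k = len(base_step)
    plays = [0] * k
    total = [0.0] * k
    alive = set(range(k))
    for _ in range(horizon):
        i = min(alive, key=lambda j: claimed_bounds[j](plays[j] + 1))
        total[i] += base_step[i]()  # one step of base learner i
        plays[i] += 1
        if all(plays[j] > 0 for j in alive):
            best_avg = max(total[j] / plays[j] for j in alive)
            alive = {
                j for j in alive
                if total[j] / plays[j]
                   + claimed_bounds[j](plays[j]) / plays[j]
                   + math.sqrt(math.log(horizon) / plays[j])
                   >= best_avg
            }
    return max(alive, key=lambda j: total[j] / plays[j])

# Two toy base learners with sqrt(n) claimed regret bounds: one well
# specified (average reward 0.9) and one misspecified (0.3).
winner = regret_balancing([lambda: 0.9, lambda: 0.3],
                          [math.sqrt, math.sqrt], horizon=300)
```

&lt;p&gt;The misspecified learner is eliminated once its average reward plus the regret its bound permits falls short of the best empirical average, after which all remaining plays go to the well-specified learner.&lt;/p&gt;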
</description>
    </item>
    
    <item>
      <title>Beyond the Standard Assumptions in Reinforcement Learning</title>
      <link>/post/beyondstandardrl/</link>
      <pubDate>Sun, 27 Mar 2022 00:00:00 +0000</pubDate>
      <guid>/post/beyondstandardrl/</guid>
      <description>&lt;h2 id=&#34;overview&#34;&gt;Overview&lt;/h2&gt;
&lt;p&gt;We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once, at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner&amp;rsquo;s complete trajectory was &amp;ldquo;good&amp;rdquo; or &amp;ldquo;bad&amp;rdquo; than to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret. We also study the corresponding Dueling Reinforcement Learning setting, where the learner&amp;rsquo;s feedback comes in the form of noisy binary comparisons between trajectories.&lt;/p&gt;
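&lt;p&gt;As a toy illustration of this feedback model, consider binary trajectory labels drawn from a logistic model of a linear trajectory score; the feature map, weights, and function names below are hypothetical.&lt;/p&gt;

```python
import math
import random

def episode_label(traj_features, theta, rng):
    # Hypothetical once-per-episode feedback: the learner observes a single
    # binary "good"/"bad" label for the whole trajectory, drawn from a
    # logistic model of an unknown linear score of trajectory features.
    score = sum(w * x for w, x in zip(theta, traj_features))
    p_good = 1.0 / (1.0 + math.exp(-score))
    return 1 if p_good > rng.random() else 0

rng = random.Random(0)
# A trajectory whose (hypothetical) features align with the unknown theta
# should be labeled "good" most of the time.
labels = [episode_label([1.0, 0.5], [3.0, 2.0], rng) for _ in range(200)]
```

&lt;p&gt;The learner never sees per-step rewards, only such end-of-episode labels, and must estimate the unknown parameters from them.&lt;/p&gt;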
&lt;h2 id=&#34;papers&#34;&gt;Papers&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2105.14363&#34;&gt;&lt;strong&gt;On the Theory of Reinforcement Learning with Once-per-Episode Feedback&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2111.04850&#34;&gt;&lt;strong&gt;Dueling RL: Reinforcement Learning with Trajectory Preferences&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
  </channel>
</rss>
