Seven Major Directions and Trends in Modern Statistics
- In what follows, I do not cover the analysis of particular data types, e.g., time-to-event data, spatial/spatiotemporal data, time series data, functional data, network data, etc. Analyses of these specific data types (among others) certainly remain very active areas of research. Rather than focusing on any particular type of data, I focus on “broad” modeling frameworks and paradigms. Indeed, all of the aforementioned types of data can be studied through the lens of the frameworks that I describe below (e.g., causal inference). It should also be noted that the frameworks below can be studied through the lens of both frequentist and Bayesian statistics.
- The list below is by no means exhaustive. Indeed, there are many active areas of research in Statistics, some of which I may have missed. If you think that I have missed something important, please feel free to describe it in the comments.
- It is impossible to definitively predict what will be “popular” or “state-of-the-art” in the next few years. In particular, the widespread adoption of artificial intelligence (AI) in the past several years was not something that I could have foreseen years ago. Something new and unpredictable could always take off. However, I believe that the areas below are of broad current interest and will continue to be of enduring interest for the foreseeable future.
1. Generative Modeling and General Transport Problems
The advent of generative models has been a game-changer in many fields, e.g., medical imaging and electronic health records (EHRs). For example, it can be very costly and time-consuming to procure real brain scans and EHR data to study and diagnose disorders such as Alzheimer’s. However, with deep generative models (i.e., generative models parameterized by deep neural networks), it is now possible to generate very high-fidelity synthetic brain images and synthetic patient data using limited training data.

On a surface level, it may not be clear what generative models have to do with statistics. However, most deep generative models are implemented by mapping a reference probability distribution (such as a standard Gaussian) to a target probability distribution (for example, a distribution over images). In statistics, we encounter intractable distributions all the time, e.g., posterior distributions, conditional distributions, latent distributions, bootstrap sampling distributions, etc. It is of interest to adopt the idea of generative modeling to generate novel instances from these complex distributions so that we can learn them implicitly. This problem also lies within the broader framework of general transport problems, where one aims to transform one probability distribution into another. To this end, mathematical tools such as optimal transport have gained considerable attention in the statistics and machine learning literature.
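As a toy illustration of the transport idea, the sketch below pushes samples from a standard Gaussian reference through a closed-form map (the Gaussian CDF composed with an exponential inverse CDF) to obtain draws from an Exponential(1) target. In a deep generative model this map would instead be parameterized by a neural network and learned from data; the target distribution and sample size here are purely illustrative.

```python
import math

import numpy as np

rng = np.random.default_rng(0)

def gaussian_cdf(z):
    # Standard normal CDF, written via the error function.
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def transport_map(z, rate=1.0):
    # Push N(0, 1) samples to Exponential(rate) by composing the
    # Gaussian CDF with the exponential inverse CDF.
    u = gaussian_cdf(z)
    return -np.log(1.0 - u) / rate

z = rng.standard_normal(100_000)   # samples from the reference distribution
x = transport_map(z)               # synthetic draws from the target

print(x.mean())  # close to 1, the mean of Exponential(1)
```

The same one-dimensional recipe (compose a CDF with an inverse CDF) does not scale to images or EHR records, which is precisely why learned, high-dimensional transport maps are an active research area.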
Using synthetic data as a surrogate for real data is also not a new idea. For example, in the analysis and design of many real-world physical systems, it is often preferable to mimic these systems using simulated data and surrogate/computer models rather than to perform actual experiments (e.g., it may not be feasible to expose a real aircraft to very dangerous weather conditions). Generative models represent a new frontier for generating high-fidelity synthetic data. However, in order for synthetic data created by deep generative models to be useful for clinical decision-making and downstream analyses, statistical inference methods are critical for ensuring the trustworthiness of these synthetic data. Thus, a very active area of current research is fidelity quantification, where statistical methods play a crucial role in assessing the fidelity of outputs from deep generative models.
2. Learning from Decentralized and/or Non-Static Datasets
Many “classical” machine learning methods were originally designed for static datasets stored on a single server. However, nowadays, it is increasingly common to encounter non-static datasets and decentralized data (i.e., data that is distributed across many nodes or clients).

A canonical example of non-static data is streaming data, i.e., data that arrives in real time, such as stock market data or clickstream data. The streaming nature of such data demands that these datasets be processed immediately rather than in batches. This requires the development of novel online algorithms and specialized statistical inference methods that capture dependencies in the data, since the i.i.d. assumption is typically inappropriate.
Decentralized data, or data that is not stored on a single server but across multiple nodes, is also very common, especially in light of data privacy concerns. For example, individual hospitals often seek to collaborate on issues like disease detection but do not want to share sensitive patient data. This has led to significant interest in federated learning, where a global statistical model is built by only training a model on local servers and then communicating the updated model parameters (rather than transferring private, local data) back-and-forth with a central server until the global model converges.
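The federated averaging (FedAvg) idea can be sketched on simulated data as follows. The three simulated “hospitals,” the linear model, and the step-size choices are illustrative, not a production protocol; the key point is that clients exchange only parameter vectors, never raw data.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])

# Simulate three "hospitals", each holding private local data.
clients = []
for _ in range(3):
    X = rng.standard_normal((200, 2))
    y = X @ true_w + 0.1 * rng.standard_normal(200)
    clients.append((X, y))

def local_update(w, X, y, lr=0.1, steps=10):
    # Each client refines the current global model on its own data only.
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

w_global = np.zeros(2)
for _ in range(20):
    # Clients send back updated parameters; the server averages them.
    local_ws = [local_update(w_global, X, y) for X, y in clients]
    w_global = np.mean(local_ws, axis=0)

print(w_global)  # close to true_w, without any raw data leaving a client
```

In this homogeneous toy setting the averaged model converges to roughly what a pooled fit would give; the hard research questions arise when the clients' data distributions differ, as discussed under data heterogeneity below.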
3. Learning from Pre-Trained Models
In the early 2010s, there was a huge explosion in compute power. Thus, around the time when I was doing my PhD, there was a very strong emphasis on creating statistical methods and algorithms that could scale to massive datasets. While the design of scalable algorithms is still practically relevant, a new paradigm has taken hold in recent years: instead of training a new model from scratch, which can be very time-consuming, can we leverage a model that was already trained on a large dataset?

This is the premise behind all of the Generative Pre-trained Transformer (GPT) models and foundation models. The idea is that by fine-tuning a previously trained model, an agent such as a large language model or a self-driving car can rapidly perform a downstream task such as generating responses to a new prompt or navigating new traffic situations. The idea of pre-training also has relevance in statistics, where a pre-trained model can often help to accelerate convergence of an algorithm. This is accomplished, for example, by using the pre-trained model as an initialization for the algorithm or by reducing the data requirements needed for learning a particular task (i.e., you may not need a massive amount of data to reliably perform a new task if you can simply tweak an existing fitted model).
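A minimal sketch of the warm-start benefit, assuming a simple least-squares task solved by plain gradient descent: initializing at (a small perturbation of) previously learned weights reaches the same tolerance in far fewer iterations than initializing at zero. All numerical settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 5))
w_true = rng.standard_normal(5)
y = X @ w_true + 0.1 * rng.standard_normal(500)

def fit(w_init, tol=1e-6, lr=0.1, max_iter=10_000):
    # Gradient descent on least squares; returns estimate and iteration count.
    w = w_init.copy()
    for it in range(1, max_iter + 1):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return w, it
        w = w - lr * grad
    return w, max_iter

# "Pre-trained" weights: a model fit earlier on related data, mimicked
# here by the truth plus a small perturbation.
w_pretrained = w_true + 0.01 * rng.standard_normal(5)

w_scratch, iters_scratch = fit(np.zeros(5))
w_warm, iters_warm = fit(w_pretrained)
print(iters_warm, iters_scratch)  # warm start converges in fewer iterations
```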
Beyond the computational gains, pre-trained models also have relevance for knowledge and domain transfer. In recent years, there has been considerable interest in transfer learning. Here, a model trained on one task (i.e., the source task) is reused and tweaked for a second, related task (i.e., the target task). This potentially allows the knowledge gained from the source task to boost performance on the target task, especially if data is limited or scarce in the target domain. Effectively addressing major challenges in transfer learning, e.g., negative transfer, will be an important issue to tackle in the years to come.
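One simple way to make this borrowing of strength concrete is a shrinkage estimator that interpolates between a source-task fit and a scarce-data target-only fit. The mixing weight `alpha` is a hypothetical tuning knob (in practice it would be chosen by cross-validation), and the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
w_target_true = np.array([1.0, -2.0, 0.5])
# Source model: fit earlier on a large related dataset; slightly biased
# for the target task.
w_source = w_target_true + 0.05

# Scarce, noisy target data.
X = rng.standard_normal((10, 3))
y = X @ w_target_true + 2.0 * rng.standard_normal(10)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # target-only fit

alpha = 0.9  # trust placed in the source task (hypothetical choice)
w_transfer = alpha * w_source + (1 - alpha) * w_ols

err_ols = np.linalg.norm(w_ols - w_target_true)
err_transfer = np.linalg.norm(w_transfer - w_target_true)
print(err_transfer, err_ols)  # shrinking toward the source helps here
```

This also hints at negative transfer: if `w_source` were far from the target truth, a large `alpha` would make the transfer estimator worse than the target-only fit.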
4. Learning Under Data Heterogeneity
Heterogeneity is frequently observed in real datasets and violates standard i.i.d. assumptions, which poses significant challenges for “traditional” statistical analyses. Current efforts for handling heterogeneous data include two major research directions: 1) ensuring reliable predictions and statistical inferences when heterogeneity is present, and 2) integrating heterogeneous data sources together to improve estimation and prediction.

Many statistical and machine learning models rely on fitting a model to a training dataset and then using this model to predict on unseen test data. However, if data shift occurs in the test data, then model performance degrades, and the model may produce biased predictions for the test data. For example, a model trained on younger adults may not work as well for predicting a certain health outcome in senior citizens. Handling the various types of data shift, including covariate shift, label shift, and concept shift, is a major direction of current research. These issues are especially pertinent in transfer learning and federated learning, where data heterogeneity can significantly degrade the performance of the target model or global model.
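Covariate shift, the simplest of these shifts, can be corrected by importance weighting when the training and test covariate densities are known or can be estimated. The Gaussian densities and linear outcome below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_density(x, mu):
    # log of the N(mu, 1) density, up to an additive constant
    return -0.5 * (x - mu) ** 2

# Training covariates follow N(0, 1); the test population follows N(1, 1).
x_train = rng.standard_normal(200_000)

def outcome(x):
    # True regression function (illustrative)
    return 1.0 + 2.0 * x

# Self-normalized importance weights p_test(x) / p_train(x).
w = np.exp(log_density(x_train, 1.0) - log_density(x_train, 0.0))
w /= w.sum()

naive = outcome(x_train).mean()           # ignores the covariate shift
weighted = np.sum(w * outcome(x_train))   # targets the test population

# Truth under the test distribution: E[1 + 2X] with X ~ N(1, 1) is 3.
print(naive, weighted)
```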
Besides accounting for data shifts, there is also a great amount of interest in combining heterogeneous data sources and types together to improve model performance. This is also known as data fusion. For example, in evidence-based medicine, a major trend is combining data from randomized clinical trials (RCTs) with real-world data (RWD) to both improve statistical efficiency and account for treatment effect heterogeneity. More generally, multimodal learning aims to integrate data from diverse sources (e.g., combining medical images, text data from clinical notes, and genomic information to predict a disease status) in order to enhance model generalizability, robustness, and predictive accuracy. In many cases, multimodal learning models reduce uncertainty and provide a more comprehensive, context-aware understanding than models that rely on a single data source.
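The simplest form of this kind of evidence synthesis is fixed-effect (inverse-variance) pooling of independent estimates, sketched below with hypothetical RCT and RWD treatment-effect estimates. Real RCT-plus-RWD methods must additionally guard against bias in the RWD, which this toy calculation assumes away.

```python
import numpy as np

def pool(estimates, variances):
    # Fixed-effect (inverse-variance) pooling of independent estimates.
    w = 1.0 / np.asarray(variances)
    est = np.sum(w * np.asarray(estimates)) / w.sum()
    var = 1.0 / w.sum()
    return est, var

# Hypothetical treatment-effect estimates: a small RCT and a large RWD study.
rct_est, rct_var = 1.8, 0.25   # unbiased but imprecise
rwd_est, rwd_var = 2.1, 0.04   # precise; assumed unbiased after adjustment

combined, combined_var = pool([rct_est, rwd_est], [rct_var, rwd_var])
print(round(combined, 3), round(combined_var, 3))  # → 2.059 0.034
```

Note that the pooled variance (1/29 ≈ 0.034) is smaller than either input variance, which is the statistical-efficiency gain the text refers to.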
5. Causal Inference
Whenever we teach simple linear regression, one of the first things we caution students is that “correlation does not equal causation.” However, in many practical scenarios (e.g., evidence-based medicine), a cause-and-effect relationship is of greater interest than an association, especially if we aim to implement an intervention or public policy. We would typically be more interested in whether such an action (or treatment) actually works, rather than whether it is merely associated with a better outcome. Classical regression models can infer associations but not causal effects. Therefore, many statisticians have been working on moving beyond association to causality. In many fields adjacent to statistics, such as econometrics and epidemiology, there has also been a noticeable long-term shift towards using causal inference to understand disease or economic etiologies and to evaluate the effectiveness of interventions and policies.

Randomized controlled trials (RCTs) are the gold standard for determining causal relationships. However, RCTs are frequently infeasible or unethical. As a result, causal inference often needs to be performed using observational data. Causal inference methods attempt to mimic an RCT by using statistical methods to create treatment and control groups that are comparable with respect to all relevant confounders and then inferring the average treatment effect (ATE).
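A minimal simulated example of the idea: when a confounder drives both treatment and outcome, the naive difference in group means is biased, while inverse propensity weighting recovers the ATE. Here the propensity score is known by construction; in practice it must be estimated.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

x = rng.standard_normal(n)                      # confounder
propensity = 1.0 / (1.0 + np.exp(-x))           # true P(T = 1 | X)
t = rng.uniform(size=n) < propensity            # confounded treatment
y = 2.0 * t + 1.5 * x + rng.standard_normal(n)  # true ATE is 2

naive = y[t].mean() - y[~t].mean()              # confounded comparison

# Inverse propensity weighting (IPW) estimate of the ATE.
ipw = np.mean(t * y / propensity - (1 - t) * y / (1 - propensity))

print(naive, ipw)  # naive is biased upward; IPW is close to 2
```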
Driven by the fact that treatment effects often vary significantly across subgroups in a population, there has been a greater emphasis in the past decade on estimating heterogeneous treatment effects (HTEs). This allows practitioners to identify which subpopulations benefit (or are potentially harmed) the most from a treatment. The increased emphasis on HTEs in areas such as precision medicine also coincides with the greater integration of machine learning methods (e.g., random forests or Bayesian additive regression trees) with causal inference. Advanced causal inference methods that address challenges such as treatment non-compliance, interference/spillover (where the treatment of one unit affects the outcomes of other units), and unmeasured or hidden confounding are also being developed. Finally, there is also interest in causal discovery, i.e., uncovering the structure of causal relationships rather than estimating the magnitudes of causal effects.
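A toy version of HTE estimation is the so-called T-learner: fit a separate outcome model in each treatment arm and take their difference as the estimated conditional average treatment effect (CATE). The linear outcome models and simulated effect function below are illustrative stand-ins for the machine learning methods mentioned above.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000

x = rng.uniform(-1, 1, size=n)                  # patient covariate
t = rng.integers(0, 2, size=n)                  # randomized treatment
tau = 1.0 + 2.0 * x                             # true effect varies with x
y = x + tau * t + 0.5 * rng.standard_normal(n)

def fit_line(x_sub, y_sub):
    # Least-squares fit of y ≈ a + b * x as a simple outcome model.
    A = np.column_stack([np.ones_like(x_sub), x_sub])
    return np.linalg.lstsq(A, y_sub, rcond=None)[0]

# T-learner: one outcome model per arm; difference them to get the CATE.
a1, b1 = fit_line(x[t == 1], y[t == 1])
a0, b0 = fit_line(x[t == 0], y[t == 0])

def cate(x_new):
    # Estimated conditional average treatment effect at covariate x_new.
    return (a1 + b1 * x_new) - (a0 + b0 * x_new)

print(cate(-0.5), cate(0.5))  # true CATEs are 0 and 2
```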
6. Dynamic Decision-Making and Reinforcement Learning
Many real-world decisions must be made in a dynamic environment, which requires constant adaptation to the current environment or state. Reinforcement learning provides a framework for optimizing such sequential decision-making processes. In particular, reinforcement learning enables agents to learn optimal behaviors (or an optimal policy) through trial and error and interaction with a dynamic environment so as to maximize long-term cumulative rewards. Given the explosion of precision/personalized medicine, a very active area of research at the intersection of reinforcement learning and causal inference is dynamic treatment regimes (DTRs), where individual treatment decisions are adaptively learned based on the patient’s changing health status over time.

Traditional reinforcement learning has focused on online algorithms, where the agent learns and updates its policy through real-time interaction with the environment. However, in many safety-critical applications such as healthcare and autonomous vehicles, it is very costly and dangerous to allow real-time environmental interactions. Therefore, a significant area of current research is offline reinforcement learning, where RL decision-making agents are trained using pre-collected, static (offline) datasets. However, data shift and data heterogeneity are foundational challenges in offline reinforcement learning: data shift causes overestimation of Q-values due to out-of-distribution actions, while data heterogeneity makes learning a single optimal policy difficult.
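The trial-and-error flavor of online RL can be sketched with tabular Q-learning on a tiny deterministic chain environment; this toy MDP and its parameters are illustrative, not one of the applications above.

```python
import numpy as np

rng = np.random.default_rng(7)

# A tiny deterministic chain MDP: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields reward 1 and ends the episode.
n_states, n_actions, goal = 5, 2, 4

def step(s, a):
    s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == goal else 0.0
    return s2, reward, s2 == goal

Q = np.zeros((n_states, n_actions))
gamma, lr, eps = 0.9, 0.5, 0.2

for episode in range(500):               # learn by repeated interaction
    s, done = 0, False
    while not done:
        # Epsilon-greedy exploration.
        if rng.uniform() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Temporal-difference update toward the Bellman target.
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += lr * (target - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)
print(policy[:goal])  # the learned policy moves right in every non-goal state
```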
Due to the gap between fixed offline data and the dynamic data encountered during online exploration, there are ongoing efforts to bridge these two learning paradigms under the framework of offline-to-online reinforcement learning (O2O RL). O2O RL often improves sample efficiency and mitigates the safety risks of pure online training. By using a pre-trained policy learned from offline data as a starting point, the decision-making agent often requires significantly fewer online interactions to learn an optimal policy when deployed in an online environment. At the same time, online fine-tuning on dynamic data also helps to overcome practical issues such as suboptimal offline data.
7. Extracting Interpretable Insights from Black-Box Models
Black-box models such as tree ensembles (e.g., boosting, random forests, and Bayesian additive regression trees), neural networks, and Gaussian processes are considered the current state-of-the-art for prediction and classification tasks. This is because these methods are very flexible and capable of capturing highly complex, nonlinear, higher-order interactions, while being less prone to the curse of dimensionality than more traditional nonparametric methods. However, these models are largely opaque. As a result, a great deal of effort has been devoted to making them more interpretable.

One avenue for making these models more interpretable is through trustworthy uncertainty quantification. Conformal inference, a very popular method for calibrating black-box models, transforms point predictions into distribution-free prediction sets/intervals with guaranteed finite-sample coverage. Another avenue is through feature selection and feature extraction, e.g., representation learning. Feature selection aims to identify the most relevant individual features. Meanwhile, representation learning (e.g., disentanglement in variational autoencoder models) extracts hidden structure from the data in the form of compact, lower-dimensional, and semantically meaningful latent factors.
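A minimal sketch of split conformal prediction, assuming a fixed black-box predictor and an exchangeable calibration set; the data-generating process and miscoverage level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)

def model(x):
    # Stand-in for any fitted black-box point predictor.
    return 2.0 * x

# Data from y = 2x + noise; hold out a calibration set.
x = rng.uniform(0, 1, size=2000)
y = 2.0 * x + rng.standard_normal(2000)
x_cal, y_cal = x[:1000], y[:1000]
x_test, y_test = x[1000:], y[1000:]

alpha = 0.1
scores = np.abs(y_cal - model(x_cal))        # conformity scores
n = len(scores)
# Split-conformal quantile with a finite-sample correction.
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

lower, upper = model(x_test) - q, model(x_test) + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(coverage)  # at least 1 - alpha marginally; here about 0.9
```

Nothing about the predictor's internals is used, which is why the same recipe wraps around any of the black-box models listed above.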
Finally, merging black-box models with ideas from causal inference can improve the interpretability of black boxes. One popular avenue right now is through counterfactual explanations, or identifying the minimal input changes that would fundamentally alter a model’s prediction. This can provide practitioners with practical insights like providing clear guidance on how to change a future outcome. Counterfactual explanations can also unveil when and whether decisions made by black boxes are due to spurious correlations vs. true causal relationships.
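For a linear model, the minimal-change counterfactual has a closed form (project the input just past the decision boundary); the "credit-scoring" weights below are purely hypothetical.

```python
import numpy as np

# A hypothetical linear credit-scoring model: approve if w @ x + b > 0.
w = np.array([1.5, -2.0, 0.5])
b = -1.0

def counterfactual(x, margin=1e-6):
    # Smallest L2 change that moves x just across the decision boundary
    # (closed form for a linear model: project onto the boundary).
    score = w @ x + b
    delta = -(score + np.sign(score) * margin) * w / (w @ w)
    return x + delta

x = np.array([0.2, 0.8, 1.0])   # a rejected applicant
x_cf = counterfactual(x)

print(w @ x + b)     # negative: rejected
print(w @ x_cf + b)  # just past zero: approved
print(x_cf - x)      # the minimal change that flips the decision
```

For nonlinear black boxes no such closed form exists, and finding minimal, plausible counterfactuals becomes an optimization problem in its own right.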
What’s the role of “classical” statistics, particularly mathematical statistics and probability theory?
Given the vast changes in the field of Statistics, one may wonder what the role of “classical” statistics is, particularly with regard to mathematical statistics and probability theory.

First, tools from “traditional” probability theory and mathematical statistics remain useful in these new paradigms and frameworks. Statistical theory can be used not only to analyze large-sample and finite-sample behavior or generalization error bounds, but also to explain certain phenomena exhibited by many popular models in surprising, and perhaps unexpected, ways. For example, I recently worked on a project where we used functional inequalities from advanced probability and functional analysis to characterize the fundamental behavior of transport maps in deep generative models. It is also important to build upon classical theory to develop new theoretical tools and frameworks.
Second, statisticians tend to be well-versed in statistical inference (e.g., uncertainty quantification and hypothesis testing) and regression models, e.g., linear regression and generalized linear models. These “classical” models, as well as other “classical” machine learning methods, can also be (and indeed have been) adapted to frameworks such as transfer learning, federated learning, and causal inference. We do not need to throw out the old or start completely from scratch. We can adapt existing tools to these new paradigms.
Finally, statisticians are experts at drawing interpretable insights from data. As black-box models and AI models have become much more prominent in decision-making, it will be essential for statisticians to use their expertise in inference to make these models more interpretable, robust, and trustworthy. This in turn can guide more informed decision-making.
How does your list compare with the list from NAS’ ‘Frontiers of Statistics in Science and Engineering: 2035 and Beyond’ release webinar on 2-26-26?
I haven’t seen this. I will have to watch it when I have time, and then I may be able to compare. In any case, my summary has been shared and praised by a few internationally renowned statisticians, so I must have done a decent job summarizing at least some of the current major trends in the field. 😀