Statistics
 [1] arXiv:2406.13036 [pdf, html, other]

Title: Sharp detection of low-dimensional structure in probability measures via dimensional logarithmic Sobolev inequalities
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Computation (stat.CO)
Identifying low-dimensional structure in high-dimensional probability measures is an essential preprocessing step for efficient sampling. We introduce a method for identifying and approximating a target measure $\pi$ as a perturbation of a given reference measure $\mu$ along a few significant directions of $\mathbb{R}^{d}$. The reference measure can be a Gaussian or a nonlinear transformation of a Gaussian, as commonly arising in generative modeling. Our method extends prior work on minimizing majorizations of the Kullback-Leibler divergence to identify optimal approximations within this class of measures. Our main contribution unveils a connection between the \emph{dimensional} logarithmic Sobolev inequality (LSI) and approximations with this ansatz. Specifically, when the target and reference are both Gaussian, we show that minimizing the dimensional LSI is equivalent to minimizing the KL divergence restricted to this ansatz. For general non-Gaussian measures, the dimensional LSI produces majorants that uniformly improve on previous majorants for gradient-based dimension reduction. We further demonstrate the applicability of this analysis to the squared Hellinger distance, where analogous reasoning shows that the dimensional Poincaré inequality offers improved bounds.
 [2] arXiv:2406.13052 [pdf, html, other]

Title: Distance Covariance, Independence, and Pairwise Differences
Subjects: Methodology (stat.ME)
(To appear in The American Statistician.) Distance covariance (Székely, Rizzo, and Bakirov, 2007) is a fascinating recent notion, which is popular as a test for dependence of any type between random variables $X$ and $Y$. This approach deserves to be touched upon in modern courses on mathematical statistics. It makes use of distances of the type $|X-X'|$ and $|Y-Y'|$, where $(X',Y')$ is an independent copy of $(X,Y)$. This raises natural questions about independence of variables like $X-X'$ and $Y-Y'$, about the connection between Cov$(|X-X'|,|Y-Y'|)$ and the covariance between doubly centered distances, and about necessary and sufficient conditions for independence. We show some basic results and present a new and non-technical counterexample to a common fallacy, which provides more insight. We also show some motivating examples involving bivariate distributions and contingency tables, which can be used as didactic material for introducing distance correlation.
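As a concrete illustration of the doubly centered distances mentioned above, here is a minimal numpy sketch of the sample distance covariance of Székely, Rizzo, and Bakirov (2007) for univariate $X$ and $Y$; the function name and toy data are ours, not the paper's.

```python
import numpy as np

def distance_covariance(x, y):
    # Pairwise distances |X - X'| and |Y - Y'|.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double centering: subtract row and column means, add back the grand mean.
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return np.sqrt(max((A * B).mean(), 0.0))  # sample dCov_n(X, Y)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(distance_covariance(x, x**2))                  # nonlinear dependence: clearly positive
print(distance_covariance(x, rng.normal(size=500)))  # independence: near zero
```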
 [3] arXiv:2406.13111 [pdf, html, other]

Title: Nonparametric Motion Control in Functional Connectivity Studies in Children with Autism Spectrum Disorder
Subjects: Methodology (stat.ME)
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition associated with difficulties with social interactions, communication, and restricted or repetitive behaviors. To characterize ASD, investigators often use functional connectivity derived from resting-state functional magnetic resonance imaging of the brain. However, participants' head motion during the scanning session can induce motion artifacts. Many studies remove scans with excessive motion, which can lead to drastic reductions in sample size and introduce selection bias. To avoid such exclusions, we propose an estimand inspired by causal inference methods that quantifies the difference in average functional connectivity in autistic and non-ASD children while standardizing motion relative to the low-motion distribution in scans that pass motion quality control. We introduce a nonparametric estimator for motion control, called MoCo, that uses all participants and flexibly models the impacts of motion and other relevant features using an ensemble of machine learning methods. We establish large-sample efficiency and multiple robustness of our proposed estimator. The framework is applied to estimate the difference in functional connectivity between 132 autistic and 245 non-ASD children, of which 34 and 126 pass motion quality control. MoCo appears to dramatically reduce motion artifacts relative to no participant removal, while more efficiently utilizing participant data and accounting for possible selection biases relative to the naïve approach with participant removal.
 [4] arXiv:2406.13151 [pdf, html, other]

Title: von Mises Quasi-Processes for Bayesian Circular Regression
Comments: Contribution to the Structured Probabilistic Inference & Generative Modeling workshop of ICML 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
The need for regression models to predict circular values arises in many scientific fields. In this work we explore a family of expressive and interpretable distributions over circle-valued random functions related to Gaussian processes targeting two Euclidean dimensions conditioned on the unit circle. The resulting probability model has connections with continuous spin models in statistical physics. Moreover, its density is very simple and has maximum entropy, unlike previous Gaussian process-based approaches, which use wrapping or radial marginalization. For posterior inference, we introduce a new Stratonovich-like augmentation that lends itself to fast Markov Chain Monte Carlo sampling. We argue that transductive learning in these models favors a Bayesian approach to the parameters. We present experiments applying this model to the prediction of (i) wind directions and (ii) the percentage of the running gait cycle as a function of joint angles.
 [5] arXiv:2406.13154 [pdf, html, other]

Title: Conditional score-based diffusion models for solving inverse problems in mechanics
Agnimitra Dasgupta, Harisankar Ramaswamy, Javier Murgoitio Esandi, Ken Foo, Runze Li, Qifa Zhou, Brendan Kennedy, Assad Oberai
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We propose a framework to perform Bayesian inference using conditional score-based diffusion models to solve a class of inverse problems in mechanics involving the inference of a specimen's spatially varying material properties from noisy measurements of its mechanical response to loading. Conditional score-based diffusion models are generative models that learn to approximate the score function of a conditional distribution using samples from the joint distribution. More specifically, the score functions corresponding to multiple realizations of the measurement are approximated using a single neural network, the so-called score network, which is subsequently used to sample the posterior distribution using an appropriate Markov chain Monte Carlo scheme based on Langevin dynamics. Training the score network only requires simulating the forward model. Hence, the proposed approach can accommodate black-box forward models and complex measurement noise. Moreover, once the score network has been trained, it can be reused to solve the inverse problem for different realizations of the measurements. We demonstrate the efficacy of the proposed approach on a suite of high-dimensional inverse problems in mechanics that involve inferring heterogeneous material properties from noisy measurements. Some examples we consider involve synthetic data, while others include data collected from actual elastography experiments. Further, our applications demonstrate that the proposed approach can handle different measurement modalities, complex patterns in the inferred quantities, non-Gaussian and non-additive noise models, and nonlinear black-box forward models. The results show that the proposed framework can solve large-scale physics-based inverse problems efficiently.
 [6] arXiv:2406.13197 [pdf, html, other]

Title: Representation Transfer Learning for Semiparametric Regression
Comments: 42 pages, 11 figures, 5 tables
Subjects: Methodology (stat.ME)
We propose a transfer learning method that utilizes data representations in a semiparametric regression model. Our aim is to perform statistical inference on the parameter of primary interest in the target model while accounting for potential nonlinear effects of confounding variables. We leverage knowledge from source domains, assuming that the sample size of the source data is substantially larger than that of the target data. This knowledge transfer is carried out by the sharing of data representations, predicated on the idea that there exists a set of latent representations transferable from the source to the target domain. We address model heterogeneity between the source and target domains by incorporating domain-specific parameters in their respective models. We establish sufficient conditions for the identifiability of the models and demonstrate that the estimator for the primary parameter in the target model is both consistent and asymptotically normal. These results lay the theoretical groundwork for making statistical inferences about the main effects. Our simulation studies highlight the benefits of our method, and we further illustrate its practical applications using real-world data.
 [7] arXiv:2406.13310 [pdf, html, other]

Title: A finite-infinite shared atoms nested model for the Bayesian analysis of large grouped data
Subjects: Methodology (stat.ME); Applications (stat.AP)
The use of hierarchical mixture priors with shared atoms has recently flourished in the Bayesian literature for partially exchangeable data. Leveraging nested levels of mixtures, these models allow the estimation of a two-layered data partition: across groups and across observations. This paper discusses and compares the properties of such modeling strategies when the mixing weights are assigned either a finite-dimensional Dirichlet distribution or a Dirichlet process prior. Based on these considerations, we introduce a novel hierarchical nonparametric prior based on a finite set of shared atoms, a specification that enhances the flexibility of the induced random measures and the availability of fast posterior inference. To support these findings, we analytically derive the induced prior correlation structure and partially exchangeable partition probability function. Additionally, we develop a novel mean-field variational algorithm for posterior inference to boost the applicability of our nested model to large multivariate data. We then assess and compare the performance of the different shared-atom specifications via simulation. We also show that our variational proposal is highly scalable and that the accuracy of the posterior density estimate and the estimated partition is comparable with state-of-the-art Gibbs sampler algorithms. Finally, we apply our model to a real dataset of Spotify's song features, simultaneously segmenting artists and songs with similar characteristics.
 [8] arXiv:2406.13425 [pdf, html, other]

Title: Coupled Input-Output Dimension Reduction: Application to Goal-oriented Bayesian Experimental Design and Global Sensitivity Analysis
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We introduce a new method to jointly reduce the dimension of the input and output space of a high-dimensional function. Choosing a reduced input subspace influences which output subspace is relevant and vice versa. Conventional methods focus on reducing either the input or output space, even though both are often reduced simultaneously in practice. Our coupled approach naturally supports goal-oriented dimension reduction, where either an input or output quantity of interest is prescribed. We consider, in particular, goal-oriented sensor placement and goal-oriented sensitivity analysis, which can be viewed as dimension reduction where the most important output or, respectively, input components are chosen. Both applications present difficult combinatorial optimization problems with expensive objectives such as the expected information gain and Sobol indices. By optimizing gradient-based bounds, we can determine the most informative sensors and most sensitive parameters as the largest diagonal entries of some diagnostic matrices, thus bypassing the combinatorial optimization and objective evaluation.
 [9] arXiv:2406.13447 [pdf, other]

Title: High-probability minimax lower bounds
Comments: 37 pages, 3 figures
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
The minimax risk is often considered as a gold standard against which we can compare specific statistical procedures. Nevertheless, as has been observed recently in robust and heavy-tailed estimation problems, the inherent reduction of the (random) loss to its expectation may entail a significant loss of information regarding its tail behaviour. In an attempt to avoid such a loss, we introduce the notion of a minimax quantile, and seek to articulate its dependence on the quantile level. To this end, we develop high-probability variants of the classical Le Cam and Fano methods, as well as a technique to convert local minimax risk lower bounds to lower bounds on minimax quantiles. To illustrate the power of our framework, we deploy our techniques on several examples, recovering recent results in robust mean estimation and stochastic convex optimisation, as well as obtaining several new results in covariance matrix estimation, sparse linear regression, nonparametric density estimation and isotonic regression. Our overall goal is to argue that minimax quantiles can provide a finer-grained understanding of the difficulty of statistical problems, and that, in wide generality, lower bounds on these quantities can be obtained via user-friendly tools.
 [10] arXiv:2406.13478 [pdf, html, other]

Title: Semiparametric Localized Principal Stratification Analysis with Continuous Strata
Subjects: Methodology (stat.ME)
Principal stratification is essential for revealing causal mechanisms involving post-treatment intermediate variables. Principal stratification analysis with continuous intermediate variables is increasingly common but challenging due to the infinite principal strata and the non-identifiability and non-regularity of principal causal effects. Inspired by recent research, we resolve these challenges by first using a flexible copula-based principal score model to identify the principal causal effect under weak principal ignorability. We then target the local functional substitute of the principal causal effect, which is statistically regular and can accurately approximate the principal causal effect with vanishing bandwidth. We simplify the full efficient influence function of the local functional substitute by considering its oracle-scenario alternative. This leads to a computationally efficient and straightforward estimator for the local functional substitute and the principal causal effect with vanishing bandwidth. We prove the double robustness and statistical optimality of our proposed estimator, and derive its asymptotic normality for inferential purposes. We illustrate the appealing statistical performance of our proposed estimator in simulations, and apply it to two real datasets with intriguing scientific discoveries.
 [11] arXiv:2406.13488 [pdf, html, other]

Title: Approximately Equivariant Neural Processes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Equivariant deep learning architectures exploit symmetries in learning problems to improve the sample efficiency of neural-network-based models and their ability to generalise. However, when modelling real-world data, learning problems are often not exactly equivariant, but only approximately. For example, when estimating the global temperature field from weather station observations, local topographical features like mountains break translation equivariance. In these scenarios, it is desirable to construct architectures that can flexibly depart from exact equivariance in a data-driven way. In this paper, we develop a general approach to achieving this using existing equivariant architectures. Our approach is agnostic to both the choice of symmetry group and model architecture, making it widely applicable. We consider the use of approximately equivariant architectures in neural processes (NPs), a popular family of meta-learning models. We demonstrate the effectiveness of our approach on a number of synthetic and real-world regression experiments, showing that approximately equivariant NP models can outperform both their non-equivariant and strictly equivariant counterparts.
 [12] arXiv:2406.13500 [pdf, html, other]

Title: Gradient-Boosted Generalized Linear Models for Conditional Vine Copulas
Subjects: Methodology (stat.ME); Applications (stat.AP)
Vine copulas are flexible dependence models using bivariate copulas as building blocks. If the parameters of the bivariate copulas in the vine copula depend on covariates, one obtains a conditional vine copula. We propose an extension for the estimation of continuous conditional vine copulas, where the parameters of continuous conditional bivariate copulas are estimated sequentially and separately via gradient-boosting. For this purpose, we link covariates via generalized linear models (GLMs) to Kendall's $\tau$ correlation coefficient, from which the corresponding copula parameter can be obtained. Consequently, the gradient-boosting algorithm estimates the copula parameters while providing a natural covariate selection. In a second step, an additional covariate deselection procedure is applied. The performance of the gradient-boosted conditional vine copulas is illustrated in a simulation study. Linear covariate effects in low- and high-dimensional settings are investigated for the conditional bivariate copulas separately and for conditional vine copulas. Moreover, the gradient-boosted conditional vine copulas are applied to the temporal post-processing of ensemble weather forecasts in a low-dimensional setting. The results show that our suggested method is able to outperform the benchmark methods and identifies temporal correlations better. Finally, we provide an R package called boostCopula for this method.
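A schematic Python sketch of the link chain described above. The specific link function from the linear predictor to Kendall's $\tau$ and the componentwise update rule are our illustrative assumptions; $\theta = \sin(\pi\tau/2)$ is the standard conversion from Kendall's $\tau$ for Gaussian copulas.

```python
import numpy as np

def eta_to_tau(eta):
    # Hypothetical GLM-type link mapping a linear predictor into tau in (-1, 1);
    # the paper's link may differ.
    return np.tanh(eta)

def tau_to_theta_gaussian(tau):
    # Kendall's tau -> Gaussian copula correlation parameter.
    return np.sin(np.pi * tau / 2)

def componentwise_boost_step(X, neg_gradient, beta, nu=0.1):
    # Fit each covariate alone to the negative gradient by least squares and
    # update only the best-fitting one with a shrunken step: this per-step
    # selection is what yields the natural covariate selection mentioned above.
    coefs = np.array([x @ neg_gradient / (x @ x) for x in X.T])
    sse = [((neg_gradient - X[:, j] * c) ** 2).sum() for j, c in enumerate(coefs)]
    j = int(np.argmin(sse))
    beta = beta.copy()
    beta[j] += nu * coefs[j]
    return beta

# After boosting, the copula parameter for observation i would be
# tau_to_theta_gaussian(eta_to_tau(X[i] @ beta)).
```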
 [13] arXiv:2406.13513 [pdf, html, other]

Title: Sharp oracle inequalities and universality of the AIC and FPE
Comments: 89 pages, 3 figures
Subjects: Statistics Theory (math.ST); Probability (math.PR)
In two landmark papers, Akaike introduced the AIC and FPE, demonstrating their significant usefulness for prediction. In subsequent seminal works, Shibata developed a notion of asymptotic efficiency and showed that both AIC and FPE are optimal, setting the stage for decades-long developments and research in this area and beyond. Conceptually, the theory of efficiency is universal in the sense that it (formally) only relies on second-order properties of the underlying process $(X_t)_{t\in \mathbb{Z}}$, but, so far, almost all (efficiency) results require the much stronger assumption of a linear process with independent innovations. In this work, we establish sharp oracle inequalities subject only to a very general notion of weak dependence, establishing a universal property of the AIC and FPE. A direct corollary of our inequalities is asymptotic efficiency of these criteria. Our framework contains many prominent dynamical systems such as random walks on the regular group, functionals of iterated random systems, functionals of (augmented) GARCH models of any order, functionals of (Banach space valued) linear processes, possibly infinite memory Markov chains, dynamical systems arising from SDEs, and many more.
 [14] arXiv:2406.13619 [pdf, html, other]

Title: Generative Modeling by Minimizing the Wasserstein-2 Loss
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This paper approaches the unsupervised learning problem by minimizing the second-order Wasserstein loss (the $W_2$ loss). The minimization is characterized by a distribution-dependent ordinary differential equation (ODE), whose dynamics involves the Kantorovich potential between a current estimated distribution and the true data distribution. A main result shows that the time-marginal law of the ODE converges exponentially to the true data distribution. To prove that the ODE has a unique solution, we first construct explicitly a solution to the associated nonlinear Fokker-Planck equation and show that it coincides with the unique gradient flow for the $W_2$ loss. Based on this, a unique solution to the ODE is built from Trevisan's superposition principle and the exponential convergence results. An Euler scheme is proposed for the distribution-dependent ODE and it is shown to correctly recover the gradient flow for the $W_2$ loss in the limit. An algorithm is designed by following the scheme and applying persistent training, which is natural in our gradient-flow framework. In both low- and high-dimensional experiments, our algorithm converges much faster than and outperforms Wasserstein generative adversarial networks, by increasing the level of persistent training appropriately.
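A one-dimensional illustration of such a $W_2$ gradient flow with an explicit Euler step: in 1D the optimal transport map matches order statistics, so the gradient of the Kantorovich potential is available by quantile matching (the paper handles general dimension; the step size and iteration count below are arbitrary choices of ours).

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 2000), rng.normal(2, 0.5, 2000)])
x = rng.normal(0, 1, 4000)                    # current estimated distribution

step = 0.2
for _ in range(100):
    # In 1D, the optimal W2 map T sends the k-th smallest particle to the
    # k-th smallest data point; the Kantorovich potential's gradient at a
    # particle is then x - T(x).
    order = np.argsort(x)
    grad_phi = np.empty_like(x)
    grad_phi[order] = x[order] - np.sort(data)
    x = x - step * grad_phi                   # explicit Euler step of the ODE

print(np.mean(x < 0))                         # ~0.5: mass splits across modes
```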
 [15] arXiv:2406.13635 [pdf, html, other]

Title: Temporal label recovery from noisy dynamical data
Comments: 20 pages, 4 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
Analyzing dynamical data often requires information of the temporal labels, but such information is unavailable in many applications. Recovery of these temporal labels, closely related to the seriation or sequencing problem, becomes crucial in the study. However, challenges arise due to the nonlinear nature of the data and the complexity of the underlying dynamical system, which may be periodic or non-periodic. Additionally, noise within the feature space complicates the theoretical analysis. Our work develops spectral algorithms that leverage manifold learning concepts to recover temporal labels from noisy data. We first construct the graph Laplacian of the data, and then employ the second (and the third) Fiedler vectors to recover temporal labels. This method can be applied to both periodic and aperiodic cases. It also does not require monotone properties on the similarity matrix, which are commonly assumed in existing spectral seriation algorithms. We establish $\ell_{\infty}$ error bounds for our estimators of the temporal labels and ranking, without assumptions on the eigengap. In numerical analysis, our method outperforms spectral seriation algorithms based on a similarity matrix. The performance of our algorithms is further demonstrated on a synthetic biomolecule data example.
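A minimal sketch of the Laplacian-plus-Fiedler-vector recipe sketched above, with an aperiodic toy trajectory; the Gaussian affinity and bandwidth heuristic are our assumptions, not the paper's exact construction.

```python
import numpy as np
from scipy.linalg import eigh

def recover_temporal_order(X, sigma=None, periodic=False):
    # X: (n_times, n_features) snapshots observed in unknown order.
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    if sigma is None:
        sigma = np.median(D2) ** 0.5                      # heuristic bandwidth
    W = np.exp(-D2 / (2 * sigma**2))                      # Gaussian affinity
    L = np.diag(W.sum(1)) - W                             # graph Laplacian
    vals, vecs = eigh(L)                                  # ascending eigenvalues
    if periodic:
        # The second and third Fiedler vectors trace a closed curve; the angle
        # along it gives cyclic temporal labels.
        return np.arctan2(vecs[:, 2], vecs[:, 1])
    return np.argsort(vecs[:, 1]).argsort()               # ranks from 2nd eigenvector

# Toy example: a noisy curve observed in shuffled order.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)
traj = np.c_[np.cos(2 * t), np.sin(3 * t)] + 0.01 * rng.normal(size=(200, 2))
perm = rng.permutation(200)
ranks = recover_temporal_order(traj[perm])   # monotone (up to flip) in true order
```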
 [16] arXiv:2406.13691 [pdf, html, other]

Title: Computationally efficient multilevel Gaussian process regression for functional data observed under completely or partially regular sampling designs
Comments: 48 pages, 3 figures
Subjects: Methodology (stat.ME); Computation (stat.CO)
Gaussian process regression is a frequently used statistical method for flexible yet fully probabilistic nonlinear regression modeling. A common obstacle is its computational complexity which scales poorly with the number of observations. This is especially an issue when applying Gaussian process models to multiple functions simultaneously in various applications of functional data analysis.
We consider a multilevel Gaussian process regression model where a common mean function and individual subject-specific deviations are modeled simultaneously as latent Gaussian processes. We derive exact analytic and computationally efficient expressions for the log-likelihood function and the posterior distributions in the case where the observations are sampled on either a completely or partially regular grid. This enables us to fit the model to large data sets that are currently computationally inaccessible using a standard implementation. We show through a simulation study that our analytic expressions are several orders of magnitude faster compared to a standard implementation, and we provide an implementation in the probabilistic programming language Stan.

 [17] arXiv:2406.13814 [pdf, html, other]

Title: Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches
Comments: 47 pages, 3 tables, 8 figures
Subjects: Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
Missing Not at Random (MNAR) and non-normal data are challenging to handle. Traditional missing data analytical techniques such as full information maximum likelihood estimation (FIML) may fail with non-normal data as they are built on normal distribution assumptions. Two-Stage Robust Estimation (TSRE) does manage non-normal data, but both FIML and TSRE are less explored in longitudinal studies under MNAR conditions with non-normal distributions. Unlike traditional statistical approaches, machine learning approaches do not require distributional assumptions about the data. More importantly, they have shown promise for MNAR data; however, their application in longitudinal studies, addressing both Missing at Random (MAR) and MNAR scenarios, is also underexplored. This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework. These techniques include traditional approaches like FIML and TSRE, machine learning approaches by single imputation (K-Nearest Neighbors and missForest), and machine learning approaches by multiple imputation (mice-cart and miceForest). We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation. Our findings indicate that FIML is most effective for MNAR data among the tested approaches. TSRE excels in handling MAR data, while missForest is only advantageous in limited conditions with a combination of very skewed distributions, very large sample sizes (e.g., n larger than 1000), and low missing data rates.
 [18] arXiv:2406.13833 [pdf, html, other]

Title: Cluster Quilting: Spectral Clustering for Patchwork Learning
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Patchwork learning arises as a new and challenging data collection paradigm where both samples and features are observed in fragmented subsets. Due to technological limits, measurement expense, or multimodal data integration, such patchwork data structures are frequently seen in neuroscience, healthcare, and genomics, among others. Instead of analyzing each data patch separately, it is highly desirable to extract comprehensive knowledge from the whole data set. In this work, we focus on the clustering problem in patchwork learning, aiming at discovering clusters amongst all samples even when some are never jointly observed for any feature. We propose a novel spectral clustering method called Cluster Quilting, consisting of (i) patch ordering that exploits the overlapping structure amongst all patches, (ii) patchwise SVD, (iii) sequential linear mapping of top singular vectors for patch overlaps, followed by (iv) k-means on the combined and weighted singular vectors. Under a sub-Gaussian mixture model, we establish theoretical guarantees via a non-asymptotic misclustering rate bound that reflects both properties of the patchwise observation regime as well as the clustering signal and noise dependencies. We also validate our Cluster Quilting algorithm through extensive empirical studies on both simulated and real data sets in neuroscience and genomics, where it discovers more accurate and scientifically more plausible clusters than other approaches.
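A minimal two-patch sketch of steps (ii)-(iv) above; the patch geometry, dimensions, and least-squares alignment on the overlap are simplified assumptions of ours rather than the paper's exact procedure.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
n, k = 300, 3
centers = rng.normal(size=(k, 40)) * 3
labels = rng.integers(k, size=n)
X = centers[labels] + rng.normal(size=(n, 40))   # full matrix, never fully observed

# Two overlapping patches: samples 0..199 x features 0..24, samples 100..299 x
# features 15..39. Step (ii): patchwise SVD.
U1, _, _ = np.linalg.svd(X[:200, :25], full_matrices=False)
U2, _, _ = np.linalg.svd(X[100:, 15:], full_matrices=False)
V1, V2 = U1[:, :k], U2[:, :k]                    # top singular vectors per patch

# Step (iii): learn a linear map on the overlapping samples 100..199 and map
# patch 2's embedding into patch 1's coordinates.
M, *_ = np.linalg.lstsq(V2[:100], V1[100:200], rcond=None)
embed = np.vstack([V1, V2[100:] @ M])            # rows: all 300 samples

# Step (iv): k-means on the combined singular vectors.
_, assigned = kmeans2(embed, k, minit='++', seed=0)
# 'assigned' clusters all samples, including pairs never jointly observed.
```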
 [19] arXiv:2406.13836 [pdf, html, other]

Title: Mastering Rare Event Analysis: Optimal Subsample Size in Logistic and Cox Regressions
Subjects: Methodology (stat.ME)
In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the optimal subsample size. To bridge this gap, our work introduces tools designed for choosing the optimal subsample size. We focus on three settings: the Cox regression model for survival data with rare events and logistic regression for both balanced and imbalanced datasets. Additionally, we present a novel optimal subsampling procedure tailored for logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets.
 [20] arXiv:2406.13876 [pdf, html, other]

Title: An Empirical Bayes Jackknife Regression Framework for Covariance Matrix Estimation
Comments: 13 pages, 3 figures
Subjects: Methodology (stat.ME)
Covariance matrix estimation, a classical statistical topic, poses significant challenges when the sample size is comparable to or smaller than the number of features. In this paper, we frame covariance matrix estimation as a compound decision problem and apply an optimal decision rule to estimate covariance parameters. To approximate this rule, we introduce an algorithm that integrates jackknife techniques with machine learning regression methods. This algorithm exhibits adaptability across diverse scenarios without relying on assumptions about data distribution. Simulation results and gene network inference from an RNA-seq experiment in mice demonstrate that our approach either matches or surpasses several state-of-the-art methods.
 [21] arXiv:2406.13906 [pdf, html, other]

Title: Semi-supervised Regression Analysis with Model Misspecification and High-dimensional Data
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
The accessibility of vast volumes of unlabeled data has sparked growing interest in semi-supervised learning (SSL) and covariate shift transfer learning (CSTL). In this paper, we present an inference framework for estimating regression coefficients in conditional mean models within both SSL and CSTL settings, while allowing for the misspecification of conditional mean models. We develop an augmented inverse probability weighted (AIPW) method, employing regularized calibrated estimators for both propensity score (PS) and outcome regression (OR) nuisance models, with PS and OR models being sequentially dependent. We show that when the PS model is correctly specified, the proposed estimator achieves consistency, asymptotic normality, and valid confidence intervals, even with possible OR model misspecification and high-dimensional data. Moreover, by suppressing detailed technical choices, we demonstrate that previous methods can be unified within our AIPW framework. Our theoretical findings are verified through extensive simulation studies and a real-world data application.
 [22] arXiv:2406.13936 [pdf, html, other]

Title: Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Modern deep neural networks often require distributed training with many workers due to their large size. As worker numbers increase, communication overheads become the main bottleneck in data-parallel minibatch stochastic gradient methods with per-iteration gradient synchronization. Local gradient methods like Local SGD reduce communication by only syncing after several local steps. Although their convergence is understood in i.i.d. and heterogeneous settings, and batch sizes are known to matter for efficiency and generalization, optimal local batch sizes remain difficult to determine. We introduce adaptive batch size strategies for local gradient methods that increase batch sizes adaptively to reduce minibatch gradient variance. We provide convergence guarantees under homogeneous data conditions and support our claims with image classification experiments, demonstrating the effectiveness of our strategies in training and generalization.
 [23] arXiv:2406.13938 [pdf, html, other]

Title: Coverage of Credible Sets for Regression under Variable Selection
Subjects: Methodology (stat.ME)
We study the asymptotic frequentist coverage of credible sets based on a novel Bayesian approach for a multiple linear regression model under variable selection. We initially ignore the issue of variable selection, which allows us to put a conjugate normal prior on the coefficient vector. The variable selection step is incorporated directly in the posterior through a sparsity-inducing map and uses the induced prior for making an inference instead of the natural conjugate posterior. The sparsity-inducing map minimizes the sum of the squared $\ell_2$-distance weighted by the data matrix and a suitably scaled $\ell_1$-penalty term. We obtain the limiting coverage of various credible regions and demonstrate that a modified credible interval for a component has the exact asymptotic frequentist coverage if the corresponding predictor is asymptotically uncorrelated with other predictors. Through extensive simulation, we provide a guideline for choosing the penalty parameter as a function of the credibility level appropriate for the corresponding coverage. We also show finite-sample numerical results that support the conclusions from the asymptotic theory. Finally, we provide the credInt package that implements the method in R to obtain the credible intervals along with the posterior samples.
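A sketch of pushing conjugate posterior draws through the sparsity-inducing map, reading the map as $\hat\theta(\beta) = \arg\min_\theta \tfrac{1}{2}\|X\theta - X\beta\|_2^2 + \lambda\|\theta\|_1$; the scaling conventions are our reading of the abstract, and scikit-learn's Lasso is used only as a generic solver for this objective.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparsify(beta_draws, X, lam):
    # sklearn's Lasso minimizes (1/(2n))||y - X theta||^2 + alpha*||theta||_1,
    # so alpha = lam / n matches 0.5*||X(theta - beta)||^2 + lam*||theta||_1
    # with the synthetic response y = X @ beta.
    n = X.shape[0]
    lasso = Lasso(alpha=lam / n, fit_intercept=False)
    out = []
    for beta in beta_draws:
        lasso.fit(X, X @ beta)
        out.append(lasso.coef_.copy())
    return np.array(out)

# Credible intervals per coefficient from the induced sparse posterior, e.g.:
# np.percentile(sparsify(draws, X, lam), [2.5, 97.5], axis=0)
```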
 [24] arXiv:2406.13944 [pdf, html, other]

Title: Generalization error of min-norm interpolators in transfer learning
Comments: 53 pages, 2 figures
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
This paper establishes the generalization error of pooled min-$\ell_2$-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well understood. We address this gap by characterizing the bias and variance of pooled min-$\ell_2$-norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal (SSR) ratio lies below a threshold that we characterize explicitly. By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk, and vice versa. We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results.
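For reference, the pooled min-$\ell_2$-norm interpolator itself is one line via the pseudoinverse; the early-fusion (pooled) form is shown below, with illustrative names and shapes of our choosing.

```python
import numpy as np

def min_norm_interpolator(X, y):
    # Minimum-l2-norm solution of X beta = y: the limit of ridge regression as
    # the penalty vanishes; it interpolates whenever the rows of X are
    # linearly independent (the overparameterized regime).
    return np.linalg.pinv(X) @ y

def pooled_interpolator(X_source, y_source, X_target, y_target):
    # Early fusion: stack source and target samples, then interpolate jointly.
    X = np.vstack([X_source, X_target])
    y = np.concatenate([y_source, y_target])
    return min_norm_interpolator(X, y)
```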
 [25] arXiv:2406.13989 [pdf, html, other]

Title: Random pairing MLE for estimation of item parameters in Rasch model
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
The Rasch model, a classical model in item response theory, is widely used in psychometrics to model the relationship between individuals' latent traits and their binary responses on assessments or questionnaires. In this paper, we introduce a new likelihood-based estimator, the random pairing maximum likelihood estimator ($\mathsf{RP\text{-}MLE}$), and its bootstrapped variant, the multiple random pairing MLE ($\mathsf{MRP\text{-}MLE}$), which faithfully estimate the item parameters in the Rasch model. The new estimators have several appealing features compared to existing ones. First, both work for sparse observations, an increasingly important scenario in the big data era. Second, both estimators are provably minimax optimal in terms of finite sample $\ell_{\infty}$ estimation error. Lastly, $\mathsf{RP\text{-}MLE}$ admits precise distributional characterization that allows uncertainty quantification on the item parameters, e.g., construction of confidence intervals for the item parameters. The main idea underlying $\mathsf{RP\text{-}MLE}$ and $\mathsf{MRP\text{-}MLE}$ is to randomly pair user-item responses to form item-item comparisons. This is carefully designed to reduce the problem size while retaining statistical independence. We also provide empirical evidence of the efficacy of the two new estimators using both simulated and real data.
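A toy sketch of the pairing device: in the Rasch model $P(X_{ui}=1) = \sigma(\theta_u - b_i)$, and conditioned on a discordant pair the user ability cancels, $P(X_{ui}=1 \mid X_{ui}+X_{uj}=1) = \sigma(b_j - b_i)$, so randomly paired responses reduce to logistic regression on item contrasts. The estimator below is our simplified reading of that idea, not the paper's exact $\mathsf{RP\text{-}MLE}$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def random_pairing_estimate(R, seed=0):
    # R: (n_users, n_items) binary response matrix.
    n_users, n_items = R.shape
    rows, ys = [], []
    rng = np.random.default_rng(seed)
    for u in range(n_users):
        items = rng.permutation(n_items)
        for i, j in zip(items[::2], items[1::2]):   # random disjoint pairs
            if R[u, i] + R[u, j] == 1:              # keep discordant pairs only
                z = np.zeros(n_items)
                z[j], z[i] = 1.0, -1.0              # logit = b_j - b_i
                rows.append(z)
                ys.append(R[u, i])
    # Weak regularization approximates the MLE; b is identified up to a shift.
    clf = LogisticRegression(C=1e4, fit_intercept=False).fit(np.array(rows), np.array(ys))
    b = clf.coef_.ravel()
    return b - b.mean()

# Simulate Rasch data and recover item difficulties.
rng = np.random.default_rng(1)
theta, b = rng.normal(size=2000), np.linspace(-1.5, 1.5, 20)
R = (rng.random((2000, 20)) < 1 / (1 + np.exp(b - theta[:, None]))).astype(int)
print(np.corrcoef(random_pairing_estimate(R), b)[0, 1])   # close to 1
```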
 [26] arXiv:2406.13995 [pdf, html, other]

Title: Prediction of Unobserved Bifurcation by Unsupervised Extraction of Slowly Time-Varying System Parameter Dynamics from Time Series Using Reservoir Computing
Comments: 17 pages, 7 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
Nonlinear and non-stationary processes are prevalent in various natural and physical phenomena, where system dynamics can change qualitatively due to bifurcation phenomena. Traditional machine learning methods have advanced our ability to learn and predict such systems from observed time series data. However, predicting the behavior of systems with temporal parameter variations without knowledge of true parameter values remains a significant challenge. This study leverages the reservoir computing framework to address this problem by unsupervised extraction of slowly varying system parameters from time series data. We propose a model architecture consisting of a slow reservoir with long timescale internal dynamics and a fast reservoir with short timescale dynamics. The slow reservoir extracts the temporal variation of system parameters, which are then used to predict unknown bifurcations in the fast dynamics. Through experiments using data generated from chaotic dynamical systems, we demonstrate the ability to predict bifurcations not present in the training data. Our approach shows potential for applications in fields such as neuroscience, material science, and weather prediction, where slow dynamics influencing qualitative changes are often unobservable.
 [27] arXiv:2406.14003 [pdf, html, other]

Title: Deep Optimal Experimental Design for Parameter Estimation Problems
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Optimal experimental design is a well-studied field in applied science and engineering. Techniques for estimating such a design are commonly used within the framework of parameter estimation. Nonetheless, in recent years, parameter estimation techniques have been changing rapidly with the introduction of deep learning techniques to replace traditional estimation methods. This in turn requires the adaptation of the optimal experimental design that is associated with these new techniques. In this paper we investigate a new experimental design methodology that uses deep learning. We show that the training of a network as a likelihood-free estimator can be used to significantly simplify the design process and circumvent the need for the computationally expensive bilevel optimization problem that is inherent in optimal experimental design for nonlinear systems. Furthermore, deep design improves the quality of the recovery process for parameter estimation problems. As proof of concept, we apply our methodology to two different systems of ordinary differential equations.
 [28] arXiv:2406.14009 [pdf, html, other]

Title: Confidence Intervals and Simultaneous Confidence Bands Based on Deep Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Deep learning models have significantly improved prediction accuracy in various fields, gaining recognition across numerous disciplines. Yet, an aspect of deep learning that remains insufficiently addressed is the assessment of prediction uncertainty. Producing reliable uncertainty estimators could be crucial in practical terms. For instance, predictions associated with a high degree of uncertainty could be sent for further evaluation. Recent works in uncertainty quantification of deep learning predictions, including Bayesian posterior credible intervals and a frequentist confidence-interval estimation, have proven to yield either invalid or overly conservative intervals. Furthermore, there is currently no method for quantifying uncertainty that can accommodate deep neural networks for survival (time-to-event) data that involves right-censored outcomes. In this work, we provide a valid nonparametric bootstrap method that correctly disentangles data uncertainty from the noise inherent in the adopted optimization algorithm, ensuring that the resulting pointwise confidence intervals or the simultaneous confidence bands are accurate (i.e., valid and not overly conservative). The proposed ad hoc method can be easily integrated into any deep neural network without interfering with the training process. The utility of the proposed approach is illustrated by constructing simultaneous confidence bands for survival curves derived from deep neural networks for survival data with right censoring.
 [29] arXiv:2406.14033 [pdf, other]

Title: Ensembles of Probabilistic Regression Trees
Alexandre Seiller, Éric Gaussier (APTIKAL), Emilie Devijver (APTIKAL), Marianne Clausel (IECL), Sami Alkhoury
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Tree-based ensemble methods such as random forests, gradient-boosted trees, and Bayesian additive regression trees have been successfully used for regression problems in many applications and research studies. In this paper, we study ensemble versions of probabilistic regression trees that provide smooth approximations of the objective function by assigning each observation to each region with respect to a probability distribution. We prove that the ensemble versions of probabilistic regression trees considered are consistent, and experimentally study their bias-variance trade-off and compare them with the state-of-the-art in terms of prediction performance.
 [30] arXiv:2406.14040 [pdf, html, other]

Title: A Practical Diffusion Path for Sampling
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Diffusion models are state-of-the-art methods in generative modeling when samples from a target probability distribution are available, and can be efficiently sampled, using score matching to estimate score vectors guiding a Langevin process. However, in the setting where samples from the target are not available, e.g. when this target's density is known up to a normalization constant, the score estimation task is challenging. Previous approaches rely on Monte Carlo estimators that are either computationally heavy to implement or sample-inefficient. In this work, we propose a computationally attractive alternative, relying on the so-called dilation path, that yields score vectors that are available in closed form. This path interpolates between a Dirac and the target distribution using a convolution. We propose a simple implementation of Langevin dynamics guided by the dilation path, using adaptive step sizes. We illustrate the results of our sampling method on a range of tasks, and show that it performs better than classical alternatives.
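A sketch of Langevin dynamics guided by a dilation path, assuming the path takes the scaling form $\pi_t(x) \propto \pi(x/t)$ so that its score $(1/t)\nabla\log\pi(x/t)$ is available in closed form; the target, schedule, and step-size rule below are our illustrative choices, not the paper's.

```python
import numpy as np

def grad_log_target(x):
    # Score of an unnormalized 1D target: 0.5*N(-3,1) + 0.5*N(3,1), computed
    # stably via the responsibility of the +3 component.
    w = 1 / (1 + np.exp(-6 * x))
    return -(x + 3) + 6 * w

rng = np.random.default_rng(0)
x = np.zeros(5000)                          # start near the Dirac end of the path
for t in np.linspace(0.05, 1.0, 200):       # dilation parameter ramps up to 1
    for _ in range(5):
        eps = 0.5e-2 * t**2                 # crude step size scaling with the
                                            # current scale of pi_t
        score = grad_log_target(x / t) / t  # closed-form score of pi_t
        x = x + eps * score + np.sqrt(2 * eps) * rng.normal(size=x.size)

print(np.mean(x > 0), x[x > 0].mean(), x[x < 0].mean())   # ~0.5, ~+3, ~-3
```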
 [31] arXiv:2406.14071 [pdf, html, other]

Title: Bayesian Bandit Algorithms with Approximate Inference in Stochastic Linear Bandits
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. Nevertheless, their theoretical justification is less investigated in the literature, especially for contextual bandit problems. To fill this gap, we propose a general theoretical framework to analyze stochastic linear bandits in the presence of approximate inference and conduct regret analysis on two Bayesian bandit algorithms, Linear Thompson sampling (LinTS) and the extension of Bayesian Upper Confidence Bound, namely Linear Bayesian Upper Confidence Bound (LinBUCB). We demonstrate that both LinTS and LinBUCB can preserve their original rates of regret upper bound but with a sacrifice of larger constant terms when applied with approximate inference. These results hold for general Bayesian inference approaches, under the assumption that the inference error measured by two different $\alpha$-divergences is bounded. Additionally, by introducing a new definition of well-behaved distributions, we show that LinBUCB improves the regret rate of LinTS from $\tilde{O}(d^{3/2}\sqrt{T})$ to $\tilde{O}(d\sqrt{T})$, matching the minimax optimal rate. To our knowledge, this work provides the first regret bounds in the setting of stochastic linear bandits with bounded approximate inference errors.
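For concreteness, a minimal LinTS loop with exact Gaussian inference; in the approximate-inference setting analyzed above, the posterior draw would instead come from the approximating family. Function names and the toy setup are ours.

```python
import numpy as np

def lin_ts(get_arms, get_reward, T, d, v=1.0, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    V, b = lam * np.eye(d), np.zeros(d)      # posterior precision and moment
    for t in range(T):
        arms = get_arms(t)                   # candidate feature vectors
        Vinv = np.linalg.inv(V)
        theta = rng.multivariate_normal(Vinv @ b, v**2 * Vinv)  # posterior draw
        x = arms[np.argmax(arms @ theta)]    # act greedily w.r.t. the draw
        V += np.outer(x, x)                  # rank-one Bayesian update
        b += get_reward(x) * x
    return np.linalg.inv(V) @ b              # final posterior mean

# Toy run: 5 random arms per round, linear rewards with Gaussian noise.
rng = np.random.default_rng(1)
theta_star = np.array([1.0, -0.5, 0.25])
est = lin_ts(lambda t: rng.normal(size=(5, 3)),
             lambda x: x @ theta_star + 0.1 * rng.normal(), T=500, d=3)
print(est)   # approaches theta_star
```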
 [32] arXiv:2406.14140 [pdf, html, other]

Title: Nonparametric Jackknife Instrumental Variable Estimation and Confounding Robust Surrogate Indices
Subjects: Statistics Theory (math.ST)
Jackknife instrumental variable estimation (JIVE) is a classic method to leverage many weak instrumental variables (IVs) to estimate linear structural models, overcoming the bias of standard methods like two-stage least squares. In this paper, we extend the jackknife approach to nonparametric IV (NPIV) models with many weak IVs. Since NPIV characterizes the structural regression as having residuals projected onto the IV being zero, existing approaches minimize an estimate of the average squared projected residuals, but their estimates are biased under many weak IVs. We introduce an IV splitting device inspired by JIVE to remove this bias, and by carefully studying this split-IV empirical process we establish learning rates that depend on generic complexity measures of the nonparametric hypothesis class. We then turn to leveraging this for semiparametric inference on average treatment effects (ATEs) on unobserved long-term outcomes predicted from short-term surrogates, using historical experiments as IVs to learn this nonparametric predictive relationship even in the presence of confounding between short- and long-term observations. Using split-IV estimates of a debiasing nuisance, we develop asymptotically normal estimates for predicted ATEs, enabling inference.
 [33] arXiv:2406.14145 [pdf, html, other]

Title: Temperature in the Iberian Peninsula: Trend, seasonality, and heterogeneity
Comments: 49 pages, 20 figures
Subjects: Applications (stat.AP); Econometrics (econ.EM)
In this paper, we propose fitting unobserved component models to represent the dynamic evolution of bivariate systems of centre and log-range temperatures obtained monthly from minimum/maximum temperatures observed at a given location. In doing so, the centre and log-range temperature are decomposed into potentially stochastic trends, seasonal, and transitory components. Since our model encompasses deterministic trends and seasonal components as limiting cases, we contribute to the debate on whether stochastic or deterministic components better represent the trend and seasonal components. The methodology is implemented to centre and log-range temperature observed in four locations in the Iberian Peninsula, namely, Barcelona, Coruña, Madrid, and Seville. We show that, at each location, the centre temperature can be represented by a smooth integrated random walk with time-varying slope, while a stochastic level better represents the log-range. We also show that centre and log-range temperature are unrelated. The methodology is then extended to simultaneously model centre and log-range temperature observed at several locations in the Iberian Peninsula. We fit a multilevel dynamic factor model to extract potential commonalities among centre (log-range) temperature while also allowing for heterogeneity in different areas in the Iberian Peninsula. We show that, although the commonality in trends of average temperature is considerable, the regional components are also relevant.
 [34] arXiv:2406.14159 [pdf, html, other]

Title: Enhancing multivariate post-processed visibility predictions utilizing CAMS forecasts
Comments: 23 pages, 10 figures
Subjects: Applications (stat.AP); Machine Learning (stat.ML)
In our contemporary era, meteorological weather forecasts increasingly incorporate ensemble predictions of visibility, a parameter of great importance in aviation, maritime navigation, and air quality assessment, with direct implications for public health. However, this weather variable falls short of the predictive accuracy achieved for other quantities issued by meteorological centers. Therefore, statistical post-processing is recommended to enhance the reliability and accuracy of predictions. By estimating the predictive distributions of the variables with the aid of historical observations and forecasts, one can achieve statistical consistency between true observations and ensemble predictions. Visibility observations, following the recommendation of the World Meteorological Organization, are typically reported in discrete values; hence, the predictive distribution of the weather quantity takes the form of a discrete parametric law. Recent studies demonstrated that the application of classification algorithms can successfully improve the skill of such discrete forecasts; however, a frequently emerging issue is that certain spatial and/or temporal dependencies could be lost between marginals. Based on visibility ensemble forecasts of the European Centre for Medium-Range Weather Forecasts for 30 locations in Central Europe, we investigate whether the inclusion of Copernicus Atmosphere Monitoring Service (CAMS) predictions of the same weather quantity as an additional covariate could enhance the skill of the post-processing methods and whether it contributes to the successful integration of spatial dependence between marginals. Our study confirms that post-processed forecasts are substantially superior to raw and climatological predictions, and the utilization of CAMS forecasts provides a further significant enhancement both in the univariate and multivariate setup.
 [35] arXiv:2406.14182 [pdf, html, other]

Title: Averaging polyhazard models using Piecewise deterministic Monte Carlo with applications to data with long-term survivors
Comments: 22 pages, 9 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
Polyhazard models are a class of flexible parametric models for modelling survival over extended time horizons. Their additive hazard structure allows for flexible, non-proportional hazards whose characteristics can change over time while retaining a parametric form, which allows for survival to be extrapolated beyond the observation period of a study. Significant user input is required, however, in selecting the number of latent hazards to model, their distributions and the choice of which variables to associate with each hazard. The resulting set of models is too large to explore manually, limiting their practical usefulness. Motivated by applications to stroke survivor and kidney transplant patient survival times we extend the standard polyhazard model through a prior structure allowing for joint inference of parameters and structural quantities, and develop a sampling scheme that utilises state-of-the-art Piecewise Deterministic Markov Processes to sample from the resulting trans-dimensional posterior with minimal user tuning.
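The additive hazard structure is compact to state: with latent cumulative hazards $H_k$, overall survival is $S(t) = \exp(-\sum_k H_k(t))$. A small sketch with Weibull components follows; the paper's latent hazards and priors are more general, and these parameter values are ours.

```python
import numpy as np

def survival(t, shapes, scales):
    # Additive (poly)hazard survival with Weibull components: each latent
    # hazard contributes the cumulative hazard H_k(t) = (t / b_k) ** a_k.
    t = np.atleast_1d(t).astype(float)[:, None]
    H = (t / np.asarray(scales)) ** np.asarray(shapes)
    return np.exp(-H.sum(axis=1))

# Two competing latent hazards: early risk (shape < 1) plus late wear-out
# (shape > 1), a pattern a single parametric hazard cannot capture.
print(survival([0.1, 1.0, 5.0, 20.0], shapes=[0.6, 3.0], scales=[50.0, 10.0]))
```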
 [36] arXiv:2406.14184 [pdf, html, other]

Title: On integral priors for multiple comparison in Bayesian model selection
Subjects: Methodology (stat.ME)
Noninformative priors constructed for estimation purposes are usually not appropriate for model selection and testing. The methodology of integral priors was developed to get prior distributions for Bayesian model selection when comparing two models, modifying initial improper reference priors. We propose a generalization of this methodology to more than two models. Our approach adds an artificial copy of each model under comparison by compactifying the parametric space and creating an ergodic Markov chain across all models that returns the integral priors as marginals of the stationary distribution. Besides the guarantee of their existence and the lack of paradoxes attached to estimation reference priors, an additional advantage of this methodology is that the simulation of this Markov chain is straightforward, as it only requires simulations of imaginary training samples for all models and from the corresponding posterior distributions. This renders its implementation automatic and generic, both in the nested case and in the non-nested case.
 [37] arXiv:2406.14269 [pdf, html, other]

Title: Concentration of a sparse Bayesian model with Horseshoe prior in estimating high-dimensional precision matrix
The Tien Mai
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
Precision matrices are crucial in many fields such as social networks, neuroscience, and economics, representing the edge structure of Gaussian graphical models (GGMs), where a zero in an off-diagonal position of the precision matrix indicates conditional independence between nodes. In high-dimensional settings where the dimension of the precision matrix $p$ exceeds the sample size $n$ and the matrix is sparse, methods like graphical Lasso, graphical SCAD, and CLIME are popular for estimating GGMs. While frequentist methods are well-studied, Bayesian approaches for (unstructured) sparse precision matrices are less explored. The graphical horseshoe estimate of Li et al. (2019), applying the global-local horseshoe prior, shows superior empirical performance, but theoretical work for sparse precision matrix estimation using shrinkage priors is limited. This paper addresses these gaps by providing concentration results for the tempered posterior with the fully specified horseshoe prior in high-dimensional settings. Moreover, we also provide novel theoretical results for model misspecification, offering a general oracle inequality for the posterior.
 [38] arXiv:2406.14292 [pdf, html, other]

Title: Proximal Interacting Particle Langevin Algorithms
Comments: 50 pages
Subjects: Computation (stat.CO); Optimization and Control (math.OC); Machine Learning (stat.ML)
We introduce a class of algorithms, termed Proximal Interacting Particle Langevin Algorithms (PIPLA), for inference and learning in latent variable models whose joint probability density is non-differentiable. Leveraging proximal Markov chain Monte Carlo (MCMC) techniques and the recently introduced interacting particle Langevin algorithm (IPLA), we propose several variants within the novel proximal IPLA family, tailored to the problem of estimating parameters in a non-differentiable statistical model. We prove non-asymptotic bounds for the parameter estimates produced by multiple algorithms in the strongly log-concave setting and provide comprehensive numerical experiments on various models to demonstrate the effectiveness of the proposed methods. In particular, we demonstrate the utility of the proposed family of algorithms on a toy hierarchical example where our assumptions can be checked, as well as on the problems of sparse Bayesian logistic regression, sparse Bayesian neural network, and sparse matrix completion. Our theory and experiments together show that the PIPLA family can be the de facto choice for parameter estimation problems in non-differentiable latent variable models.
 [39] arXiv:2406.14302 [pdf, other]

Title: Identifiable Exchangeable Mechanisms for Causal Structure and Representation Learning
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Identifying latent representations or causal structures is important for good generalization and downstream task performance. However, both fields have been developed rather independently. We observe that several methods in both representation and causal structure learning rely on the same data-generating process (DGP), namely, exchangeable but not i.i.d. (independent and identically distributed) data. We provide a unified framework, termed Identifiable Exchangeable Mechanisms (IEM), for representation and structure learning under the lens of exchangeability. IEM provides new insights that let us relax the necessary conditions for causal structure identification in exchangeable non-i.i.d. data. We also demonstrate the existence of a duality condition in identifiable representation learning, leading to new identifiability results. We hope this work will pave the way for further research in causal representation learning.
 [40] arXiv:2406.14426 [pdf, html, other]

Title: Transferable Boltzmann GeneratorsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Chemical Physics (physics.chemph); Computational Physics (physics.compph)
The generation of equilibrium samples of molecular systems has been a longstanding problem in statistical physics. Boltzmann Generators are a generative machine learning method that addresses this issue by learning a transformation via a normalizing flow from a simple prior distribution to the target Boltzmann distribution of interest. Recently, flow matching has been employed to train Boltzmann Generators for small molecular systems in Cartesian coordinates. We extend this work and propose a first framework for Boltzmann Generators that are transferable across chemical space, such that they predict zeroshot Boltzmann distributions for test molecules without being retrained for these systems. These transferable Boltzmann Generators allow approximate sampling from the target distribution of unseen systems, as well as efficient reweighting to the target Boltzmann distribution. The transferability of the proposed framework is evaluated on dipeptides, where we show that it generalizes efficiently to unseen systems. Furthermore, we demonstrate that our proposed architecture enhances the efficiency of Boltzmann Generators trained on single molecular systems.
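The reweighting step mentioned above is standard self-normalized importance sampling. A minimal sketch, assuming the sample energies and the flow's log-densities are already computed; the helper name is hypothetical:

```python
import numpy as np

def boltzmann_weights(energies, log_q, beta=1.0):
    """Self-normalized importance weights for reweighting generator samples
    to a Boltzmann target p(x) proportional to exp(-beta * U(x)).
    `energies` holds U(x_i); `log_q` the flow's log-density at the x_i."""
    log_w = -beta * np.asarray(energies) - np.asarray(log_q)
    log_w -= log_w.max()              # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# Usage: for samples x_i drawn from the generator,
# E_p[f] is estimated by sum_i weights[i] * f(x_i).
```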
 [41] arXiv:2406.14451 [pdf, html, other]

Title: Gradient Estimation via Differentiable Metropolis-Hastings
Comments: 27 pages, 3 figures
Subjects: Statistics Theory (math.ST); Probability (math.PR); Computation (stat.CO)
Metropolis-Hastings estimates intractable expectations; can differentiating the algorithm estimate their gradients? The challenge is that Metropolis-Hastings trajectories are not conventionally differentiable due to the discrete accept/reject steps. Using a technique based on recoupling chains, our method differentiates through the Metropolis-Hastings sampler itself, allowing us to estimate gradients of otherwise intractable expectations with respect to a parameter. Our main contribution is a proof of strong consistency and a central limit theorem for our estimator under assumptions that hold in common Bayesian inference problems. The proofs augment the sampler chain with latent information and formulate the estimator as a stopping tail functional of this augmented chain. We demonstrate our method on examples of Bayesian sensitivity analysis and optimizing a random walk Metropolis proposal.
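To see where the non-differentiability arises, here is a plain random-walk Metropolis sampler; the discrete accept/reject branch is the step the paper's recoupling construction differentiates through. This is the baseline sampler only, not the paper's gradient estimator:

```python
import numpy as np

def random_walk_metropolis(log_target, x0, n_steps, step=0.5, seed=0):
    """Plain random-walk Metropolis. The accept/reject branch below is the
    discrete, non-differentiable step that blocks naive differentiation."""
    rng = np.random.default_rng(seed)
    x, chain = x0, []
    for _ in range(n_steps):
        prop = x + step * rng.standard_normal()
        # Discrete decision: not a differentiable function of parameters.
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
        chain.append(x)
    return np.array(chain)

# The expectation below is the kind of quantity whose parameter-gradient
# the paper's recoupling-based estimator targets.
chain = random_walk_metropolis(lambda z: -0.5 * z**2, x0=0.0, n_steps=5000)
print(chain.mean())
```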
 [42] arXiv:2406.14453 [pdf, html, other]

Title: The Effective Number of Parameters in Kernel Density Estimation
Subjects: Methodology (stat.ME)
The quest for a formula that satisfactorily measures the effective degrees of freedom in kernel density estimation (KDE) is a long-standing problem with few solutions. Starting from the orthogonal polynomial sequence (OPS) expansion for the ratio of the empirical to the oracle density, we show how convolution with the kernel leads to a new OPS with respect to which one may express the resulting KDE. The expansion coefficients of the two OPS systems can then be related via a kernel sensitivity matrix, which naturally leads to a definition of effective parameters by taking the trace of a symmetrized, normalized, positive semi-definite version of that matrix. The resulting effective degrees of freedom (EDoF) formula is an oracle-based quantity, the first ever proposed in the literature. Asymptotic properties of the empirical EDoF are worked out through influence functions. Numerical investigations confirm the theoretical insights.
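A simpler, classical cousin of this idea counts effective parameters as the trace of the row-normalized kernel smoother matrix; the sketch below shows that notion for intuition only, not the paper's oracle-based EDoF:

```python
import numpy as np

def kde_effective_parameters(x, h):
    """Trace of the row-normalized Gaussian-kernel smoother matrix: a
    classical effective-parameter count for KDE, shown for intuition."""
    d = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * d**2)
    S = K / K.sum(axis=1, keepdims=True)
    return float(np.trace(S))

x = np.random.default_rng(1).standard_normal(200)
for h in (0.05, 0.2, 1.0):
    print(h, kde_effective_parameters(x, h))   # shrinks as h grows
```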
 [43] arXiv:2406.14535 [pdf, html, other]

Title: On estimation and order selection for multivariate extremes via clustering
Comments: 31 pages, 12 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We investigate the estimation of multivariate extreme models with a discrete spectral measure using spherical clustering techniques. The primary contribution involves devising a method for selecting the order, that is, the number of clusters. The method consistently identifies the true order, i.e., the number of spectral atoms, and enjoys intuitive implementation in practice. Specifically, we introduce an extra penalty term to the well-known simplified average silhouette width, which penalizes small cluster sizes and small dissimilarities between cluster centers. Consequently, we provide a consistent method for determining the order of a max-linear factor model, where a typical information-based approach is not viable. Our second contribution is a large-deviation-type analysis for estimating the discrete spectral measure through clustering methods, which serves as an assessment of the convergence quality of clustering-based estimation for multivariate extremes. Additionally, as a third contribution, we discuss how estimating the discrete measure can lead to parameter estimates of heavy-tailed factor models. We also present simulation and real-data studies that demonstrate order selection and factor model estimation.
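A rough sketch of the order-selection loop, with ordinary k-means and a generic small-cluster penalty standing in for the paper's spherical clustering and its specific penalty on cluster sizes and center dissimilarities:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_order(angles, k_max=8, alpha=0.5):
    """Order selection by a penalized silhouette criterion. The simple
    small-cluster penalty below is illustrative, not the paper's exact term."""
    best_k, best_val = 2, -np.inf
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(angles)
        sizes = np.bincount(labels, minlength=k)
        val = silhouette_score(angles, labels) - alpha / max(sizes.min(), 1)
        if val > best_val:
            best_k, best_val = k, val
    return best_k

# Here `angles` would contain the angular parts x / ||x|| of observations
# with largest norm, the empirical proxy for the spectral measure.
```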
New submissions for Friday, 21 June 2024 (showing 43 of 43 entries)
 [44] arXiv:2406.12908 (cross-list from cs.LG) [pdf, html, other]

Title: Rating Multi-Modal Time-Series Forecasting Models (MM-TSFM) for Robustness Through a Causal Lens
Kausik Lakkaraju, Rachneet Kaur, Zhen Zeng, Parisa Zehtabi, Sunandita Patra, Biplav Srivastava, Marco Valtorta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
AI systems are notorious for their fragility; minor input changes can potentially cause major output swings. When such systems are deployed in critical areas like finance, the consequences of their uncertain behavior could be severe. In this paper, we focus on multi-modal time-series forecasting, where imprecision due to noisy or incorrect data can lead to erroneous predictions, impacting stakeholders such as analysts, investors, and traders. Recently, it has been shown that beyond numeric data, graphical transformations can be used with advanced visual models to achieve better performance. In this context, we introduce a rating methodology to assess the robustness of multi-modal time-series forecasting models (MM-TSFM) through causal analysis, which helps us understand and quantify the isolated impact of various attributes on the forecasting accuracy of MM-TSFM. We apply our novel rating method to a variety of numeric and multi-modal forecasting models in a large experimental setup (six input settings of control and perturbations, ten data distributions, time series from six leading stocks in three industries over a year of data, and five time-series forecasters) to draw insights on robust forecasting models and the context of their strengths. Within the scope of our study, our main result is that multi-modal (numeric + visual) forecasting, which was found to be more accurate than numeric forecasting in previous studies, can also be more robust in diverse settings. Our work will help different stakeholders of time-series forecasting understand the models' behaviors along trust (robustness) and accuracy dimensions to select an appropriate model for forecasting using our rating method, leading to improved decision-making.
 [45] arXiv:2406.12911 (cross-list from cs.LG) [pdf, html, other]

Title: The Promise of Analog Deep Learning: Recent Advances, Challenges and Opportunities
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Much of present-day Artificial Intelligence (AI) utilizes artificial neural networks, which are sophisticated computational models designed to recognize patterns and solve complex problems by learning from data. However, a major bottleneck occurs during a device's calculation of weighted sums for forward propagation and the optimization procedure for backpropagation, especially for deep neural networks, or networks with numerous layers. Exploration of different methods for implementing neural networks is necessary for further advancement of the area. While a great deal of research on AI hardware exists in both directions, analog and digital implementations, much of the existing survey work lacks discussion of the progress of analog deep learning. To this end, we attempt to evaluate and specify the advantages and disadvantages, along with the current progress, of analog implementations of deep learning. In this paper, our focus lies on the comprehensive examination of eight distinct analog deep learning methodologies across multiple key parameters. These parameters include attained accuracy levels, application domains, algorithmic advancements, computational speed, and considerations of energy efficiency and power consumption. We also identify the neural network-based experiments implemented using these hardware devices and discuss the comparative performance achieved by the different analog deep learning methods, along with an analysis of their current limitations. Overall, we find that analog deep learning has great potential for future consumer-level applications, but there is still a long road ahead in terms of scalability. Most of the current implementations are proofs of concept and are not yet practically deployable for large-scale models.
 [46] arXiv:2406.12916 (cross-list from cs.LG) [pdf, html, other]

Title: Opening the Black Box: predicting the trainability of deep neural networks with reconstruction entropy
Comments: 22 pages, 5 figures, 1 table
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
An important challenge in machine learning is to predict the initial conditions under which a given neural network will be trainable. We present a method for predicting the trainable regime in parameter space for deep feedforward neural networks, based on reconstructing the input from subsequent activation layers via a cascade of single-layer auxiliary networks. For both MNIST and CIFAR-10, we show that a single epoch of training of the shallow cascade networks is sufficient to predict the trainability of the deep feedforward network, thereby providing a significant reduction in overall training time. We achieve this by computing the relative entropy between reconstructed images and the original inputs, and show that this probe of information loss is sensitive to the phase behaviour of the network. Our results provide a concrete link between the flow of information and the trainability of deep neural networks, further elucidating the role of criticality in these systems.
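One plausible concrete reading of the probe, comparing pixel-intensity distributions of an input and its reconstruction; the exact normalization used in the paper may differ:

```python
import numpy as np

def reconstruction_relative_entropy(x, x_hat, eps=1e-12):
    """KL divergence between the pixel-intensity distributions of an input
    and its reconstruction, one concrete reading of the information-loss
    probe; shown as an assumption-laden sketch, not the paper's code."""
    p = np.clip(np.ravel(x), eps, None)
    q = np.clip(np.ravel(x_hat), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```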
 [47] arXiv:2406.12945 (cross-list from cs.LG) [pdf, other]

Title: Under the Hood of Tabular Data Generation Models: the Strong Impact of Hyperparameter Tuning
G. Charbel N. Kindji (IRISA, LACODAM), Lina Maria Rojas-Barahona, Elisa Fromont (IRISA, LACODAM), Tanguy Urvoy
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We investigate the impact of dataset-specific hyperparameter, feature-encoding, and architecture tuning on five recent model families for tabular data generation through an extensive benchmark on 16 datasets. This study addresses the practical need for a unified evaluation of models that fully considers hyperparameter optimization. Additionally, we propose a reduced search space for each model that allows for quick optimization, achieving nearly equivalent performance at a significantly lower cost. Our benchmark demonstrates that, for most models, large-scale dataset-specific tuning substantially improves performance compared to the original configurations. Furthermore, we confirm that diffusion-based models generally outperform other models on tabular data. However, this advantage is not significant when the entire tuning and training process is restricted to the same GPU budget for all models.
 [48] arXiv:2406.12949 (cross-list from q-bio.QM) [pdf, html, other]

Title: Integrating time-resolved $nrf2$ gene-expression data into a full GUTS model as a proxy for toxicodynamic damage in zebrafish embryo
Subjects: Quantitative Methods (q-bio.QM); Dynamical Systems (math.DS); Applications (stat.AP)
The immense production of the chemical industry requires an improved predictive risk assessment that can handle constantly evolving challenges while reducing the dependency of risk assessment on animal testing. Integrating 'omics data into mechanistic models offers a promising solution by linking cellular processes triggered after chemical exposure with observed effects in the organism. With the emerging availability of time-resolved RNA data, the goal of integrating gene expression data into mechanistic models can be approached. We propose a biologically anchored TKTD model, which describes key processes that link the gene expression level of the stress regulator $nrf2$ to detoxification and lethality by associating toxicodynamic damage with $nrf2$ expression. Fitting such a model to complex datasets consisting of multiple endpoints required the combination of methods from molecular biology, mechanistic dynamic systems modeling, and Bayesian inference. In this study we successfully integrate time-resolved gene expression data into TKTD models, and thus provide a method for assessing the influence of molecular markers on survival. This novel method was used to test whether $nrf2$ can be applied to predict lethality in zebrafish embryos. With the presented approach we outline a method for successively approaching the goal of a predictive risk assessment based on molecular data.
 [49] arXiv:2406.13012 (cross-list from cs.LG) [pdf, html, other]

Title: Data Plagiarism Index: Characterizing the Privacy Risk of Data-Copying in Tabular Generative Models
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
The promise of tabular generative models is to produce realistic synthetic data that can be shared and safely used without dangerous leakage of information from the training set. In evaluating these models, a variety of methods have been proposed to measure the tendency to copy data from the training dataset when generating a sample. However, these methods suffer from either not considering data-copying from a privacy-threat perspective, not being motivated by recent results in the data-copying literature, or being difficult to make compatible with the high-dimensional, mixed-type nature of tabular data. This paper proposes a new similarity metric and membership inference attack called the Data Plagiarism Index (DPI) for tabular data. We show that DPI evaluates a new intuitive definition of data-copying and characterizes the corresponding privacy risk. We show that the data-copying identified by DPI poses both privacy and fairness threats to common, high-performing architectures, underscoring the necessity for more sophisticated generative modeling techniques to mitigate this issue.
 [50] arXiv:2406.13060 (cross-list from cs.LG) [pdf, html, other]

Title: Scale-Translation Equivariant Network for Oceanic Internal Solitary Wave Localization
Comments: 29 pages, 5 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Internal solitary waves (ISWs) are gravity waves that are often observed in the interior ocean rather than at the surface. They hold significant importance due to their capacity to carry substantial energy, thus influencing pollutant transport, oil platform operations, submarine navigation, etc. Researchers have studied ISWs through optical images, synthetic aperture radar (SAR) images, and altimeter data from remote sensing instruments. However, cloud cover in optical remote sensing images variably obscures ground information, leading to blurred or missing surface observations. As such, this paper aims at altimeter-based machine learning solutions to automatically locate ISWs. The challenges, however, lie in two aspects: 1) the altimeter data has low resolution, which requires a strong machine learner; 2) labeling data is extremely labor-intensive, leading to very limited data for training. In recent years, the rapid progress of deep learning has demonstrated strong learning capacity given abundant data. Moreover, recent studies on efficient learning and self-supervised learning have laid solid foundations for tackling the aforementioned challenges. In this paper, we propose to inject prior knowledge to achieve a strong and efficient learner. Specifically, intrinsic patterns in altimetry data are efficiently captured using a scale-translation equivariant convolutional neural network (ST-ECNN). By considering inherent symmetries in neural network design, ST-ECNN achieves higher efficiency and better performance than baseline models. Furthermore, we also introduce prior knowledge from massive unsupervised data to enhance our solution using the SimCLR framework for pre-training. Our final solution achieves an overall better performance than baselines on our handcrafted altimetry dataset. Data and codes are available at this https URL.
 [51] arXiv:2406.13074 (cross-list from hep-ph) [pdf, html, other]

Title: PIPPIN: Generating variable length full events from partons
Subjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Machine Learning (stat.ML)
This paper presents a novel approach for directly generating full events at detector level from parton-level information, leveraging cutting-edge machine learning techniques. To address the challenge of multiplicity variations between parton and reconstructed object spaces, we employ transformers, score-based models, and normalizing flows. Our method tackles the inherent complexities of the stochastic transition between these two spaces and achieves remarkably accurate results. The combination of innovative techniques and the achieved accuracy demonstrates the potential of our approach in advancing the field and opens avenues for further exploration. This research contributes to the ongoing efforts in high-energy physics and generative modelling, providing a promising direction for enhanced precision in fast detector simulation.
 [52] arXiv:2406.13130 (cross-list from cs.LG) [pdf, html, other]

Title: Advancing Retail Data Science: Comprehensive Evaluation of Synthetic Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The evaluation of synthetic data generation is crucial, especially in the retail sector, where data accuracy is paramount. This paper introduces a comprehensive framework for assessing synthetic retail data, focusing on fidelity, utility, and privacy. Our approach differentiates between continuous and discrete data attributes, providing precise evaluation criteria. Fidelity is measured through stability and generalizability. Stability ensures synthetic data accurately replicates known data distributions, while generalizability confirms its robustness in novel scenarios. Utility is demonstrated through the synthetic data's effectiveness in critical retail tasks such as demand forecasting and dynamic pricing, proving its value in predictive analytics and strategic planning. Privacy is safeguarded using differential privacy, ensuring synthetic data maintains a balance between resembling the training and holdout datasets without compromising security. Our findings validate that this framework provides reliable and scalable evaluation for synthetic retail data. It ensures high fidelity, utility, and privacy, making it an essential tool for advancing retail data science. This framework meets the evolving needs of the retail industry with precision and confidence, paving the way for future advancements in synthetic data methodologies.
 [53] arXiv:2406.13371 (cross-list from cs.LG) [pdf, other]

Title: Identifiable Causal Representation Learning: Unsupervised, Multi-View, and Multi-Environment
Comments: PhD Thesis; 190 pages, 33 figures, 6 tables
Journal-ref: University of Cambridge, 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Causal models provide rich descriptions of complex systems as sets of mechanisms by which each variable is influenced by its direct causes. They support reasoning about manipulating parts of the system and thus hold promise for addressing some of the open challenges of artificial intelligence (AI), such as planning, transferring knowledge in changing environments, or robustness to distribution shifts. However, a key obstacle to more widespread use of causal models in AI is the requirement that the relevant variables be specified a priori, which is typically not the case for the highdimensional, unstructured data processed by modern AI systems. At the same time, machine learning (ML) has proven quite successful at automatically extracting useful and compact representations of such complex data. Causal representation learning (CRL) aims to combine the core strengths of ML and causality by learning representations in the form of latent variables endowed with causal model semantics.
In this thesis, we study and present new results for different CRL settings. A central theme is the question of identifiability: given infinite data, when are representations satisfying the same learning objective guaranteed to be equivalent? This is an important prerequisite for CRL, as it formally characterises if and when a learning task is, at least in principle, feasible. Since learning causal models, even without a representation learning component, is notoriously difficult, we require additional assumptions on the model class or rich data beyond the classical i.i.d. setting. By partially characterising identifiability for different settings, this thesis investigates what is possible for CRL without direct supervision, and thus contributes to its theoretical foundations. Ideally, the developed insights can help inform data collection practices or inspire the design of new practical estimation methods.
 [54] arXiv:2406.13493 (cross-list from cs.LG) [pdf, other]

Title: In-Context In-Context Learning with Transformer Neural Processes
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural processes (NPs) are a powerful family of meta-learning models that seek to approximate the posterior predictive map of the ground-truth stochastic process from which each dataset in a meta-dataset is sampled. There are many cases in which practitioners, besides having access to the dataset of interest, may also have access to other datasets that share similarities with it. In this case, integrating these datasets into the NP can improve predictions. We equip NPs with this functionality and describe this paradigm as in-context in-context learning. Standard NP architectures, such as the convolutional conditional NP (ConvCNP) or the family of transformer neural processes (TNPs), are not capable of in-context in-context learning, as they are only able to condition on a single dataset. We address this shortcoming by developing the in-context in-context learning pseudo-token TNP (ICICL-TNP). The ICICL-TNP builds on the family of PT-TNPs, which utilise pseudo-token-based transformer architectures to sidestep the quadratic computational complexity associated with regular transformer architectures. Importantly, the ICICL-TNP is capable of conditioning on both sets of datapoints and sets of datasets, enabling it to perform in-context in-context learning. We demonstrate the importance of in-context in-context learning and the effectiveness of the ICICL-TNP in a number of experiments.
 [55] arXiv:2406.13668 (cross-list from cs.LG) [pdf, html, other]

Title: Improved bounds for calibration via stronger sign preservation games
Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich, Robert Kleinberg, Princewill Okoroafor
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
A set of probabilistic forecasts is calibrated if each prediction of the forecaster closely approximates the empirical distribution of outcomes on the subset of timesteps where that prediction was made. We study the fundamental problem of online calibrated forecasting of binary sequences, which was initially studied by Foster & Vohra (1998). They derived an algorithm with $O(T^{2/3})$ calibration error after $T$ time steps, and showed a lower bound of $\Omega(T^{1/2})$. These bounds remained stagnant for two decades, until Qiao & Valiant (2021) improved the lower bound to $\Omega(T^{0.528})$ by introducing a combinatorial game called sign preservation and showing that lower bounds for this game imply lower bounds for calibration.
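As a concrete reading of the definition in the opening sentence, the (unnormalized) L1 calibration error of a finite forecast sequence can be computed as below; the aggregation convention is one common choice, not necessarily the one used in the paper's bounds:

```python
import numpy as np

def l1_calibration_error(preds, outcomes):
    """Unnormalized L1 calibration error: for each distinct forecast value
    p, weight |empirical frequency - p| by how often p was issued."""
    preds, outcomes = np.asarray(preds, float), np.asarray(outcomes, float)
    err = 0.0
    for p in np.unique(preds):
        mask = preds == p
        err += mask.sum() * abs(outcomes[mask].mean() - p)
    return err

print(l1_calibration_error([0.5, 0.5, 0.8, 0.8], [1, 0, 1, 1]))  # 0.0 + 0.4
```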
We introduce a strengthening of Qiao & Valiant's game that we call sign preservation with reuse (SPR). We prove that the relationship between SPR and calibrated forecasting is bidirectional: not only do lower bounds for SPR translate into lower bounds for calibration, but algorithms for SPR also translate into new algorithms for calibrated forecasting. In particular, any strategy that improves the trivial upper bound for the value of the SPR game would imply a forecasting algorithm with calibration error exponent less than 2/3, improving Foster & Vohra's upper bound for the first time. Using similar ideas, we then prove a slightly stronger lower bound than that of Qiao & Valiant, namely $\Omega(T^{0.54389})$. Our lower bound is obtained by an oblivious adversary, marking the first $\omega(T^{1/2})$ calibration lower bound for oblivious adversaries.
 [56] arXiv:2406.13725 (cross-list from cs.LG) [pdf, html, other]

Title: Tree-Sliced Wasserstein Distance on a System of Lines
Comments: 33 pages, 6 figures, 2 tables, 4 algorithms
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Sliced Wasserstein (SW) distance in Optimal Transport (OT) is widely used in various applications thanks to its statistical effectiveness and computational efficiency. On the other hand, Tree Wasserstein (TW) and Tree-Sliced Wasserstein (TSW) are instances of OT for probability measures whose ground cost is a tree metric. TSW also has low computational complexity, i.e., linear in the number of edges in the tree. Notably, TSW is identical to SW when the tree is a chain. While SW is prone to loss of topological information of input measures due to relying on one-dimensional projections, TSW is more flexible and has a higher degree of freedom, as choosing a tree rather than a line alleviates the curse of dimensionality in SW. However, for practical applications, popular tree metric sampling methods are heavily built upon given supports, which limits their capacity to adapt to new supports. In this paper, we propose the Tree-Sliced Wasserstein distance on a System of Lines (TSW-SL), which brings a connection between SW and TSW. Compared to SW and TSW, our TSW-SL benefits from the higher degree of freedom of TSW while being suitable for dynamic settings like SW. In TSW-SL, we use a variant of the Radon transform to project measures onto a system of lines, resulting in measures on a space with a tree metric, then leverage TW to efficiently compute distances between them. We empirically verify the advantages of TSW-SL over the traditional SW by conducting a variety of experiments on gradient flows, image style transfer, and generative models.
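For reference, the classical sliced Wasserstein baseline that TSW-SL generalizes: a Monte Carlo version over random one-dimensional projections, assuming equally sized samples. This is the SW baseline only, not TSW-SL:

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=100, seed=0):
    """Monte Carlo sliced 1-Wasserstein distance between equally sized
    point clouds X, Y of shape (n, d): average the 1-D transport costs
    over random projection directions."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(X.shape[1])
        theta /= np.linalg.norm(theta)
        # In 1-D, optimal transport matches sorted projections.
        total += np.abs(np.sort(X @ theta) - np.sort(Y @ theta)).mean()
    return total / n_proj
```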
 [57] arXiv:2406.13762 (cross-list from cs.LG) [pdf, html, other]

Title: Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis
Comments: 33 pages, 5 figures, 12 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
The remarkable success of transformers in sequence modeling tasks, spanning various applications in natural language processing and computer vision, is attributed to the critical role of self-attention. Similar to the development of most deep learning models, the construction of these attention mechanisms relies on heuristics and experience. In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. We then formulate the exact formula for the value matrix in self-attention, theoretically and empirically demonstrating that this value matrix captures the eigenvectors of the Gram matrix of the key vectors in self-attention. Leveraging our kernel PCA framework, we propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination. We empirically demonstrate the advantages of RPC-Attention over softmax attention on ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation tasks.
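The object being re-derived is ordinary softmax self-attention; a minimal single-head version is given below for reference (the kernel PCA derivation and RPC-Attention are not reproduced here):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head softmax self-attention: X is (n, d); the projections
    map to (n, d_k). The paper shows this output projects queries onto
    principal component axes of the keys in a feature space."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # scaled dot products
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                 # row-wise softmax
    return A @ V
```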
 [58] arXiv:2406.13770 (cross-list from cs.LG) [pdf, html, other]

Title: Elliptical Attention
Comments: 38 pages, 7 figures, 12 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision. This dot-product self-attention computes attention weights among the input tokens using Euclidean distance, which makes the model prone to representation collapse and vulnerable to contaminated samples. In this paper, we propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance. In particular, we define a hyper-ellipsoidal neighborhood around each query to increase the attention weights of the tokens lying in the contextually important directions. We term this novel class of attention Elliptical Attention. Our Elliptical Attention provides two benefits: 1) reducing representation collapse, and 2) enhancing the model's robustness, as Elliptical Attention pays more attention to contextually relevant information rather than focusing on some small subset of informative features. We empirically demonstrate the advantages of Elliptical Attention over the baseline dot-product attention and state-of-the-art attention methods on various practical tasks, including object classification, image segmentation, and language modeling across different data modalities.
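A sketch of the core substitution, replacing dot-product scores with a Mahalanobis distance under a caller-supplied positive-definite matrix M; how the paper actually estimates the stretch directions is not reproduced here:

```python
import numpy as np

def mahalanobis_attention(Q, K, V, M):
    """Attention with scores from the Mahalanobis distance
    d(q, k)^2 = (q - k)^T M (q - k); closer query-key pairs under M
    receive larger weights. M is assumed given, not learned here."""
    diff = Q[:, None, :] - K[None, :, :]              # (n, m, d)
    d2 = np.einsum('nmd,de,nme->nm', diff, M, diff)   # squared distances
    scores = -d2 / np.sqrt(Q.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                 # row-wise softmax
    return A @ V
```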
 [59] arXiv:2406.13781 (cross-list from cs.LG) [pdf, html, other]

Title: A Primal-Dual Framework for Transformers and Neural Networks
Comments: Accepted to ICLR 2023; 26 pages, 4 figures, 14 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Self-attention is key to the remarkable success of transformers in sequence modeling tasks, including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN), derived from the batch normalization layer, and 2) the Attention with Scaled Head (Attention-SH), derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications, including image and time-series classification.
 [60] arXiv:2406.13822 (cross-list from q-bio.NC) [pdf, other]

Title: Association of neighborhood disadvantage with cognitive function and cortical disorganization in an unimpaired cohort
Apoorva Safai, Erin Jonaitis, Rebecca E Langhough, William R Buckingham, Sterling C. Johnson, W. Ryan Powell, Amy J. H. Kind, Barbara B. Bendlin, Pallavi Tiwari
Subjects: Neurons and Cognition (q-bio.NC); Applications (stat.AP)
Neighborhood disadvantage is associated with worse health and cognitive outcomes. Morphological similarity networks (MSNs) are a promising approach to elucidate cortical network patterns underlying complex cognitive functions. We hypothesized that MSNs could capture changes in cortical patterns related to neighborhood disadvantage and cognitive function. This cross-sectional study included cognitively unimpaired participants from two large Alzheimer's studies at the University of Wisconsin-Madison. Neighborhood disadvantage status was obtained using the Area Deprivation Index (ADI). Cognitive performance was assessed on memory, processing speed, and executive function. MSNs were constructed for each participant based on the similarity in distribution of cortical thickness across brain regions, followed by computation of local and global network features. Associations of ADI with cognitive scores and MSN features were examined using linear regression and mediation analysis. ADI showed a negative association with category fluency, implicit learning speed, story recall, and modified preclinical Alzheimer's cognitive composite scores, indicating worse cognitive function among those living in more disadvantaged neighborhoods. Local network features of frontal and temporal regions differed based on ADI status. Centrality of the left lateral orbitofrontal region showed a partial mediating effect between neighborhood disadvantage and story recall performance. Our preliminary findings suggest differences in local cortical organization by neighborhood disadvantage, which partially mediated the relationship between ADI and cognitive performance, providing a possible network-based mechanism to, in part, explain the risk for poor cognitive functioning associated with disadvantaged neighborhoods.
 [61] arXiv:2406.13826 (cross-list from econ.EM) [pdf, html, other]

Title: Testing identification in mediation and dynamic treatment models
Comments: 49 pages, 4 figures
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
We propose a test for the identification of causal effects in mediation and dynamic treatment models that is based on two sets of observed variables, namely covariates to be controlled for and suspected instruments, building on the test by Huber and Kueck (2022) for single treatment models. We consider models with a sequential assignment of a treatment and a mediator to assess the direct treatment effect (net of the mediator), the indirect treatment effect (via the mediator), or the joint effect of both treatment and mediator. We establish testable conditions for identifying such effects in observational data. These conditions jointly imply (1) the exogeneity of the treatment and the mediator conditional on covariates and (2) the validity of distinct instruments for the treatment and the mediator, meaning that the instruments do not directly affect the outcome (other than through the treatment or mediator) and are unconfounded given the covariates. Our framework extends to post-treatment sample selection or attrition problems when replacing the mediator with a selection indicator for observing the outcome, enabling joint testing of the selectivity of treatment and attrition. We propose a machine learning-based test to control for covariates in a data-driven manner and analyze its finite sample performance in a simulation study. Additionally, we apply our method to Slovak labor market data and find that our testable implications are not rejected for a sequence of training programs typically considered in dynamic treatment evaluations.
 [62] arXiv:2406.13966 (cross-list from cs.LG) [pdf, html, other]

Title: Causal Inference with Latent Variables: Recent Advances and Future Prospectives
Comments: Accepted by KDD'24 Survey Track
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Causality lays the foundation for the trajectory of our world. Causal inference (CI), which aims to infer intrinsic causal relations among variables of interest, has emerged as a crucial research topic. Nevertheless, the lack of observation of important variables (e.g., confounders, mediators, exogenous variables, etc.) severely compromises the reliability of CI methods. The issue may arise from the inherent difficulty in measuring the variables. Additionally, in observational studies where variables are passively recorded, certain covariates might be inadvertently omitted by the experimenter. Depending on the type of unobserved variables and the specific CI task, various consequences can be incurred if these latent variables are carelessly handled, such as biased estimation of causal effects, incomplete understanding of causal mechanisms, lack of individual-level causal consideration, etc. In this survey, we provide a comprehensive review of recent developments in CI with latent variables. We start by discussing traditional CI techniques when variables of interest are assumed to be fully observed. Afterward, under the taxonomy of circumvention and inference-based methods, we provide an in-depth discussion of various CI strategies to handle latent variables, covering the tasks of causal effect estimation, mediation analysis, counterfactual reasoning, and causal discovery. Furthermore, we generalize the discussion to graph data where interference among units may exist. Finally, we offer fresh aspects for further advancement of CI with latent variables, especially new opportunities in the era of large language models (LLMs).
 [63] arXiv:2406.14026 (cross-list from cs.LG) [pdf, html, other]

Title: Demystifying Forgetting in Language Model Fine-Tuning with Statistical Analysis of Example Associations
Comments: 5 pages
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Language models (LMs) are known to suffer from forgetting of previously learned examples when fine-tuned, breaking the stability of deployed LM systems. Despite efforts on mitigating forgetting, few have investigated whether, and how, forgotten upstream examples are associated with newly learned tasks. Insights into such associations enable efficient and targeted mitigation of forgetting. In this paper, we empirically analyze the forgetting that occurs in $N$ upstream examples while the model learns $M$ new tasks, and visualize these associations with an $M \times N$ matrix. We empirically demonstrate that the degree of forgetting can often be approximated by simple multiplicative contributions of the upstream examples and newly learned tasks. Using statistics and visualization, we also reveal more complicated patterns in which specific subsets of examples are forgotten. Following our analysis, we predict the forgetting that happens on upstream examples when learning a new task via matrix completion over the empirical associations, outperforming prior approaches that rely on trainable LMs. Project website: this https URL
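A generic low-rank completion heuristic of the kind that could be run on a partially observed tasks-by-examples forgetting matrix; the paper's exact completion procedure may differ:

```python
import numpy as np

def low_rank_complete(F, mask, rank=2, n_iter=200):
    """Hard-impute style matrix completion: alternate a rank-r SVD
    projection with re-imposing the observed entries. `F` is the
    tasks-by-examples forgetting matrix, `mask` marks observed cells."""
    X = np.where(mask, F, F[mask].mean())
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-r approximation
        X[mask] = F[mask]                          # keep what was observed
    return X
```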
 [64] arXiv:2406.14059 (cross-list from cs.GT) [pdf, other]

Title: Tracking solutions of time-varying variational inequalities
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Tracking the solution of time-varying variational inequalities is an important problem with applications in game theory, optimization, and machine learning. Existing work considers time-varying games or time-varying optimization problems. For strongly convex optimization problems or strongly monotone games, these results provide tracking guarantees under the assumption that the variation of the time-varying problem is restrained, that is, problems with a sublinear solution path. In this work we extend existing results in two ways: in our first result, we provide tracking bounds for (1) variational inequalities with a sublinear solution path but not necessarily monotone functions, and (2) periodic time-varying variational inequalities that do not necessarily have a sublinear solution path length. Our second main contribution is an extensive study of the convergence behavior and trajectory of discrete dynamical systems of periodic time-varying VIs. We show that these systems can exhibit provably chaotic behavior or can converge to the solution. Finally, we illustrate our theoretical results with experiments.
 [65] arXiv:2406.14062 (cross-list from q-bio.QM) [pdf, html, other]

Title: An agent-based model of behaviour change calibrated to reversal learning data
Comments: 23 pages, 5 figures
Subjects: Quantitative Methods (q-bio.QM); Biological Physics (physics.bio-ph); Computation (stat.CO)
Behaviour change lies at the heart of many observable collective phenomena such as the transmission and control of infectious diseases, adoption of public health policies, and migration of animals to new habitats. Representing the process of individual behaviour change in computer simulations of these phenomena remains an open challenge. Often, computational models use phenomenological implementations with limited support from behavioural data. Without a strong connection to observable quantities, such models have limited utility for simulating observed and counterfactual scenarios of emergent phenomena because they cannot be validated or calibrated. Here, we present a simple stochastic individual-based model of reversal learning that captures fundamental properties of individual behaviour change, namely, the capacity to learn based on accumulated reward signals, and the transient persistence of learned behaviour after rewards are removed or altered. The model has only two parameters, and we use approximate Bayesian computation to demonstrate that they are fully identifiable from empirical reversal learning time series data. Finally, we demonstrate how the model can be extended to account for the increased complexity of behavioural dynamics over longer time scales involving fluctuating stimuli. This work is a step towards the development and evaluation of fully identifiable individual-level behaviour change models that can function as validated submodels for complex simulations of collective behaviour change.
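A minimal two-parameter learner in the spirit described, with a learning rate and a choice-sensitivity parameter under a single reward reversal; this is a generic delta-rule model, not necessarily the paper's exact specification:

```python
import numpy as np

def simulate_reversal_learning(n_trials, alpha, beta, reversal_at, seed=0):
    """Two-parameter stochastic learner: delta-rule value updates with
    learning rate alpha and softmax choice sensitivity beta; the reward
    contingency flips once, at trial `reversal_at`."""
    rng = np.random.default_rng(seed)
    v = np.zeros(2)                                  # option values
    choices = np.empty(n_trials, dtype=int)
    for t in range(n_trials):
        rewarded = 0 if t < reversal_at else 1       # contingency reversal
        p1 = 1.0 / (1.0 + np.exp(-beta * (v[1] - v[0])))
        c = int(rng.uniform() < p1)
        r = float(c == rewarded)
        v[c] += alpha * (r - v[c])                   # accumulate reward signal
        choices[t] = c
    return choices
```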
 [66] arXiv:2406.14163 (cross-list from cs.DB) [pdf, html, other]

Title: A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate Statistics
Subjects: Databases (cs.DB); Methodology (stat.ME)
Ex-post harmonisation is one of many data preprocessing processes used to combine the increasingly vast and diverse sources of data available for research and analysis. Documenting provenance and ensuring the quality of multi-source datasets is vital for ensuring trustworthy scientific research and encouraging reuse of existing harmonisation efforts. However, capturing and communicating statistically relevant properties of harmonised datasets is difficult without a universal standard for describing harmonisation operations. Our paper combines mathematical and computer science perspectives to address this need. The Crossmaps Framework defines a new approach for transforming existing variables collected under a specific measurement or classification standard to an imputed counterfactual variable indexed by some target standard. It uses computational graphs to separate intended transformation logic from actual data transformations, and to avoid the risk of syntactically valid data manipulation scripts resulting in statistically questionable data. In this paper, we introduce the Crossmaps Framework through the example of ex-post harmonisation of aggregated statistics in the social sciences. We define a new provenance task abstraction, the crossmap transform, and formalise two associated objects, the shared mass array and the crossmap. We further define graph, matrix and list encodings of crossmaps and discuss resulting implications for understanding statistical properties of ex-post harmonisation and designing error minimising workflows.
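The crossmap transform can be pictured as redistributing the mass of source-indexed counts through a row-stochastic matrix; a toy sketch with hypothetical numbers, not the framework's actual API:

```python
import numpy as np

def apply_crossmap(source_counts, W):
    """Redistribute counts indexed by a source classification onto a target
    classification via a row-stochastic weight matrix: each source
    category's mass is split across target categories and none is lost."""
    W = np.asarray(W, dtype=float)
    assert np.allclose(W.sum(axis=1), 1.0), "each row must distribute full mass"
    return np.asarray(source_counts, dtype=float) @ W

# Two source occupation codes mapped onto three target codes (toy numbers).
print(apply_crossmap([100.0, 40.0], [[1.0, 0.0, 0.0],
                                     [0.0, 0.5, 0.5]]))  # [100. 20. 20.]
```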
 [67] arXiv:2406.14246 (cross-list from q-bio.QM) [pdf, html, other]

Title: Non-Negative Universal Differential Equations With Applications in Systems Biology
Comments: 6 pages. This work has been submitted to IFAC for possible publication. Initial submission was March 18, 2024
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
Universal differential equations (UDEs) leverage the respective advantages of mechanistic models and artificial neural networks and combine them into one dynamic model. However, these hybrid models can suffer from unrealistic solutions, such as negative values for biochemical quantities. We present non-negative UDEs (nUDEs), a constrained UDE variant that guarantees non-negative values. Furthermore, we explore regularisation techniques to improve generalisation and interpretability of UDEs.
 [68] arXiv:2406.14347 (cross-list from physics.chem-ph) [pdf, other]

Title: $\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
Kuzma Khrabrov, Anton Ber, Artem Tsypin, Konstantin Ushenin, Egor Rumiantsev, Alexander Telepov, Dmitry Protasov, Ilya Shenbin, Anton Alekseev, Mikhail Shirokikh, Sergey Nikolenko, Elena Tutubalina, Artur Kadurin
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT that is based on nablaDFT. It contains twice as many molecular structures, three times as many conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($\omega$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.
 [69] arXiv:2406.14380 (cross-list from econ.EM) [pdf, html, other]

Title: Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach
Subjects: Econometrics (econ.EM); Machine Learning (cs.LG); Methodology (stat.ME)
Recommender systems are essential for content-sharing platforms by curating personalized content. To evaluate updates of recommender systems targeting content creators, platforms frequently engage in creator-side randomized experiments to estimate the treatment effect, defined as the difference in outcomes when a new (vs. the status quo) algorithm is deployed on the platform. We show that the standard difference-in-means estimator can lead to a biased treatment effect estimate. This bias arises because of recommender interference, which occurs when treated and control creators compete for exposure through the recommender system. We propose a "recommender choice model" that captures how an item is chosen among a pool comprised of both treated and control content items. By combining a structural choice model with neural networks, the framework directly models the interference pathway in a microfounded way while accounting for rich viewer-content heterogeneity. Using the model, we construct a double/debiased estimator of the treatment effect that is consistent and asymptotically normal. We demonstrate its empirical performance with a field experiment on the Weixin short-video platform: besides the standard creator-side experiment, we carry out a costly blocked double-sided randomization design to obtain a benchmark estimate without interference bias. We show that the proposed estimator significantly reduces the bias in treatment effect estimates compared to the standard difference-in-means estimator.
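The choice-model ingredient can be pictured as a softmax over an exposure pool containing both treated and control items; a plain multinomial logit stands in here for the paper's neural choice model:

```python
import numpy as np

def choice_probabilities(utilities):
    """Multinomial-logit choice over the exposure pool (treated and control
    items together): the probability each item is recommended to a viewer."""
    u = np.asarray(utilities, dtype=float)
    e = np.exp(u - u.max())                 # stabilized softmax
    return e / e.sum()
```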
 [70] arXiv:2406.14399 (cross-list from cs.LG) [pdf, html, other]

Title: WEATHER-5K: A Large-scale Global Station Weather Dataset Towards Comprehensive Time-series Forecasting Benchmark
Comments: 26 pages, 13 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
Global Station Weather Forecasting (GSWF) is crucial for various sectors, including aviation, agriculture, energy, and disaster preparedness. Recent advancements in deep learning have significantly improved the accuracy of weather predictions by optimizing models based on public meteorological data. However, existing public datasets for GSWF optimization and benchmarking still suffer from significant limitations, such as small sizes, limited temporal coverage, and a lack of comprehensive variables. These shortcomings prevent them from effectively reflecting the benchmarks of current forecasting methods and fail to support the real needs of operational weather forecasting. To address these challenges, we present the WEATHER-5K dataset. This dataset comprises a comprehensive collection of data from 5,672 weather stations worldwide, spanning a 10-year period with one-hour intervals. It includes multiple crucial weather elements, providing a more reliable and interpretable resource for forecasting. Furthermore, our WEATHER-5K dataset can serve as a benchmark for comprehensively evaluating existing well-known forecasting models, extending beyond GSWF methods to support future time-series research challenges and opportunities. The dataset and benchmark implementation are publicly available at: this https URL.
 [71] arXiv:2406.14469 (cross-list from cs.CE) [pdf, html, other]

Title: Fusion of Movement and Naive Predictions for Point Forecasting in Univariate Random Walks
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Traditional methods for point forecasting in univariate random walks often fail to surpass naive benchmarks due to data unpredictability. This study introduces a novel forecasting method that fuses movement prediction (binary classification) with naive forecasts for accurate one-step-ahead point forecasting. The method's efficacy is demonstrated through theoretical analysis, simulations, and real-world data experiments. It reliably exceeds naive forecasts with movement prediction accuracies as low as 0.55, outperforming baseline models such as ARIMA, linear regression, MLP, and LSTM networks in forecasting the S\&P 500 index and Bitcoin prices. This method is particularly advantageous when accurate point predictions are challenging but accurate movement predictions are attainable, translating movement predictions into point forecasts in random walk contexts.
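A minimal version of the fusion idea: step away from the naive forecast (the last observation) in the direction predicted by the movement classifier. The fixed step size is illustrative; the paper's construction may set it differently:

```python
def fused_forecast(y_hist, p_up, delta):
    """Fuse a movement probability with the naive forecast: start from the
    last observation and step by `delta` in the predicted direction."""
    naive = float(y_hist[-1])
    if p_up > 0.5:
        return naive + delta
    if p_up < 0.5:
        return naive - delta
    return naive

print(fused_forecast([101.0, 102.5, 102.0], p_up=0.62, delta=0.4))  # 102.4
```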
Cross submissions for Friday, 21 June 2024 (showing 28 of 28 entries)
 [72] arXiv:2112.07755 (replaced) [pdf, html, other]

Title: Separate Exchangeability as Modeling Principle in Bayesian Nonparametrics
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We argue for the use of separate exchangeability as a modeling principle in Bayesian nonparametric (BNP) inference. Separate exchangeability is \emph{de facto} widely applied in the Bayesian parametric case, e.g., it naturally arises in simple mixed models. However, while separate and the closely related joint exchangeability are widely used in some areas, such as random graphs, separate exchangeability is curiously underused for several other applications in BNP. We briefly review the definition of separate exchangeability, focusing on the implications of such a definition in Bayesian modeling. We then discuss two tractable classes of models that implement separate exchangeability and that are the natural counterparts of familiar partially exchangeable BNP models.
The first is nested random partitions for a data matrix, defining a partition of columns and, nested within column clusters, partitions of rows. Many recent models for nested partitions implement partially exchangeable models related to variations of the well-known nested Dirichlet process. We argue that inference under such models in some cases ignores important features of the experimental setup. We obtain the separately exchangeable counterpart of such partially exchangeable partition structures.
The second class concerns separately exchangeable priors for a nonparametric regression model when multiple sets of experimental units are involved. We highlight how a Dirichlet process mixture of linear models known as the ANOVA DDP can naturally implement separate exchangeability in such regression problems. Finally, we illustrate how to perform inference under such models in two real data examples.
 [73] arXiv:2208.13296 (replaced) [pdf, html, other]

Title: Polynomial time guarantees for sampling based posterior inference in high-dimensional generalised linear models
Comments: Revised and updated version
Subjects: Statistics Theory (math.ST); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Probability (math.PR); Computation (stat.CO)
The problem of computing posterior functionals in general high-dimensional statistical models with possibly non-log-concave likelihood functions is considered. Based on the proof strategy of [49], but using only local likelihood conditions and without relying on M-estimation theory, non-asymptotic statistical and computational guarantees are provided for a gradient-based MCMC algorithm. Given a suitable initialiser, these guarantees scale polynomially in key algorithmic quantities. The abstract results are applied to several concrete statistical models, including density estimation, nonparametric regression with generalised linear models, and a canonical statistical nonlinear inverse problem from PDEs.
 [74] arXiv:2210.14484 (replaced) [pdf, html, other]

Title: Imputation of missing values in multi-view data
Comments: 49 pages, 15 figures. Accepted manuscript
Journal-ref: Information Fusion 111 (2024) 102524
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This may lead to very large quantities of missing data which, especially when combined with high-dimensionality, can make the application of conditional imputation methods computationally infeasible. However, the multi-view structure could be leveraged to reduce the complexity and computational load of imputation. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address the computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets and a real data application. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.
 [75] arXiv:2211.04776 (replaced) [pdf, html, other]

Title: Regularized Rényi divergence minimization through Bregman proximal gradient algorithms
Subjects: Statistics Theory (math.ST)
We study the variational inference problem of minimizing a regularized Rényi divergence over an exponential family, and propose a relaxed moment-matching algorithm, which includes a proximal-like step. Using the information-geometric link between Bregman divergences and the Kullback-Leibler divergence, this algorithm is shown to be equivalent to a Bregman proximal gradient algorithm. This novel perspective allows us to exploit the geometry of our approximate model while using stochastic black-box updates. We use this point of view to prove strong convergence guarantees, including monotonic decrease of the objective, convergence to a stationary point or to the minimizer, and geometric convergence rates. These new theoretical insights lead to a versatile, robust, and competitive method, as illustrated by numerical experiments.
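For reference, the standard Bregman proximal gradient step that such an algorithm instantiates, for a composite objective $f + g$, step size $\gamma$, and Bregman divergence $D_h$ generated by a convex reference function $h$:

```latex
\[
  x_{k+1} \;=\; \arg\min_{x}\,\Big\{ \gamma \,\langle \nabla f(x_k),\, x \rangle
  \;+\; \gamma\, g(x) \;+\; D_h(x, x_k) \Big\},
  \qquad
  D_h(x, y) \;=\; h(x) - h(y) - \langle \nabla h(y),\, x - y \rangle .
\]
```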
 [76] arXiv:2211.15498 (replaced) [pdf, html, other]

Title: Physics-informed Neural Networks with Unknown Measurement Noise
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Physics-informed neural networks (PINNs) constitute a flexible approach to both finding solutions and identifying parameters of partial differential equations. Most works on the topic assume noiseless data, or data contaminated with weak Gaussian noise. We show that the standard PINN framework breaks down in case of non-Gaussian noise. We resolve this fundamental issue by proposing to jointly train an energy-based model (EBM) to learn the correct noise distribution. We illustrate the improved performance of our approach using multiple examples.
 [77] arXiv:2301.02446 (replaced) [pdf, other]

Title: Optimal Scaling Results for Moreau-Yosida Metropolis-adjusted Langevin Algorithms
Subjects: Computation (stat.CO); Probability (math.PR); Statistics Theory (math.ST)
We consider a recently proposed class of MCMC methods which uses proximity maps instead of gradients to build proposal mechanisms that can be employed for both differentiable and non-differentiable targets. These methods have been shown to be stable for a wide class of targets, making them a valuable alternative to Metropolis-adjusted Langevin algorithms (MALA), and they have found wide application in imaging contexts. The wider stability properties are obtained by building the Moreau-Yosida envelope for the target of interest, which depends on a parameter $\lambda$. In this work, we investigate the optimal scaling problem for this class of algorithms, which encompasses MALA, and provide practical guidelines for the implementation of these methods.
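A minimal sketch of one step of such a proximal MALA method, assuming the target $\pi(x) \propto \exp(-\|x\|_1)$ so that the proximity map is coordinatewise soft-thresholding; the step size and $\lambda$ are illustrative and would need the tuning the optimal-scaling analysis informs:

```python
import numpy as np

def soft_threshold(x, t):
    # prox of t*|.| applied coordinatewise (the non-differentiable part)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_mala_step(x, step, lam, rng):
    """One proximal MALA step for pi(x) ∝ exp(-||x||_1), using the
    Moreau-Yosida envelope gradient (x - prox_{lam U}(x)) / lam in
    place of the (nonexistent) gradient of U(x) = ||x||_1."""
    grad = (x - soft_threshold(x, lam)) / lam
    prop = x - 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(x.shape)
    # Metropolis-Hastings correction with the exact target
    grad_p = (prop - soft_threshold(prop, lam)) / lam
    log_q_fwd = -np.sum((prop - x + 0.5 * step * grad) ** 2) / (2 * step)
    log_q_bwd = -np.sum((x - prop + 0.5 * step * grad_p) ** 2) / (2 * step)
    log_alpha = np.sum(np.abs(x)) - np.sum(np.abs(prop)) + log_q_bwd - log_q_fwd
    return prop if np.log(rng.uniform()) < log_alpha else x
```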
 [78] arXiv:2302.03200 (replaced) [pdf, html, other]

Title: Multivariate Bayesian dynamic modeling for causal prediction
Subjects: Methodology (stat.ME)
Bayesian forecasting is developed in multivariate time series analysis for causal inference. Causal evaluation of sequentially observed time series data from control and treated units focuses on the impacts of interventions using contemporaneous outcomes in control units. Methodological developments here concern multivariate dynamic models for time-varying effects across multiple treated units with explicit foci on sequential learning and aggregation of intervention effects. Analysis explores dimension reduction across multiple synthetic counterfactual predictors. Computational advances leverage fully conjugate models for efficient sequential learning and inference, including cross-unit correlations and their time variation. This allows full uncertainty quantification on model hyperparameters via Bayesian model averaging. A detailed case study evaluates interventions in a supermarket promotions experiment, with coupled predictive analyses in selected regions of a large-scale commercial system. Comparisons with existing methods highlight the issues of appropriate uncertainty quantification in causal inference in aggregation across treated units, among other practical concerns.
 [79] arXiv:2303.11706 (replaced) [pdf, html, other]

Title: Lower bounds for the trade-off between bias and mean absolute deviation
Comments: This is an extended version of Section 7 of arXiv:2006.00278v3. The material has been removed from later versions of arXiv:2006.00278
Journal-ref: Statistics and Probability Letters, Volume 213, 110182, 2024
Subjects: Statistics Theory (math.ST)
In nonparametric statistics, rate-optimal estimators typically balance bias and stochastic error. The recent work on overparametrization raises the question whether rate-optimal estimators exist that do not obey this trade-off. In this work we consider pointwise estimation in the Gaussian white noise model with regression function $f$ in a class of $\beta$-Hölder smooth functions. Let 'worst-case' refer to the supremum over all functions $f$ in the Hölder class. It is shown that any estimator with worst-case bias $\lesssim n^{-\beta/(2\beta+1)} =: \psi_n$ must necessarily also have a worst-case mean absolute deviation that is lower bounded by $\gtrsim \psi_n$. To derive the result, we establish abstract inequalities relating the change of expectation for two probability measures to the mean absolute deviation.
 [80] arXiv:2304.09452 (replaced) [pdf, html, other]

Title: Support and distribution inference from noisy data
Subjects: Statistics Theory (math.ST)
We consider noisy observations of a distribution with unknown support. In the deconvolution model, it has been proved recently [19] that, under very mild assumptions, it is possible to solve the deconvolution problem without knowing the noise distribution and with no sample of the noise. We first give general settings where the theory applies and provide classes of supports that can be recovered in this context. We then exhibit classes of distributions over which we prove adaptive minimax rates (up to a log log factor) for the estimation of the support in Hausdorff distance. Moreover, for the class of distributions with compact support, we provide estimators of the unknown (in general singular) distribution and prove maximum rates in Wasserstein distance. We also prove an almost matching lower bound on the associated minimax risk.
 [81] arXiv:2305.02864 (replaced) [pdf, html, other]

Title: Existence and approximation of densities of chord length and cross section area distributions
Comments: 21 pages
Journal-ref: Image Analysis and Stereology, 42 (2023) 171-184
Subjects: Applications (stat.AP)
In various stereological problems an $n$-dimensional convex body is intersected with an $(n-1)$-dimensional Isotropic Uniformly Random (IUR) hyperplane. In this paper the cumulative distribution function associated with the $(n-1)$-dimensional volume of such a random section is studied. This distribution is also known as chord length distribution and cross section area distribution in the planar and spatial case respectively. For various classes of convex bodies it is shown that these distribution functions are absolutely continuous with respect to Lebesgue measure. A Monte Carlo simulation scheme is proposed for approximating the corresponding probability density functions.
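For intuition, such a Monte Carlo scheme can be illustrated on the unit disk, where an IUR line meets the disk at a distance from the centre that is uniformly distributed, so chord lengths can be simulated directly; this toy geometry is an illustrative assumption, not the paper's general algorithm:

```python
import numpy as np

def disk_chord_lengths(n, rng=None):
    """Monte Carlo draws of the IUR chord length of a unit disk:
    by invariance, the (unsigned) distance u of an IUR chord from
    the centre is Uniform(0, 1), giving length 2*sqrt(1 - u^2)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(0.0, 1.0, size=n)
    return 2.0 * np.sqrt(1.0 - u ** 2)

# Histogram approximation to the chord length probability density
lengths = disk_chord_lengths(100_000)
density, edges = np.histogram(lengths, bins=100, range=(0, 2), density=True)
```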
 [82] arXiv:2306.10800 (replaced) [pdf, other]

Title: Multilevel Surrogate-based Control Variates
Mohamed Reda El Amri (IFPEN), Paul Mycek (CERFACS, CONCACE), Sophie Ricci (CERFACS), Matthias De Lozzo
Subjects: Statistics Theory (math.ST)
Monte Carlo (MC) sampling is a popular method for estimating the statistics (e.g. expectation and variance) of a random variable. Its slow convergence has led to the emergence of advanced techniques to reduce the variance of the MC estimator for the outputs of computationally expensive solvers. The control variates (CV) method corrects the MC estimator with a term derived from auxiliary random variables that are highly correlated with the original random variable. These auxiliary variables may come from surrogate models. Such a surrogate-based CV strategy is extended here to the multilevel Monte Carlo (MLMC) framework, which relies on a sequence of levels corresponding to numerical simulators with increasing accuracy and computational cost. MLMC combines output samples obtained across levels into a telescopic sum of differences between MC estimators for successive fidelities. In this paper, we introduce three multilevel variance reduction strategies that rely on surrogate-based CV and MLMC. MLCV is presented as an extension of CV where the correction terms devised from surrogate models for simulators of different levels add up. MLMC-CV improves the MLMC estimator by using a CV based on a surrogate of the correction term at each level. Further variance reduction is achieved by using the surrogate-based CVs of all the levels in the MLMC-MLCV strategy. Alternative solutions that reduce the subset of surrogates used for the multilevel estimation are also introduced. The proposed methods are tested on a test case from the literature consisting of a spectral discretization of an uncertain 1D heat equation, where the statistic of interest is the expected value of the integrated temperature along the domain at a given time. The results are assessed in terms of the accuracy and computational cost of the multilevel estimators, depending on whether the construction of the surrogates, and the associated computational cost, precede the evaluation of the estimator. It was shown that when the lower fidelity outputs are strongly correlated with the high-fidelity outputs, a significant variance reduction is obtained when using surrogate models for the coarser levels only. It was also shown that taking advantage of pre-existing surrogate models proves to be an even more efficient strategy.
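A minimal sketch of the single-level control-variates building block that these multilevel strategies combine; the surrogate output z and its known mean are illustrative placeholders:

```python
import numpy as np

def cv_estimator(y, z, z_mean):
    """Control variates: correct the MC mean of y with an auxiliary
    variable z (e.g. a surrogate-model output) whose mean z_mean is
    known or cheaply estimated."""
    cov = np.cov(y, z, ddof=1)
    alpha = cov[0, 1] / cov[1, 1]  # variance-minimising coefficient
    return np.mean(y) - alpha * (np.mean(z) - z_mean)

# Toy usage: expensive output y = exp(x), cheap surrogate z with known mean
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = np.exp(x)                 # E[y] = exp(0.5)
z = 1 + x + x ** 2 / 2        # second-order surrogate, E[z] = 1.5
print(cv_estimator(y, z, z_mean=1.5))
```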
 [83] arXiv:2306.12949 (replaced) [pdf, html, other]

Title: On the use of the Gram matrix for multivariate functional principal components analysis
Comments: 34 pages, 18 figures, Supplementary: 3 pages
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Dimension reduction is crucial in functional data analysis (FDA). The key tool to reduce the dimension of the data is functional principal component analysis. Existing approaches for functional principal component analysis usually involve the diagonalization of the covariance operator. With the increasing size and complexity of functional datasets, estimating the covariance operator has become more challenging. Therefore, there is a growing need for efficient methodologies to estimate the eigencomponents. Using the duality between the space of observations and the space of functional features, we propose to use the inner product between the curves to estimate the eigenelements of multivariate and multidimensional functional datasets. The relationship between the eigenelements of the covariance operator and those of the inner-product matrix is established. We explore the application of these methodologies in several FDA settings and provide general guidance on their usability.
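The duality can be sketched for discretised curves: the n x n inner-product (Gram) matrix and the p x p covariance matrix share their nonzero eigenvalues, and eigenfunctions are recovered from Gram eigenvectors. A minimal sketch, assuming centred curves on a common grid (shapes and scaling conventions are illustrative):

```python
import numpy as np

def fpca_via_gram(X, n_comp):
    """Estimate eigenfunctions of the (discretised) covariance operator
    from the n x n Gram matrix instead of the p x p covariance matrix.
    X holds n centred curves evaluated on a common grid of p points."""
    n = X.shape[0]
    G = X @ X.T / n                       # inner products between curves
    evals, U = np.linalg.eigh(G)          # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:n_comp]
    evals, U = evals[idx], U[:, idx]
    # if G u = l u, then v = X^T u / sqrt(n l) is a unit eigenvector
    # of the covariance matrix X^T X / n with the same eigenvalue l
    phi = X.T @ U / np.sqrt(n * np.maximum(evals, 1e-12))
    return evals, phi
```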
 [84] arXiv:2307.05732 (replaced) [pdf, html, other]

Title: From isotonic to Lipschitz regression: a new interpolative perspective on shape-restricted estimation
Subjects: Methodology (stat.ME)
This manuscript seeks to bridge two seemingly disjoint paradigms of nonparametric regression estimation based on smoothness assumptions and shape constraints. The proposed approach is motivated by a conceptually simple observation: every Lipschitz function is a sum of monotonic and linear functions. This principle is further generalized to higher-order monotonicity and multivariate covariates. A family of estimators is proposed based on a sample-splitting procedure, which inherits desirable methodological, theoretical, and computational properties of shape-restricted estimators. Our theoretical analysis provides convergence guarantees for the estimator under heteroscedastic and heavy-tailed errors, as well as adaptive properties to the complexity of the true regression function. The generality of the proposed decomposition framework is demonstrated through new approximation results, and extensive numerical studies validate the theoretical properties and provide empirical evidence for the practicality of the proposed estimation framework.
 [85] arXiv:2308.16681 (replaced) [pdf, html, other]

Title: One Model Many Scores: Using Multiverse Analysis to Prevent Fairness Hacking and Evaluate the Influence of Model Design Decisions
Journal-ref: FAccT '24: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (2024) 1305-1320
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
A vast number of systems across the world use algorithmic decision making (ADM) to (partially) automate decisions that have previously been made by humans. The downstream effects of ADM systems critically depend on the decisions made during a system's design, implementation, and evaluation, as biases in data can be mitigated or reinforced along the modeling pipeline. Many of these decisions are made implicitly, without knowing exactly how they will influence the final system. To study this issue, we draw on insights from the field of psychology and introduce the method of multiverse analysis for algorithmic fairness. In our proposed method, we turn implicit decisions during design and evaluation into explicit ones and demonstrate their fairness implications. By combining decisions, we create a grid of all possible "universes" of decision combinations. For each of these universes, we compute metrics of fairness and performance. Using the resulting dataset, one can investigate the variability and robustness of fairness scores and see how and which decisions impact fairness. We demonstrate how multiverse analyses can be used to better understand fairness implications of design and evaluation decisions using an exemplary case study of predicting public health care coverage for vulnerable populations. Our results highlight how decisions regarding the evaluation of a system can lead to vastly different fairness metrics for the same model. This is problematic, as a nefarious actor could optimise or "hack" a fairness metric to portray a discriminating model as fair merely by changing how it is evaluated. We illustrate how a multiverse analysis can help to address this issue.
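A minimal sketch of the multiverse grid itself: each design or evaluation decision becomes an explicit axis and every combination is enumerated. The decision options and the evaluate() stub below are placeholders, not the decisions studied in the paper:

```python
import itertools

# Each key is one explicit decision; each value lists its options.
decisions = {
    "train_test_split": [0.7, 0.8],
    "fairness_metric": ["demographic_parity", "equalized_odds"],
    "threshold": [0.4, 0.5, 0.6],
}

def evaluate(universe):
    """Stub: fit and evaluate the model under one decision combination,
    returning its fairness and performance scores."""
    raise NotImplementedError

# One "universe" per combination of decisions
universes = [dict(zip(decisions, combo))
             for combo in itertools.product(*decisions.values())]
# results = [evaluate(u) for u in universes]  # one row per universe
```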
 [86] arXiv:2309.01404 (replaced) [pdf, html, other]

Title: Hierarchical Regression Discontinuity Design: Pursuing Subgroup Treatment Effects
Comments: 24 pages
Subjects: Methodology (stat.ME)
Regression discontinuity design (RDD) is widely adopted for causal inference under intervention determined by a continuous variable. While one is interested in treatment effect heterogeneity by subgroups in many applications, RDD typically suffers from small subgroup-wise sample sizes, which makes the estimation results highly unstable. To solve this issue, we introduce hierarchical RDD (HRDD), a hierarchical Bayes approach for pursuing treatment effect heterogeneity in RDD. A key feature of HRDD is to employ a pseudo-model based on a loss function to estimate subgroup-level parameters of treatment effects under RDD, and to assign a hierarchical prior distribution to "borrow strength" from other subgroups. The posterior computation can be easily done by a simple Gibbs sampling, and the optimal bandwidth can be automatically selected by the Hyvärinen scores for unnormalized models. We demonstrate the proposed HRDD through simulation and real data analysis, and show that HRDD provides much more stable point and interval estimation than separately applying the standard RDD method to each subgroup.
 [87] arXiv:2309.15001 (replaced) [pdf, html, other]

Title: Convergence guarantees for forward gradient descent in the linear regression model
Comments: 17 pages
Journal-ref: Journal of Statistical Planning and Inference, Volume 233, 106174, 2024
Subjects: Statistics Theory (math.ST); Neural and Evolutionary Computing (cs.NE)
Renewed interest in the relationship between artificial and biological neural networks motivates the study of gradient-free methods. Considering the linear regression model with random design, we theoretically analyze in this work the biologically motivated (weight-perturbed) forward gradient scheme that is based on a random linear combination of the gradient. If $d$ denotes the number of parameters and $k$ the number of samples, we prove that the mean squared error of this method converges for $k \gtrsim d^2\log(d)$ with rate $d^2\log(d)/k$. Compared to the dimension dependence $d$ for stochastic gradient descent, an additional factor $d\log(d)$ occurs.
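A minimal simulation of the scheme under the stated design: the forward gradient $(\nabla L \cdot v)\,v$ with $v \sim N(0, I_d)$ is an unbiased substitute for the gradient since $\mathbb{E}[vv^\top] = I$; the learning rate and sample sizes below are illustrative, not the paper's constants:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lr = 10, 5000, 1e-3
theta_star = rng.standard_normal(d)   # true regression vector
theta = np.zeros(d)

for _ in range(k):
    x = rng.standard_normal(d)                  # random design
    y = x @ theta_star + rng.standard_normal()  # noisy response
    grad = 2 * (x @ theta - y) * x              # per-sample gradient
    v = rng.standard_normal(d)                  # random perturbation
    theta -= lr * (grad @ v) * v                # forward gradient step

print(np.mean((theta - theta_star) ** 2))      # mean squared error
```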
 [88] arXiv:2310.04924 (replaced) [pdf, html, other]

Title: Markov Chain Monte Carlo Significance Tests
Comments: 20 pages, 7 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Monte Carlo significance tests are a general tool that produce p-values by generating samples from the null distribution. However, Monte Carlo tests are limited to null hypotheses from which we can sample exactly. Markov chain Monte Carlo (MCMC) significance tests are a way to produce statistically valid p-values for null hypotheses from which we can only sample approximately. These methods were first introduced by Besag and Clifford in 1989 and make no assumptions on the mixing time of the MCMC procedure. Here we review the two methods of Besag and Clifford and introduce a new method that unifies the existing procedures. We use simple examples to highlight the difference between MCMC significance tests and standard Monte Carlo tests based on exact sampling. We also survey a range of contemporary applications in the literature including goodness-of-fit testing for the Rasch model, tests for detecting gerrymandering [8] and a permutation-based test of conditional independence [3].
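A sketch of one of the two Besag-Clifford constructions (the parallel, or "star", variant): run the chain m steps from the observed data to a hub state, then run m steps forward several times to obtain copies exchangeable with the data under the null. The kernel and statistic are user-supplied placeholders:

```python
import numpy as np

def bc_parallel_pvalue(x_obs, kernel, stat, m, n_copies, rng):
    """Besag-Clifford parallel MCMC test (sketch). kernel(x, rng) is
    one step of a reversible chain with the null as stationary law;
    by reversibility, running m steps from x_obs simulates a backward
    run, and the forward copies are exchangeable with x_obs."""
    hub = x_obs
    for _ in range(m):
        hub = kernel(hub, rng)          # "backward" run to the hub
    stats = []
    for _ in range(n_copies):
        x = hub
        for _ in range(m):
            x = kernel(x, rng)          # forward run to a copy
        stats.append(stat(x))
    t_obs = stat(x_obs)
    # valid p-value: rank of the observed statistic among the copies
    return (1 + sum(t >= t_obs for t in stats)) / (1 + n_copies)
```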
 [89] arXiv:2310.05781 (replaced) [pdf, html, other]

Title: On variational inference and maximum likelihood estimation with the $\lambda$-exponential family
Subjects: Statistics Theory (math.ST)
The $\lambda$-exponential family has recently been proposed to generalize the exponential family. While the exponential family is well understood and widely used, this is not the case for the $\lambda$-exponential family. However, many applications require models that are more general than the exponential family. In this work, we propose a theoretical and algorithmic framework to solve variational inference and maximum likelihood estimation problems over the $\lambda$-exponential family. We give new sufficient optimality conditions for variational inference problems. Our conditions take the form of generalized moment-matching conditions and generalize existing similar results for the exponential family. We exhibit novel characterizations of the solutions of maximum likelihood estimation problems, which recover optimality conditions in the case of the exponential family. For the resolution of both problems, we propose novel proximal-like algorithms that exploit the geometry underlying the $\lambda$-exponential family. These new theoretical and methodological insights are tested on numerical examples, showcasing their usefulness and interest, especially for heavy-tailed target distributions.
 [90] arXiv:2310.09766 (replaced) [pdf, html, other]

Title: Pseudo-Bayesian Optimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian Optimization is a popular approach for optimizing expensive black-box functions. Its key idea is to use a surrogate model to approximate the objective and, importantly, quantify the associated uncertainty, which allows a sequential search for query points that balance exploitation and exploration. The Gaussian process (GP) has been a primary candidate for the surrogate model, thanks to its Bayesian-principled uncertainty quantification power and modeling flexibility. However, its challenges have also spurred an array of alternatives whose convergence properties could be more opaque. Motivated by these, we study in this paper an axiomatic framework that elicits the minimal requirements to guarantee black-box optimization convergence and that could apply beyond GP-based methods. Moreover, we leverage the design freedom in our framework, which we call Pseudo-Bayesian Optimization, to construct empirically superior algorithms. In particular, we show how using simple local regression, and a suitable "randomized prior" construction to quantify uncertainty, not only guarantees convergence but also consistently outperforms state-of-the-art benchmarks in examples ranging from high-dimensional synthetic experiments to realistic hyperparameter tuning and robotic applications.
 [91] arXiv:2310.17467 (replaced) [pdf, html, other]

Title: The statistical thermodynamics of generative diffusion models: Phase transitions, symmetry breaking and critical instability
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Generative diffusion models have achieved spectacular performance in many areas of machine learning and generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, variational inference and stochastic calculus, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We show that these phase transitions are always in a mean-field universality class, as they are the result of a self-consistency condition in the generative dynamics. We argue that the critical instability that arises from the phase transitions lies at the heart of their generative capabilities, which are characterized by a set of mean-field critical exponents. Finally, we show that the dynamic equation of the generative process can be interpreted as a stochastic adiabatic transformation that minimizes the free energy while keeping the system in thermal equilibrium.
 [92] arXiv:2310.17806 (replaced) [pdf, html, other]

Title: Transporting treatment effects from difference-in-differences studies
Subjects: Methodology (stat.ME)
Difference-in-differences (DID) is a popular approach to identify the causal effects of treatments and policies in the presence of unmeasured confounding. DID identifies the sample average treatment effect in the treated (SATT). However, a goal of such research is often to inform decision-making in target populations outside the treated sample. Transportability methods have been developed to extend inferences from study samples to external target populations; these methods have primarily been developed and applied in settings where identification is based on conditional independence between the treatment and potential outcomes, such as in a randomized trial. We present a novel approach to identifying and estimating effects in a target population, based on DID conducted in a study sample that differs from the target population. We present a range of assumptions under which one may identify causal effects in the target population and employ causal diagrams to illustrate these assumptions. In most realistic settings, results depend critically on the assumption that any unmeasured confounders are not effect measure modifiers on the scale of the effect of interest (e.g., risk difference, odds ratio). We develop several estimators of transported effects, including g-computation, inverse odds weighting, and a doubly robust estimator based on the efficient influence function. Simulation results support theoretical properties of the proposed estimators. As an example, we apply our approach to study the effects of a 2018 US federal smoke-free public housing law on air quality in public housing across the US, using data from a DID study conducted in New York City alone.
 [93] arXiv:2310.20088 (replaced) [pdf, html, other]

Title: Functional Principal Component Analysis for Distribution-Valued Processes
Subjects: Methodology (stat.ME)
We develop statistical models for samples of distribution-valued stochastic processes featuring time-indexed univariate distributions, with emphasis on functional principal component analysis. The proposed model presents an intrinsic rather than transformation-based approach. The starting point is a transport process representation for distribution-valued processes under the Wasserstein metric. Substituting transports for distributions addresses the challenge of centering distribution-valued processes and leads to a useful and interpretable decomposition of each realized process into a process-specific single transport and a real-valued trajectory. This representation makes it possible to utilize a scalar multiplication operation for transports and facilitates not only functional principal component analysis but also the introduction of a latent Gaussian process. This Gaussian process proves especially useful for the case where the distribution-valued processes are only observed on a sparse grid of time points, establishing an approach for longitudinal distribution-valued data. We study the convergence of the key components of this novel representation to their population targets and demonstrate the practical utility of the proposed approach through simulations and several data illustrations.
 [94] arXiv:2311.08636 (replaced) [pdf, html, other]

Title: Supervised low-rank semi-nonnegative matrix factorization with frequency regularization for forecasting spatiotemporal data
Comments: 35 pages, Final version
Journal-ref: Journal of Scientific Computing (2024)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We propose a novel methodology for forecasting spatiotemporal data using supervised semi-nonnegative matrix factorization (SSNMF) with frequency regularization. Matrix factorization is employed to decompose spatiotemporal data into spatial and temporal components. To improve clarity in the temporal patterns, we introduce a nonnegativity constraint on the time domain along with regularization in the frequency domain. Specifically, regularization in the frequency domain involves selecting features in the frequency space, making an interpretation in the frequency domain more convenient. We propose two methods in the frequency domain: soft and hard regularizations, and provide convergence guarantees to first-order stationary points of the corresponding constrained optimization problem. While our primary motivation stems from geophysical data analysis based on GRACE (Gravity Recovery and Climate Experiment) data, our methodology has the potential for wider application. Consequently, when applying our methodology to GRACE data, we find that the results with the proposed methodology are comparable to previous research in the field of geophysical sciences but offer clearer interpretability.
 [95] arXiv:2311.11900 (replaced) [pdf, html, other]

Title: Measuring and Mitigating Biases in Motor Insurance Pricing
Subjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
The non-life insurance sector operates within a highly competitive and tightly regulated framework, confronting a pivotal juncture in the formulation of pricing strategies. Insurers are compelled to harness a range of statistical methodologies and available data to construct optimal pricing structures that align with the overarching corporate strategy while accommodating the dynamics of market competition. Given the fundamental societal role played by insurance, premium rates are subject to rigorous scrutiny by regulatory authorities. These rates must conform to principles of transparency, explainability, and ethical considerations. Consequently, the act of pricing transcends mere statistical calculations and carries the weight of strategic and societal factors. These multifaceted concerns may drive insurers to establish equitable premiums, taking into account various variables. For instance, regulations mandate the provision of equitable premiums, considering factors such as policyholder gender or mutualist group dynamics in accordance with respective corporate strategies. Age-based premium fairness is also mandated. In certain insurance domains, variables such as the presence of serious illnesses or disabilities are emerging as new dimensions for evaluating fairness. Regardless of the motivating factor prompting an insurer to adopt fairer pricing strategies for a specific variable, the insurer must possess the capability to define, measure, and ultimately mitigate any ethical biases inherent in its pricing practices while upholding standards of consistency and performance. This study seeks to provide a comprehensive set of tools for these endeavors and assess their effectiveness through practical application in the context of automobile insurance.
 [96] arXiv:2312.16260 (replaced) [pdf, html, other]

Title: Multinomial Link Models
Comments: 39 pages, 5 figures
Subjects: Methodology (stat.ME)
We propose a unified multinomial link model for analyzing categorical responses. It not only covers the existing multinomial logistic models and their extensions as special cases, but also includes new models that can incorporate the observations with NA or Unknown responses in the data analysis. We provide explicit formulae and detailed algorithms for finding the maximum likelihood estimates of the model parameters and computing the Fisher information matrix. Our algorithms solve the infeasibility issue of existing statistical software on estimating parameters of cumulative link models. The applications to real datasets show that the new models can fit the data significantly better, and the corresponding data analysis may correct the misleading conclusions due to missing responses.
 [97] arXiv:2402.05271 (replaced) [pdf, html, other]

Title: Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Understanding the mechanisms through which neural networks extract statistics from input-label pairs through feature learning is one of the most important unsolved problems in supervised learning. Prior works demonstrated that the Gram matrices of the weights (the neural feature matrices, NFM) and the average gradient outer products (AGOP) become correlated during training, in a statement known as the neural feature ansatz (NFA). Through the NFA, the authors introduce mapping with the AGOP as a general mechanism for neural feature learning. However, these works do not provide a theoretical explanation for this correlation or its origins. In this work, we further clarify the nature of this correlation, and explain its emergence. We show that this correlation is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent features at each layer. We further establish that the alignment is driven by the interaction of weight changes induced by SGD with the pre-activation features, and analyze the resulting dynamics analytically at early times in terms of simple statistics of the inputs and labels. Finally, motivated by the observation that the NFA is driven by this centered correlation, we introduce a simple optimization rule that dramatically increases the NFA correlations at any given layer and improves the quality of features learned.
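The central quantity can be computed directly: a sketch comparing a layer's neural feature matrix $W^\top W$ with the AGOP via cosine similarity of the flattened matrices. The shapes and the source of the per-example gradients are illustrative assumptions:

```python
import numpy as np

def nfa_correlation(W, grads):
    """Correlation between the neural feature matrix W^T W and the
    average gradient outer product (AGOP) -- the quantity the neural
    feature ansatz says grows during training. W is a layer's weight
    matrix (out x in); grads stacks per-example gradients of the
    network output with respect to that layer's input (each of
    length `in`)."""
    nfm = W.T @ W
    agop = sum(np.outer(g, g) for g in grads) / len(grads)
    a, b = nfm.ravel(), agop.ravel()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```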
 [98] arXiv:2402.07445 (replaced) [pdf, html, other]

Title: Top-$K$ ranking with a monotone adversary
Comments: Accepted to Conference on Learning Theory, 2024
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
In this paper, we address the top-$K$ ranking problem with a monotone adversary. We consider the scenario where a comparison graph is randomly generated and the adversary is allowed to add arbitrary edges. The statistician's goal is then to accurately identify the top-$K$ preferred items based on pairwise comparisons derived from this semi-random comparison graph. The main contribution of this paper is to develop a weighted maximum likelihood estimator (MLE) that achieves near-optimal sample complexity, up to a $\log^2(n)$ factor, where $n$ denotes the number of items under comparison. This is made possible through a combination of analytical and algorithmic innovations. On the analytical front, we provide a refined $\ell_\infty$ error analysis of the weighted MLE that is more explicit and tighter than existing analyses. It relates the $\ell_\infty$ error with the spectral properties of the weighted comparison graph. Motivated by this, our algorithmic innovation involves the development of an SDP-based approach to reweight the semi-random graph and meet specified spectral properties. Additionally, we propose a first-order method based on the Matrix Multiplicative Weight Update (MMWU) framework. This method efficiently solves the resulting SDP in nearly linear time relative to the size of the semi-random comparison graph.
 [99] arXiv:2402.10418 (replaced) [pdf, html, other]

Title: A Distributionally Robust Estimator that Dominates the Empirical Average
Subjects: Statistics Theory (math.ST)
We leverage the duality between risk-averse and distributionally robust optimization (DRO) to devise a distributionally robust estimator that strictly outperforms the empirical average for all probability distributions with negative excess kurtosis. The aforesaid estimator solves the $\chi^2$-robust mean squared error problem in closed form.
 [100] arXiv:2403.03589 (replaced) [pdf, html, other]

Title: Active Adaptive Experimental Design for Treatment Effect Estimation with Covariate Choices
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
This study designs an adaptive experiment for efficiently estimating average treatment effects (ATEs). In each round of our adaptive experiment, an experimenter sequentially samples an experimental unit, assigns a treatment, and observes the corresponding outcome immediately. At the end of the experiment, the experimenter estimates an ATE using the gathered samples. The objective is to estimate the ATE with a smaller asymptotic variance. Existing studies have designed experiments that adaptively optimize the propensity score (treatment-assignment probability). As a generalization of such an approach, we propose optimizing the covariate density as well as the propensity score. First, we derive the efficient covariate density and propensity score that minimize the semiparametric efficiency bound and find that optimizing both the covariate density and the propensity score minimizes the semiparametric efficiency bound more effectively than optimizing only the propensity score. Next, we design an adaptive experiment using the efficient covariate density and propensity score sequentially estimated during the experiment. Lastly, we propose an ATE estimator whose asymptotic variance aligns with the minimized semiparametric efficiency bound.
 [101] arXiv:2403.09956 (replaced) [pdf, html, other]

Title: On the distribution of isometric log-ratio transformations under extra-multinomial count data
Subjects: Methodology (stat.ME)
Compositional data arise when count observations are normalised into proportions adding up to unity. To allow use of standard statistical methods, compositional proportions can be mapped from the simplex into the Euclidean space through the isometric log-ratio (ilr) transformation. When the counts follow a multinomial distribution with fixed class-specific probabilities, the distribution of the ensuing ilr coordinates has been shown to be asymptotically multivariate normal. We here derive an asymptotic normal approximation to the distribution of the ilr coordinates when the counts show overdispersion under the Dirichlet-multinomial mixture model. Using a simulation study, we then investigate the practical applicability of the approximation against the empirical distribution of the ilr coordinates under varying levels of extra-multinomial variation and the total count. The approximation works well, except with a small total count or a high amount of overdispersion. These empirical results remain even under population-level heterogeneity in the total count. Our work is motivated by microbiome data, which often exhibit considerable extra-multinomial variation and are increasingly treated as compositional through scaling taxon-specific counts into proportions. We conclude that if the analysis of empirical data relies on normality of the ilr coordinates, it may be advisable to choose a taxonomic level where counts are less sparse so that the distribution of taxon-specific class probabilities remains unimodal.
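For concreteness, a sketch of the ilr map itself, using one standard orthonormal (Helmert-type) basis of the clr hyperplane; the basis choice is an assumption for illustration, since any orthonormal basis of that hyperplane defines an ilr transformation:

```python
import numpy as np

def ilr(p):
    """Isometric log-ratio transform of a composition p (positive
    entries summing to one): centred log-ratio followed by projection
    onto an orthonormal basis orthogonal to (1, ..., 1)."""
    D = len(p)
    clr = np.log(p) - np.mean(np.log(p))
    V = np.zeros((D - 1, D))
    for i in range(1, D):
        V[i - 1, :i] = 1.0 / i
        V[i - 1, i] = -1.0
        V[i - 1] *= np.sqrt(i / (i + 1.0))  # normalise each basis row
    return V @ clr

# Toy usage: counts scaled into proportions, then mapped to R^{D-1}
counts = np.array([12, 30, 7, 51])
coords = ilr(counts / counts.sum())
```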
 [102] arXiv:2405.11547 (replaced) [pdf, html, other]

Title: Certified Robust Accuracy of Neural Networks Are Bounded due to Bayes Errors
Comments: accepted by CAV 2024
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Adversarial examples pose a security threat to many critical systems built on neural networks. While certified training improves robustness, it also decreases accuracy noticeably. Despite various proposals for addressing this issue, the significant accuracy drop remains. More importantly, it is not clear whether there is a certain fundamental limit on achieving robustness whilst maintaining accuracy. In this work, we offer a novel perspective based on Bayes errors. By adopting Bayes error to robustness analysis, we investigate the limit of certified robust accuracy, taking into account data distribution uncertainties. We first show that the accuracy inevitably decreases in the pursuit of robustness due to changed Bayes error in the altered data distribution. Subsequently, we establish an upper bound for certified robust accuracy, considering the distribution of individual classes and their boundaries. Our theoretical results are empirically evaluated on real-world datasets and are shown to be consistent with the limited success of existing certified training results, e.g., for CIFAR-10, our analysis results in an upper bound (of certified robust accuracy) of 67.49\%, meanwhile existing approaches are only able to increase it from 53.89\% in 2017 to 62.84\% in 2023.
 [103] arXiv:2405.18051 (replaced) [pdf, other]

Title: Predicting Progression Events in Multiple Myeloma from Routine Blood Work
Maximilian Ferle, Nora Grieb, Markus Kreuz, Uwe Platzbecker, Thomas Neumuth, Kristin Reiche, Alexander Oeser, Maximilian Merz
Comments: 18 pages, 8 figures, 4 tables
Subjects: Applications (stat.AP); Quantitative Methods (q-bio.QM)
The ability to accurately predict disease progression is paramount for optimizing multiple myeloma patient care. This study introduces a hybrid neural network architecture, combining Long Short-Term Memory networks with a Conditional Restricted Boltzmann Machine, to predict future blood work of affected patients from a series of historical laboratory results. We demonstrate that our model can replicate the statistical moments of the time series ($0.95~\pm~0.01~\geq~R^2~\geq~0.83~\pm~0.03$) and forecast future blood work features with high correlation to actual patient data ($0.92\pm0.02~\geq~r~\geq~0.52~\pm~0.09$). Subsequently, a second Long Short-Term Memory network is employed to detect and annotate disease progression events within the forecasted blood work time series. We show that these annotations enable the prediction of progression events with significant reliability (AUROC$~=~0.88~\pm~0.01$), up to 12 months in advance (AUROC($t+12~$mos)$~=0.65~\pm~0.01$). Our system is designed in a modular fashion, featuring separate entities for forecasting and progression event annotation. This structure not only enhances interpretability but also facilitates the integration of additional modules to perform subsequent operations on the generated outputs. Our approach utilizes a minimal set of routine blood work measurements, which avoids the need for expensive or resource-intensive tests and ensures accessibility of the system in clinical routine. This capability allows for individualized risk assessment and making informed treatment decisions tailored to a patient's unique disease kinetics. The presented approach contributes to the development of a scalable and cost-effective virtual human twin system for optimized healthcare resource utilization and improved patient outcomes in multiple myeloma care.
 [104] arXiv:2405.18055 (replaced) [pdf, html, other]

Title: Dimension-free uniform concentration bound for logistic regression
Comments: 26 pages; revised introduction
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
We provide a novel dimension-free uniform concentration bound for the empirical risk function of constrained logistic regression. Our bound yields a milder sufficient condition for a uniform law of large numbers than conditions derived by the Rademacher complexity argument and McDiarmid's inequality. The derivation is based on the PAC-Bayes approach with second-order expansion and Rademacher-complexity-based bounds for the residual term of the expansion.
 [105] arXiv:2406.01268 (replaced) [pdf, html, other]

Title: Integral Probability Metrics on submanifolds: interpolation inequalities and optimal inference
Subjects: Statistics Theory (math.ST)
We study interpolation inequalities between Hölder Integral Probability Metrics (IPMs) in the case where the measures have densities on closed submanifolds. Precisely, it is shown that if two probability measures $\mu$ and $\mu^\star$ have $\beta$-smooth densities with respect to the volume measure of some submanifolds $\mathcal{M}$ and $\mathcal{M}^\star$ respectively, then the Hölder IPMs $d_{\mathcal{H}^\gamma_1}$ of smoothness $\gamma\geq 1$ and $d_{\mathcal{H}^\eta_1}$ of smoothness $\eta>\gamma$ satisfy $d_{ \mathcal{H}_1^{\gamma}}(\mu,\mu^\star)\lesssim d_{ \mathcal{H}_1^{\eta}}(\mu,\mu^\star)^\frac{\beta+\gamma}{\beta+\eta}$, up to logarithmic factors. We provide an application of this result to high-dimensional inference. These functional inequalities turn out to be a key tool for density estimation on an unknown submanifold. In particular, they allow us to build the first estimator attaining optimal rates of estimation for all the distances $d_{\mathcal{H}_1^\gamma}$, $\gamma \in [1,\infty)$ simultaneously.
 [106] arXiv:2406.05714 (replaced) [pdf, html, other]

Title: Contextual Continuum Bandits: Static Versus Dynamic Regret
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We study the contextual continuum bandits problem, where the learner sequentially receives a side information vector and has to choose an action in a convex set, minimizing a function associated to the context. The goal is to minimize all the underlying functions for the received contexts, leading to a dynamic (contextual) notion of regret, which is stronger than the standard static regret. Assuming that the objective functions are Hölder with respect to the contexts, we demonstrate that any algorithm achieving a sublinear static regret can be extended to achieve a sublinear dynamic regret. We further study the case of strongly convex and smooth functions when the observations are noisy. Inspired by the interior point method and employing selfconcordant barriers, we propose an algorithm achieving a sublinear dynamic regret. Lastly, we present a minimax lower bound, implying two key facts. First, no algorithm can achieve sublinear dynamic regret over functions that are not continuous with respect to the context. Second, for strongly convex and smooth functions, the algorithm that we propose achieves, up to a logarithmic factor, the minimax optimal rate of dynamic regret as a function of the number of queries.
 [107] arXiv:2406.07804 (replaced) [pdf, html, other]

Title: The maximum likelihood type estimator of SDEs with fractional Brownian motion under small noise asymptotics in the rough case
Subjects: Statistics Theory (math.ST); Probability (math.PR)
We study the problem of parametric estimation for a continuously observed stochastic differential equation driven by fractional Brownian motion. Under some assumptions on the drift and diffusion coefficients, we construct a maximum likelihood type estimator and establish its asymptotic normality and moment convergence for the drift parameter as the small dispersion coefficient vanishes.
 [108] arXiv:2406.09051 (replaced) [pdf, other]

Title: Bayesian Structural Model Updating with Multimodal Variational Autoencoder
Comments: 44 pages, 21 figures
Journal-ref: Computer Methods in Applied Mechanics and Engineering, Volume 429, 1 September 2024, 117148
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
A novel framework for Bayesian structural model updating is presented in this study. The proposed method utilizes the surrogate unimodal encoders of a multimodal variational autoencoder (VAE). The method facilitates an approximation of the likelihood when dealing with a small number of observations. It is particularly suitable for high-dimensional correlated simultaneous observations applicable to various dynamic analysis models. The proposed approach was benchmarked using a numerical model of a single-story frame building with acceleration and dynamic strain measurements. Additionally, an example involving a Bayesian update of nonlinear model parameters for a three-degree-of-freedom lumped mass model demonstrates computational efficiency when compared to using the original VAE, while maintaining adequate accuracy for practical applications.
 [109] arXiv:2406.10473 (replaced) [pdf, html, other]

Title: Design-based variance estimation of the H\'ajek effect estimator in stratified and clustered experiments
Subjects: Methodology (stat.ME)
Randomized controlled trials (RCTs) are used to evaluate treatment effects. When individuals are grouped together, clustered RCTs are conducted. Stratification is recommended to reduce imbalance of baseline covariates between treatment and control. In practice, this can lead to comparisons between clusters of very different sizes. As a result, direct adjustment estimators that average differences of means within the strata may be inconsistent. We study differences of inverse probability weighted means of a treatment and a control group (Hájek effect estimators) under two common forms of stratification: small strata that increase in number, or larger strata with growing numbers of clusters in each. Under either scenario, mild conditions give consistency and asymptotic Normality. We propose a variance estimator applicable to designs with any number of strata and strata of any size. We describe a special use of the variance estimator that improves small sample performance of Wald-type confidence intervals. The Hájek effect estimator lends itself to covariance adjustment, and our variance estimator remains applicable. Simulations and real-world applications in children's nutrition and education confirm favorable operating characteristics, demonstrating advantages of the Hájek effect estimator beyond its simplicity and ease of use.
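A minimal sketch of the Hájek effect estimator in the unstratified case, working with cluster-level data; stratification and covariance adjustment are omitted, and the variable names are illustrative:

```python
import numpy as np

def hajek_effect(y, z, pi):
    """Hájek effect estimator: difference of inverse-probability-
    weighted means of treated (z = 1) and control (z = 0) outcomes,
    with the weights normalised within each arm. y, z, pi are
    cluster-level outcomes, treatment assignments, and assignment
    probabilities."""
    w1 = z / pi
    w0 = (1 - z) / (1 - pi)
    return (w1 @ y) / w1.sum() - (w0 @ y) / w0.sum()
```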
 [110] arXiv:2406.10703 (replaced) [pdf, html, other]

Title: Calibrating Neural Networks' parameters through Optimal Contraction in a Prediction Problem
Comments: 14 pages, 2 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
This study introduces a novel approach to ensure the existence and uniqueness of optimal parameters in neural networks. The paper details how a recurrent neural network (RNN) can be transformed into a contraction in a domain where its parameters are linear. It then demonstrates that a prediction problem modeled through an RNN, with a specific regularization term in the loss function, can have its first-order conditions expressed analytically. This system of equations is reduced to two matrix equations involving Sylvester equations, which can be partially solved. We establish that, if certain conditions are met, optimal parameters exist, are unique, and can be found through a straightforward algorithm to any desired precision. Moreover, as the number of neurons grows, the convergence conditions become easier to fulfill. Feedforward neural networks (FNNs) are also explored by including linear constraints on parameters. According to our model, incorporating loops (with fixed or variable weights) produces loss functions that are easier to train, because it assures the existence of a region where an iterative method converges.
 [111] arXiv:2406.12763 (replaced) [pdf, html, other]

Title: Implicit Bias of Mirror Flow on Separable Data
Comments: Exact same text as first version but the acknowledgments section is updated
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
We examine the continuous-time counterpart of mirror descent, namely mirror flow, on classification problems which are linearly separable. Such problems are minimised `at infinity' and have many possible solutions; we study which solution is preferred by the algorithm depending on the mirror potential. For exponentially tailed losses and under mild assumptions on the potential, we show that the iterates converge in direction towards a $\phi_\infty$-maximum margin classifier. The function $\phi_\infty$ is the $\textit{horizon function}$ of the mirror potential and characterises its shape `at infinity'. When the potential is separable, a simple formula allows one to compute this function. We analyse several examples of potentials and provide numerical experiments highlighting our results.
 [112] arXiv:1808.05671 (replaced) [pdf, html, other]

Title: On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization
Comments: 25 pages, 2 tables. Published in Transactions on Machine Learning Research (TMLR)
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Adaptive gradient methods are workhorses in deep learning. However, the convergence guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly studied. In this paper, we provide a fine-grained convergence analysis for a general class of adaptive gradient methods including AMSGrad, RMSProp and AdaGrad. For smooth nonconvex functions, we prove that adaptive gradient methods in expectation converge to a first-order stationary point. Our convergence rate is better than existing results for adaptive gradient methods in terms of dimension. In addition, we also prove high probability bounds on the convergence rates of AMSGrad, RMSProp as well as AdaGrad, which have not been established before. Our analyses shed light on the mechanism behind adaptive gradient methods in optimizing nonconvex objectives.
 [113] arXiv:2009.04544 (replaced) [pdf, html, other]

Title: Self-Adaptive Physics-Informed Neural Networks using a Soft Attention Mechanism
Comments: 24 pages, 17 figures. Published in journal form as "Self-Adaptive Physics-Informed Neural Networks"
Journal-ref: Journal of Computational Physics (2023), Vol. 474, 111722
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Physics-Informed Neural Networks (PINNs) have emerged recently as a promising application of deep neural networks to the numerical solution of nonlinear partial differential equations (PDEs). However, it has been recognized that adaptive procedures are needed to force the neural network to fit accurately the stubborn spots in the solution of "stiff" PDEs. In this paper, we propose a fundamentally new way to train PINNs adaptively, where the adaptation weights are fully trainable and applied to each training point individually, so the neural network learns autonomously which regions of the solution are difficult and is forced to focus on them. The self-adaptation weights specify a soft multiplicative attention mask, which is reminiscent of similar mechanisms used in computer vision. The basic idea behind these SA-PINNs is to make the weights increase as the corresponding losses increase, which is accomplished by training the network to simultaneously minimize the losses and maximize the weights. In addition, we show how to build a continuous map of self-adaptive weights using Gaussian Process regression, which allows the use of stochastic gradient descent in problems where conventional gradient descent is not enough to produce accurate solutions. Finally, we derive the Neural Tangent Kernel matrix for SA-PINNs and use it to obtain a heuristic understanding of the effect of the self-adaptive weights on the dynamics of training in the limiting case of infinitely wide PINNs, which suggests that SA-PINNs work by producing a smooth equalization of the eigenvalues of the NTK matrix corresponding to the different loss terms. In numerical experiments with several linear and nonlinear benchmark problems, the SA-PINN outperformed other state-of-the-art PINN algorithms in L2 error, while using a smaller number of training epochs.
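A hedged sketch of one self-adaptive update with mask $m(\lambda) = \lambda^2$: the network parameters descend the weighted loss while the per-point weights ascend it, so stubborn points gain influence. The residual and gradient callables are placeholders, not the paper's implementation:

```python
import numpy as np

def sa_pinn_step(theta, lam, residuals, grad_loss, lr_theta, lr_lam):
    """One self-adaptive training step (sketch): theta *descends* the
    weighted loss sum_i m(lam_i) * r_i(theta)**2 with m(lam) = lam**2,
    while the per-point weights lam *ascend* it. residuals(theta) and
    grad_loss(theta, weights) are placeholders for the PDE residuals
    and the gradient of the weighted squared-residual loss."""
    r = residuals(theta)
    theta = theta - lr_theta * grad_loss(theta, lam ** 2)
    lam = lam + lr_lam * 2.0 * lam * r ** 2  # ascent: d/dlam [lam^2 r^2]
    return theta, lam
```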
 [114] arXiv:2206.09642 (replaced) [pdf, html, other]

Title: Beyond IID: data-driven decision-making in heterogeneous environments
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
How should one leverage historical data when past observations are not perfectly indicative of the future, e.g., due to the presence of unobserved confounders which one cannot "correct" for? Motivated by this question, we study a data-driven decision-making framework in which historical samples are generated from unknown and different distributions assumed to lie in a heterogeneity ball with known radius and centered around the (also) unknown future (out-of-sample) distribution on which the performance of a decision will be evaluated. This work aims at analyzing the performance of central data-driven policies but also near-optimal ones in these heterogeneous environments and understanding key drivers of performance. We establish a first result which allows us to upper bound the asymptotic worst-case regret of a broad class of policies. Leveraging this result, for any integral probability metric, we provide a general analysis of the performance achieved by Sample Average Approximation (SAA) as a function of the radius of the heterogeneity ball. This analysis is centered around the approximation parameter, a notion of complexity we introduce to capture how the interplay between the heterogeneity and the problem structure impacts the performance of SAA. In turn, we illustrate through several widely studied problems (e.g., newsvendor, pricing) how this methodology can be applied, and find that the performance of SAA varies considerably depending on the combinations of problem classes and heterogeneity. The failure of SAA for certain instances motivates the design of alternative policies to achieve rate-optimality. We derive problem-dependent policies achieving strong guarantees for the illustrative problems described above and provide initial results towards a principled approach for the design and analysis of general rate-optimal algorithms.
 [115] arXiv:2210.00898 (replaced) [pdf, html, other]

Title: Robust $Q$-learning Algorithm for Markov Decision Processes under Wasserstein Uncertainty
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
We present a novel $Q$-learning algorithm tailored to solve distributionally robust Markov decision problems where the corresponding ambiguity set of transition probabilities for the underlying Markov decision process is a Wasserstein ball around a (possibly estimated) reference measure. We prove convergence of the presented algorithm and provide several examples, also using real data, to illustrate both the tractability of our algorithm and the benefits of considering distributional robustness when solving stochastic optimal control problems, in particular when the estimated distributions turn out to be misspecified in practice.
 [116] arXiv:2301.13006 (replaced) [pdf, html, other]

Title: Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
Efficient computation of the optimal transport distance between two distributions serves as an algorithmic subroutine that empowers various applications. This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy with runtime $\widetilde{O}(n^2/\varepsilon)$, where $n$ denotes the dimension of the probability distributions of interest. Our algorithm achieves the state-of-the-art computational guarantees among all first-order methods, while exhibiting favorable numerical performance compared to classical algorithms like Sinkhorn and Greenkhorn. Underlying our algorithm designs are two key elements: (a) converting the original problem into a bilinear minimax problem over probability distributions; (b) exploiting the extragradient idea, in conjunction with entropy regularization and adaptive learning rates, to accelerate convergence.
 [117] arXiv:2302.02224 (replaced) [pdf, html, other]

Title: TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Modality
Comments: Accepted to TMLR
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper addresses a cross-modal learning framework, where the objective is to enhance the performance of supervised learning in the primary modality using an unlabeled, unpaired secondary modality. Taking a probabilistic approach to missing information estimation, we show that the extra information contained in the secondary modality can be estimated via Nadaraya-Watson (NW) kernel regression, which can further be expressed as a kernelized cross-attention module (under linear transformation). This expression lays the foundation for introducing The Attention Patch (TAP), a simple neural network add-on that can be trained to allow data-level knowledge transfer from the unlabeled modality. We provide extensive numerical simulations using real-world datasets to show that TAP can provide statistically significant improvement in generalization across different domains and different neural network architectures, making use of seemingly unusable unlabeled cross-modal data.
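A sketch of Nadaraya-Watson kernel regression written as kernelized cross-attention, the estimator underlying TAP; the Gaussian kernel, bandwidth, and array shapes are illustrative assumptions:

```python
import numpy as np

def nw_cross_attention(queries, keys, values, bandwidth=1.0):
    """Nadaraya-Watson regression as cross-attention: each query
    (a primary-modality feature) attends to the secondary-modality
    keys and returns a kernel-weighted average of their values.
    queries: (nq, d), keys: (nk, d), values: (nk, dv)."""
    d2 = ((queries[:, None, :] - keys[None, :, :]) ** 2).sum(-1)
    logits = -d2 / (2 * bandwidth ** 2)
    # softmax over keys = normalised Gaussian kernel weights
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ values
```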
 [118] arXiv:2306.06247 (replaced) [pdf, html, other]

Title: Online Learning with Set-Valued Feedback
Comments: Accepted to COLT 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study a variant of online multiclass classification where the learner predicts a single label but receives a \textit{set of labels} as feedback. In this model, the learner is penalized for not outputting a label contained in the revealed set. We show that unlike online multiclass learning with single-label feedback, deterministic and randomized online learnability are \textit{not equivalent} even in the realizable setting with set-valued feedback. Accordingly, we give two new combinatorial dimensions, named the Set Littlestone and Measure Shattering dimension, that tightly characterize deterministic and randomized online learnability respectively in the realizable setting. In addition, we show that the Measure Shattering dimension characterizes online learnability in the agnostic setting and tightly quantifies the minimax regret. Finally, we use our results to establish bounds on the minimax regret for three practical learning settings: online multilabel ranking, online multilabel classification, and real-valued prediction with interval-valued response.
 [119] arXiv:2310.19064 (replaced) [pdf, html, other]

Title: Apple Tasting: Combinatorial Dimensions and Minimax Rates
Comments: 21 pages, COLT 2024 Camera Ready
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In online binary classification under \emph{apple tasting} feedback, the learner only observes the true label if it predicts "1". We revisit this classical partial-feedback setting, first studied by \cite{helmbold2000apple}, and study online learnability from a combinatorial perspective. We show that the Littlestone dimension continues to provide a tight quantitative characterization of apple tasting in the agnostic setting, closing an open question posed by \cite{helmbold2000apple}. In addition, we give a new combinatorial parameter, called the Effective width, that tightly quantifies the minimax expected mistakes in the realizable setting. As a corollary, we use the Effective width to establish a \emph{trichotomy} of the minimax expected number of mistakes in the realizable setting. In particular, we show that in the realizable setting, the expected number of mistakes of any learner, under apple tasting feedback, can be $\Theta(1)$, $\Theta(\sqrt{T})$, or $\Theta(T)$. This is in contrast to the full-information realizable setting where only $\Theta(1)$ and $\Theta(T)$ are possible.
 [120] arXiv:2311.06436 (replaced) [pdf, other]

Title: Augmented Degree Correction for Bipartite Networks with Applications to Recommender Systems
Comments: 21 pages, 4 figures
Journal-ref: Appl Netw Sci 9, 19 (2024)
Subjects: Social and Information Networks (cs.SI); Applications (stat.AP)
In recommender systems, users rate items, and are subsequently served other product recommendations based on these ratings. Even though users usually rate a tiny percentage of the available items, the system tries to estimate unobserved preferences by finding similarities across users and across items. In this work, we treat the observed ratings data as partially observed, dense, weighted, bipartite networks. For a class of systems without outside information, we adapt an approach developed for dense, weighted networks to account for unobserved edges and the bipartite nature of the problem. This approach allows for community structure, and for local estimation of flexible patterns of ratings across different pairs of communities. We compare the performance of our proposed approach to existing methods on a simulated data set, as well as on a data set of joke ratings, examining model performance in both cases at differing levels of sparsity.
 [121] arXiv:2311.17539 (replaced) [pdf, html, other]

Title: Critical Influence of Overparameterization on Sharpness-aware Minimization
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Training an overparameterized neural network can yield minimizers of different generalization capabilities despite the same level of training loss. Meanwhile, with evidence suggesting a strong correlation between the sharpness of minima and their generalization errors, increasing efforts have been made to develop optimization methods that explicitly find flat minima as more generalizable solutions. Despite its contemporary relevance to overparameterization, however, little has been studied about exactly how this sharpness-aware minimization (SAM) strategy is affected by overparameterization. Hence, in this work, we analyze SAM under overparameterization of varying degrees and present both empirical and theoretical results that indicate a critical influence of overparameterization on SAM. First, we conduct extensive numerical experiments across vision, language, graph, and reinforcement learning domains and show that SAM consistently improves with overparameterization. Next, we attribute this phenomenon to the interplay between the enlarged solution space and the increased implicit bias from overparameterization. Further, we prove multiple theoretical benefits of overparameterization for SAM, which attains (i) minima with more uniform Hessian moments compared to SGD, (ii) much faster convergence at a linear rate, and (iii) lower test error for two-layer networks. Last but not least, we discover that the effect of overparameterization is more pronounced in practical settings of label noise and sparsity, and yet sufficient regularization remains necessary.
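For readers unfamiliar with SAM, the base update analyzed here can be sketched in a few lines: perturb the weights toward the locally worst direction within a radius-$\rho$ ball, then descend using the gradient at the perturbed point. The toy one-dimensional loss and hyperparameters below are illustrative assumptions.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.05, rho=0.1):
    """One SAM-style step: ascend to the (first-order) worst point in a
    rho-ball around w, then descend using the gradient taken there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # worst-case perturbation
    return w - lr * grad_fn(w + eps)              # descend from the sharp view

# toy 1-D loss with several minima of different sharpness
loss = lambda w: np.sin(5 * w) * np.exp(-0.1 * w**2) + 0.05 * w**2
grad = lambda w: (loss(w + 1e-5) - loss(w - 1e-5)) / 2e-5  # finite differences

w = np.array([2.0])
for _ in range(300):
    w = sam_step(w, grad)
print("SAM converged near w =", float(w[0]))
```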
 [122] arXiv:2312.01541 (replaced) [pdf, html, other]

Title: Revisiting Non-separable Binary Classification and its Applications in Anomaly Detection
Comments: Accepted in Transactions on Machine Learning Research (TMLR) 2024. Code: this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The inability to linearly classify XOR has motivated much of deep learning. We revisit this age-old problem and show that linear classification of XOR is indeed possible. Instead of separating data between halfspaces, we propose a slightly different paradigm, equality separation, that adapts the SVM objective to distinguish data within or outside the margin. Our classifier can then be integrated into neural network pipelines with a smooth approximation. From its properties, we intuit that equality separation is suitable for anomaly detection. To formalize this notion, we introduce closing numbers, a quantitative measure of the capacity of classifiers to form closed decision regions for anomaly detection. Springboarding from this theoretical connection between binary classification and anomaly detection, we test our hypothesis on supervised anomaly detection experiments, showing that equality separation can detect both seen and unseen anomalies.
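The XOR claim can be verified directly: with the hyperplane $x_1 - x_2 = 0$, one class lies on the hyperplane and the other off it, so a margin strip around it equality-separates the data. A hand-constructed check (weights chosen by inspection, not learned):

```python
import numpy as np

# XOR data: no halfspace separates the classes, but the single hyperplane
# x1 - x2 = 0 *equality-separates* them: one class lies on the hyperplane,
# the other lies off it.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])  # XOR labels

w, b = np.array([1.0, -1.0]), 0.0
margin = 0.5

signed = X @ w + b
pred = (np.abs(signed) > margin).astype(int)  # outside the strip -> class 1
print(pred)                                   # [0 0 1 1], matching y
assert (pred == y).all()
```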
 [123] arXiv:2312.06071 (replaced) [pdf, html, other]

Title: Precipitation Downscaling with Spatiotemporal Video Diffusion
Prakhar Srivastava, Ruihan Yang, Gavin Kerrigan, Gideon Dresdner, Jeremy McGibbon, Christopher Bretherton, Stephan Mandt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
In climate science and meteorology, high-resolution local precipitation (rain and snowfall) predictions are limited by the computational costs of simulation-based methods. Statistical downscaling, or super-resolution, is a common workaround where a low-resolution prediction is improved using statistical approaches. Unlike traditional computer vision tasks, weather and climate applications require capturing the accurate conditional distribution of high-resolution given low-resolution patterns to assure reliable ensemble averages and unbiased estimates of extreme events, such as heavy rain. This work extends recent video diffusion models to precipitation super-resolution, employing a deterministic downscaler followed by a temporally-conditioned diffusion model to capture noise characteristics and high-frequency patterns. We test our approach on FV3GFS output, an established large-scale global atmosphere model, and compare it against six state-of-the-art baselines. Our analysis, capturing CRPS, MSE, precipitation distributions, and qualitative aspects using California and the Himalayas as examples, establishes our method as a new standard for data-driven precipitation downscaling.
 [124] arXiv:2312.17007 (replaced) [pdf, other]

Title: On the rate of convergence of an overparametrized Transformer classifier learned by gradient descent
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
One of the most recent and fascinating breakthroughs in artificial intelligence is ChatGPT, a chatbot that can simulate human conversation. ChatGPT is an instance of GPT-4, a language model based on generative pre-trained transformers. So if one wants to study, from a theoretical point of view, how powerful such artificial intelligence can be, one approach is to consider transformer networks and to study which problems one can solve with these networks theoretically. Here it is important not only what kinds of models these networks can approximate, or how well they can generalize the knowledge learned by choosing the best possible approximation to a concrete data set, but also how well optimization of such transformer networks based on a concrete data set works. In this article we consider all three of these aspects simultaneously and show a theoretical upper bound on the misclassification probability of a transformer network fitted to the observed data. For simplicity we focus in this context on transformer encoder networks which can be applied to define an estimate in the context of a classification problem involving natural language.
 [125] arXiv:2402.01865 (replaced) [pdf, html, other]

Title: What Will My Model Forget? Forecasting Forgotten Examples in Language Model Refinement
Comments: To appear at ICML 2024 (Spotlight)
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Language models deployed in the wild make errors. However, simply updating the model with the corrected error instances causes catastrophic forgetting: the updated model makes errors on instances learned during the instruction tuning or upstream training phase. Randomly replaying upstream data yields unsatisfactory performance and often comes with high variance and poor controllability. To this end, we aim to forecast upstream examples that will be forgotten due to a model update, for improved controllability of the replay process and better interpretability. We train forecasting models given a collection of online learned examples and corresponding forgotten upstream pre-training examples. We propose a partially interpretable forecasting model based on the observation that changes in pre-softmax logit scores of pre-training examples resemble those of online learned examples; it performs decently on BART but fails on T5 models. We further show that a black-box classifier based on inner products of example representations achieves better forecasting performance over a series of setups. Finally, we show that we can reduce forgetting of upstream pre-training examples by replaying examples that are forecasted to be forgotten, demonstrating the practical utility of forecasting example forgetting.
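As a rough illustration of the black-box variant described above, the sketch below scores upstream examples by representation inner products with the online-learned examples and thresholds the score; the scoring rule, threshold, and random features are assumptions for illustration, not the paper's trained classifier.

```python
import numpy as np

def forecast_forgetting(upstream_reps, online_reps, threshold):
    """Score each upstream example by its maximum inner product with the
    representations of the online-learned (error-correcting) examples,
    and flag high-scoring ones as likely to be forgotten."""
    scores = upstream_reps @ online_reps.T        # (n_upstream, n_online)
    return scores.max(axis=1) > threshold         # predicted-forgotten mask

rng = np.random.default_rng(0)
upstream = rng.normal(size=(1000, 64))   # stand-in example representations
online = rng.normal(size=(8, 64))        # representations of new updates
mask = forecast_forgetting(upstream, online, threshold=25.0)
print("examples prioritized for replay:", np.flatnonzero(mask)[:10])
```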
 [126] arXiv:2402.02720 (replaced) [pdf, other]

Title: Discounted Adaptive Online Learning: Towards Better Regularization
Comments: ICML 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study online learning in adversarial non-stationary environments. Since the future can be very different from the past, a critical challenge is to gracefully forget the history while new data comes in. To formalize this intuition, we revisit the discounted regret in online convex optimization, and propose an adaptive (i.e., instance optimal), FTRL-based algorithm that improves the widespread non-adaptive baseline: gradient descent with a constant learning rate. From a practical perspective, this refines the classical idea of regularization in lifelong learning: we show that designing good regularizers can be guided by the principled theory of adaptive online optimization.
Complementing this result, we also consider the (Gibbs and Candès, 2021)-style online conformal prediction problem, where the goal is to sequentially predict the uncertainty sets of a black-box machine learning model. We show that the FTRL nature of our algorithm can simplify the conventional gradient-descent-based analysis, leading to instance-dependent performance guarantees.
 [127] arXiv:2402.05210 (replaced) [pdf, html, other]

Title: Anatomically-Controllable Medical Image Generation with Segmentation-Guided Diffusion Models
Comments: Accepted at MICCAI 2024. Code and synthetic dataset: this https URL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Diffusion models have enabled remarkably high-quality medical image generation, yet it is challenging to enforce anatomical constraints in generated images. To this end, we propose a diffusion model-based method that supports anatomically-controllable medical image generation, by following a multi-class anatomical segmentation mask at each sampling step. We additionally introduce a random mask ablation training algorithm to enable conditioning on a selected combination of anatomical constraints while allowing flexibility in other anatomical areas. We compare our method ("SegGuidedDiff") to existing methods on breast MRI and abdominal/neck-to-pelvis CT datasets with a wide range of anatomical objects. Results show that our method reaches a new state-of-the-art in the faithfulness of generated images to input anatomical masks on both datasets, and is on par for general anatomical realism. Finally, our model also enjoys the extra benefit of being able to adjust the anatomical similarity of generated images to real images of choice through interpolation in its latent space. SegGuidedDiff has many applications, including cross-modality translation, and the generation of paired or counterfactual data. Our code is available at this https URL.
 [128] arXiv:2402.07514 (replaced) [pdf, other]

Title: Physics-informed machine learning as a kernel method
Nathan Doumèche (LPSM (UMR_8001), EDF R&D OSIRIS), Francis Bach (DIENS, SIERRA), Gérard Biau (LPSM (UMR_8001)), Claire Boyer (IUF, LPSM (UMR_8001))
Subjects: Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
Physics-informed machine learning combines the expressiveness of data-based approaches with the interpretability of physical models. In this context, we consider a general regression problem where the empirical risk is regularized by a partial differential equation that quantifies the physical inconsistency. We prove that for linear differential priors, the problem can be formulated as a kernel regression task. Taking advantage of kernel theory, we derive convergence rates for the minimizer of the regularized risk and show that it converges at least at the Sobolev minimax rate. However, faster rates can be achieved, depending on the physical error. This principle is illustrated with a one-dimensional example, supporting the claim that regularizing the empirical risk with physical information can be beneficial to the statistical performance of estimators.
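A minimal sketch of the objective being analyzed, using the differential prior $f'' = 0$ and a finite-difference discretization in place of the paper's exact kernel formulation; the grid size, smoothing weight, and nearest-grid-point sampling operator are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 200)
x_obs = rng.uniform(0, 1, 30)
y_obs = np.sin(2 * np.pi * x_obs) + 0.1 * rng.normal(size=30)

# sampling operator: each observation reads off the nearest grid value
S = np.zeros((len(x_obs), len(grid)))
S[np.arange(len(x_obs)), np.abs(grid[:, None] - x_obs).argmin(axis=0)] = 1.0

# second-difference operator encoding the linear differential prior f'' = 0
dx = grid[1] - grid[0]
D2 = (np.diag(np.ones(len(grid) - 1), 1) - 2 * np.eye(len(grid))
      + np.diag(np.ones(len(grid) - 1), -1))[1:-1] / dx**2

# minimize  ||S f - y||^2 + lam ||D2 f||^2  (empirical risk + PDE penalty)
lam = 1e-9   # strength of the physical penalty, chosen by eye
f = np.linalg.solve(S.T @ S + lam * D2.T @ D2, S.T @ y_obs)
print("max abs error on the grid:", np.abs(f - np.sin(2 * np.pi * grid)).max())
```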
 [129] arXiv:2402.08922 (replaced) [pdf, html, other]

Title: The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes
Journal-ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Large-scale black-box models have become ubiquitous across numerous applications. Understanding the influence of individual training data sources on predictions made by these models is crucial for improving their trustworthiness. Current influence estimation techniques involve computing gradients for every training point or repeated training on different subsets. These approaches face obvious computational challenges when scaled up to large datasets and models.
In this paper, we introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data. Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem: assessing how the predictions for training samples would be altered if the model were trained on specific test samples. Through both empirical and theoretical validations, we demonstrate the wide applicability of our hypothesis. Inspired by this, we introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point. This approach can capitalize on the common asymmetry in scenarios where the number of test samples under concurrent examination is much smaller than the scale of the training dataset, thus gaining a significant improvement in efficiency compared to existing approaches.
We demonstrate the applicability of our method across a range of scenarios, including data attribution in diffusion models, data leakage detection, analysis of memorization, mislabeled data detection, and tracing behavior in language models. Our code will be made available at this https URL.
 [130] arXiv:2402.18213 (replaced) [pdf, other]

Title: Multi-objective Differentiable Neural Architecture Search
Comments: 37 pages, 27 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Pareto front profiling in multi-objective optimization (MOO), i.e., finding a diverse set of Pareto optimal solutions, is challenging, especially with expensive objectives like neural network training. Typically, in MOO neural architecture search (NAS), we aim to balance performance and hardware metrics across devices. Prior NAS approaches simplify this task by incorporating hardware constraints into the objective function, but profiling the Pareto front necessitates a computationally expensive search for each constraint. In this work, we propose a novel NAS algorithm that encodes user preferences for the trade-off between performance and hardware metrics, and yields representative and diverse architectures across multiple devices in just one search run. To this end, we parameterize the joint architectural distribution across devices and multiple objectives via a hypernetwork that can be conditioned on hardware features and preference vectors, enabling zero-shot transferability to new devices. Extensive experiments with up to 19 hardware devices and 3 objectives showcase the effectiveness and scalability of our method. Finally, we show that, without extra costs, our method outperforms existing MOO NAS methods across a broad range of qualitatively different search spaces and datasets, including MobileNetV3 on ImageNet-1k, an encoder-decoder transformer space for machine translation and a decoder-only transformer space for language modelling.
 [131] arXiv:2404.10044 (replaced) [pdf, html, other]

Title: Variational quantum simulation: a case study for understanding warm starts
Comments: 9 + 26 pages, 5 + 2 figures
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
The barren plateau phenomenon, characterized by loss gradients that vanish exponentially with system size, poses a challenge to scaling variational quantum algorithms. Here we explore the potential of warm starts, whereby one initializes closer to a solution in the hope of enjoying larger loss variances. Focusing on an iterative variational method for learning shorter-depth circuits for quantum real and imaginary time evolution, we conduct a case study to elucidate the potential and limitations of warm starts. We start by proving that the iterative variational algorithm will exhibit substantial (at worst vanishing polynomially in system size) gradients in a small region around the initializations at each time step. Convexity guarantees for these regions are then established, suggesting trainability for polynomially sized time steps. However, our study highlights scenarios where a good minimum shifts outside the region with trainability guarantees. Our analysis leaves open the question of whether such jumps of the minimum necessitate optimization across barren plateau landscapes, or whether there exist gradient flows, i.e., fertile valleys away from the plateau with substantial gradients, that allow for training.
 [132] arXiv:2404.10427 (replaced) [pdf, other]

Title: Effect of Systematic Uncertainties on Density and Temperature Estimates in Coronae of Capella
Xixi Yu, Vinay L. Kashyap, Giulio Del Zanna, David A. van Dyk, David C. Stenning, Connor P. Ballance, Harry P. Warren
Subjects: Solar and Stellar Astrophysics (astro-ph.SR); Methodology (stat.ME)
We estimate the coronal density of Capella using the O VII and Fe XVII line systems in the soft X-ray regime that have been observed over the course of the Chandra mission. Our analysis combines measures of error due to uncertainty in the underlying atomic data with statistical errors in the Chandra data to derive meaningful overall uncertainties on the plasma density of the coronae of Capella. We consider two Bayesian frameworks. First, the so-called pragmatic-Bayesian approach considers the atomic data and their uncertainties as fully specified and uncorrectable. The fully-Bayesian approach, on the other hand, allows the observed spectral data to update the atomic data and their uncertainties, thereby reducing the overall errors on the inferred parameters. To incorporate atomic data uncertainties, we obtain a set of atomic data replicates, the distribution of which captures their uncertainty. A principal component analysis of these replicates allows us to represent the atomic uncertainty with a lower-dimensional multivariate Gaussian distribution. A $t$-distribution approximation of the uncertainties of a subset of plasma parameters, including a priori temperature information obtained from the temperature-sensitive-only Fe XVII spectral line analysis, is carried forward into the density- and temperature-sensitive O VII spectral line analysis. Markov chain Monte Carlo based model fitting is implemented, including a multi-step Monte Carlo Gibbs sampler and Hamiltonian Monte Carlo. Our analysis recovers an isothermally approximated coronal plasma temperature of $\approx$5 MK and a coronal plasma density of $\approx$10$^{10}$ cm$^{-3}$, with uncertainties of 0.1 and 0.2 dex respectively.
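The replicate-compression step can be sketched generically: PCA of the replicate matrix yields a low-dimensional multivariate Gaussian from which plausible atomic-data realizations can be drawn. Dimensions and data below are placeholders, not the paper's emissivity tables.

```python
import numpy as np

def gaussian_from_replicates(replicates, n_components=5):
    """Compress an ensemble of atomic-data replicates into a
    low-dimensional multivariate Gaussian via PCA (illustrative only)."""
    mean = replicates.mean(axis=0)
    centered = replicates - mean
    # principal components from the SVD of the centered replicate matrix
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    sigma = s[:n_components] / np.sqrt(len(replicates) - 1)
    return mean, components, sigma

def sample_atomic_data(mean, components, sigma, rng):
    """Draw one plausible realization of the atomic data."""
    z = rng.normal(size=len(sigma))
    return mean + (z * sigma) @ components

rng = np.random.default_rng(2)
reps = rng.normal(size=(100, 400))   # stand-in for atomic-data replicates
mean, comps, sig = gaussian_from_replicates(reps)
draw = sample_atomic_data(mean, comps, sig, rng)
```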
 [133] arXiv:2404.17293 (replaced) [pdf, html, other]

Title: Lazy Data Practices Harm Fairness Research
Journal-ref: FAccT '24: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (2024) 642-659
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP); Machine Learning (stat.ML)
Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications.
Our analyses identify three main areas of concern: (1) a \textbf{lack of representation for certain protected attributes} in both data and evaluations; (2) the widespread \textbf{exclusion of minorities} during data preprocessing; and (3) \textbf{opaque data processing} threatening the generalization of fairness research. By conducting exemplary analyses on the utilization of prominent datasets, we demonstrate how unreflective data decisions disproportionately affect minority groups, fairness metrics, and resultant model comparisons. Additionally, we identify supplementary factors such as limitations in publicly available data, privacy considerations, and a general lack of awareness, which exacerbate these challenges. To address these issues, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
 [134] arXiv:2405.02612 (replaced) [pdf, html, other]

Title: Learning Linear Utility Functions From Pairwise Comparison Queries
Comments: Submitted to ECAI for review
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
We study learnability of linear utility functions from pairwise comparison queries. In particular, we consider two learning objectives. The first objective is to predict out-of-sample responses to pairwise comparisons, whereas the second is to approximately recover the true parameters of the utility function. We show that in the passive learning setting, linear utilities are efficiently learnable with respect to the first objective, both when query responses are uncorrupted by noise, and under Tsybakov noise when the distributions are sufficiently "nice". In contrast, we show that utility parameters are not learnable for a large set of data distributions without strong modeling assumptions, even when query responses are noise-free. Next, we proceed to analyze the learning problem in an active learning setting. In this case, we show that even the second objective is efficiently learnable, and present algorithms for both the noise-free and noisy query response settings. Our results thus exhibit a qualitative learnability gap between passive and active learning from pairwise preference queries, demonstrating the value of the ability to select pairwise queries for utility learning.
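A standard way to implement the first (prediction) objective in the passive setting is logistic regression on feature differences, since a linear utility induces comparisons through $w^\top(x_1 - x_2)$; the sketch below uses this textbook approach with illustrative data, and is not taken from the paper.

```python
import numpy as np

def fit_utility_logistic(X1, X2, responses, lr=0.1, epochs=500):
    """Fit a linear utility from pairwise comparisons via logistic
    regression on differences: P(item 1 preferred) = sigmoid(w . (x1 - x2))."""
    D = X1 - X2
    y = (responses + 1) / 2          # map {-1, +1} -> {0, 1}
    w = np.zeros(D.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-D @ w))
        w += lr * D.T @ (y - p) / len(y)   # gradient ascent on log-likelihood
    return w

rng = np.random.default_rng(3)
w_true = rng.normal(size=5)
X1, X2 = rng.normal(size=(2, 1000, 5))
resp = np.sign((X1 - X2) @ w_true)   # noise-free responses
w_hat = fit_utility_logistic(X1, X2, resp)
# out-of-sample responses are predicted well even though w_hat may differ
# from w_true in scale (consistent with the parameter-recovery negative result)
print(np.mean(np.sign((X1 - X2) @ w_hat) == resp))
```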
 [135] arXiv:2405.02881 (replaced) [pdf, html, other]

Title: FedConPE: Efficient Federated Conversational Bandits with Heterogeneous Clients
Comments: Accepted to the 33rd International Joint Conference on Artificial Intelligence (IJCAI), 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Conversational recommender systems have emerged as a potent solution for efficiently eliciting user preferences. These systems interactively present queries associated with "key terms" to users and leverage user feedback to estimate user preferences more efficiently. Nonetheless, most existing algorithms adopt a centralized approach. In this paper, we introduce FedConPE, a phase-elimination-based federated conversational bandit algorithm, where $M$ agents collaboratively solve a global contextual linear bandit problem with the help of a central server while ensuring secure data management. To effectively coordinate all the clients and aggregate their collected data, FedConPE uses an adaptive approach to construct key terms that minimize uncertainty across all dimensions in the feature space. Furthermore, compared with existing federated linear bandit algorithms, FedConPE offers improved computational and communication efficiency as well as enhanced privacy protections. Our theoretical analysis shows that FedConPE is minimax near-optimal in terms of cumulative regret. We also establish upper bounds for communication costs and conversation frequency. Comprehensive evaluations demonstrate that FedConPE outperforms existing conversational bandit algorithms while using fewer conversations.
 [136] arXiv:2405.04816 (replaced) [pdf, html, other]

Title: Testing the Fairness-Improvability of Algorithms
Subjects: Econometrics (econ.EM); Data Structures and Algorithms (cs.DS); Applications (stat.AP)
Many organizations use algorithms that have a disparate impact, i.e., the benefits or harms of the algorithm fall disproportionately on certain social groups. Addressing an algorithm's disparate impact can be challenging, especially because it is often unclear whether reducing this impact is possible without sacrificing other important objectives of the organization, such as accuracy or profit. Establishing the improvability of algorithms with respect to multiple criteria is of both conceptual and practical interest: in many settings, disparate impact that would otherwise be prohibited under US federal law is permissible if it is necessary to achieve a legitimate business interest. The question is how a policymaker can formally substantiate, or refute, this necessity defense. In this paper, we provide an econometric framework for testing the hypothesis that it is possible to improve on the fairness of an algorithm without compromising on other pre-specified objectives. Our proposed test is simple to implement and can be applied under any exogenous constraint on the algorithm space. We establish the large-sample validity and consistency of our test, and illustrate its practical application by evaluating a healthcare algorithm originally considered by Obermeyer et al. (2019). In this application, we reject the null hypothesis that it is not possible to reduce the algorithm's disparate impact without compromising on the accuracy of its predictions.
 [137] arXiv:2405.05097 (replaced) [pdf, html, other]

Title: Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional neural networks
Comments: 6 pages, 4 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Popular artificial neural networks (ANNs) optimize parameters for unidirectional value propagation, assuming some arbitrary parametrization type like Multi-Layer Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). In contrast, for biological neurons, e.g., "it is not uncommon for axonal propagation of action potentials to happen in both directions"~\cite{axon}, suggesting they are optimized to continuously operate in a multidirectional way. Additionally, the statistical dependencies a single neuron could model are not just (expected) value dependence, but entire joint distributions, including also higher moments. Such a more agnostic joint distribution neuron would allow for multidirectional propagation (of distributions or values), e.g., $\rho(x|y,z)$ or $\rho(y,z|x)$, by substituting into $\rho(x,y,z)$ and normalizing. We discuss Hierarchical Correlation Reconstruction (HCR) for such a neuron model: assuming a $\rho(x,y,z)=\sum_{ijk} a_{ijk} f_i(x) f_j(y) f_k(z)$ type parametrization of the joint distribution in a polynomial basis $f_i$, which allows for flexible, inexpensive processing including nonlinearities, direct model estimation and updates, trained through standard backpropagation or novel ways for such a structure, up to tensor decomposition or information bottleneck approaches. Using only pairwise (input-output) dependencies, its expected value prediction becomes KAN-like with trained activation functions as polynomials, and it can be extended by adding higher order dependencies through included products, in a conscious, interpretable way, allowing for multidirectional propagation of both values and probability densities.
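A minimal sketch of the pairwise case: with an orthonormal polynomial basis on $[0,1]$, the coefficients $a_{ij} = \mathbb{E}[f_i(x) f_j(y)]$ are plain empirical means, and conditioning amounts to substituting a value and normalizing. The basis truncation, crude positivity fix, and toy data are illustrative assumptions.

```python
import numpy as np

# orthonormal polynomial basis on [0,1] (rescaled Legendre), as in HCR
def f_basis(x):
    return np.stack([np.ones_like(x),
                     np.sqrt(3) * (2 * x - 1),
                     np.sqrt(5) * (6 * x**2 - 6 * x + 1)])

def fit_hcr_pairwise(x, y):
    """Estimate joint-density coefficients a_ij = E[f_i(x) f_j(y)]; with an
    orthonormal basis this is just an empirical mean."""
    return f_basis(x) @ f_basis(y).T / len(x)

def propagate(a, x0, grid):
    """Conditional density rho(y | x0) ~ sum_ij a_ij f_i(x0) f_j(y), normalized
    on a grid; multidirectional use would simply swap the axes of a."""
    dens = f_basis(np.array([x0])).T[0] @ a @ f_basis(grid)
    dens = np.clip(dens, 1e-9, None)          # crude positivity fix
    return dens / np.trapz(dens, grid)

rng = np.random.default_rng(4)
x = rng.uniform(size=5000)
y = np.clip(x + 0.1 * rng.normal(size=5000), 0, 1)  # dependent pair in [0,1]
a = fit_hcr_pairwise(x, y)
grid = np.linspace(0, 1, 101)
print("E[y | x=0.8] approx:", np.trapz(grid * propagate(a, 0.8, grid), grid))
```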
 [138] arXiv:2405.10027 (replaced) [pdf, html, other]

Title: The Real Price of Bandit Information in Multiclass Classification
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We revisit the classical problem of multiclass classification with bandit feedback (Kakade, Shalev-Shwartz and Tewari, 2008), where each input classifies to one of $K$ possible labels and feedback is restricted to whether the predicted label is correct or not. Our primary inquiry is with regard to the dependency on the number of labels $K$, and whether $T$-step regret bounds in this setting can be improved beyond the $\smash{\sqrt{KT}}$ dependence exhibited by existing algorithms. Our main contribution is in showing that the minimax regret of bandit multiclass is in fact more nuanced, and is of the form $\smash{\widetilde{\Theta}\left(\min \left\{|H| + \sqrt{T}, \sqrt{KT \log |H|} \right\} \right)}$, where $H$ is the underlying (finite) hypothesis class. In particular, we present a new bandit classification algorithm that guarantees regret $\smash{\widetilde{O}(|H|+\sqrt{T})}$, improving over classical algorithms for moderately-sized hypothesis classes, and give a matching lower bound establishing tightness of the upper bounds (up to log factors) in all parameter regimes.
 [139] arXiv:2405.20231 (replaced) [pdf, html, other]

Title: The Empirical Impact of Neural Parameter Symmetries, or Lack Thereof
Comments: 27 pages. Preparing code for release. v2: added / updated some citations
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Many algorithms and observed phenomena in deep learning appear to be affected by parameter symmetries: transformations of neural network parameters that do not change the underlying neural network function. These include linear mode connectivity, model merging, Bayesian neural network inference, metanetworks, and several other characteristics of optimization or loss landscapes. However, theoretical analysis of the relationship between parameter space symmetries and these phenomena is difficult. In this work, we empirically investigate the impact of neural parameter symmetries by introducing new neural network architectures that have reduced parameter space symmetries. We develop two methods, with some provable guarantees, of modifying standard neural networks to reduce parameter space symmetries. With these new methods, we conduct a comprehensive experimental study consisting of multiple tasks aimed at assessing the effect of removing parameter symmetries. Our experiments reveal several interesting observations on the empirical impact of parameter symmetries; for instance, we observe linear mode connectivity between our networks without alignment of weight spaces, and we find that our networks allow for faster and more effective Bayesian neural network training.
 [140] arXiv:2405.20528 (replaced) [pdf, html, other]

Title: On the Convergence of the Sinkhorn-Knopp Algorithm with Sparse Cost Matrices
Subjects: Optimization and Control (math.OC); Statistics Theory (math.ST)
Matrix scaling problems with sparse cost matrices arise frequently in various domains, such as optimal transport, image processing, and machine learning. The Sinkhorn-Knopp algorithm is a popular iterative method for solving these problems, but its convergence properties in the presence of sparsity have not been thoroughly analyzed. This paper presents a theoretical analysis of the convergence rate of the Sinkhorn-Knopp algorithm specifically for sparse cost matrices. We derive novel bounds on the convergence rate that explicitly depend on the sparsity pattern and the degree of non-sparsity of the cost matrix. These bounds provide new insights into the behavior of the algorithm and highlight the potential for exploiting sparsity to develop more efficient solvers. We also explore connections between our sparse convergence results and existing convergence results for dense matrices, showing that our bounds generalize the dense case. Our analysis reveals that the convergence rate improves as the matrix becomes less sparse and as the minimum entry of the cost matrix increases relative to its maximum entry. These findings have important practical implications, suggesting that the Sinkhorn-Knopp algorithm may be particularly well-suited for large-scale matrix scaling problems with sparse cost matrices arising in real-world applications. Future research directions include investigating tighter bounds based on more sophisticated sparsity patterns, developing algorithm variants that actively exploit sparsity, and empirically validating the benefits of our theoretical results on real-world datasets. This work advances our understanding of the Sinkhorn-Knopp algorithm for an important class of matrix scaling problems and lays the foundation for designing more efficient and scalable solutions in practice.
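For reference, the algorithm under study alternates row and column rescalings; a compact dense implementation follows (the tolerance and the entropic kernel construction are illustrative; sparse-matrix storage would be used at scale).

```python
import numpy as np

def sinkhorn_knopp(K, r, c, tol=1e-9, max_iter=10000):
    """Alternately rescale rows and columns of a nonnegative matrix K so
    that diag(u) K diag(v) has row sums r and column sums c. With a sparse
    cost matrix, K = exp(-C/eps) may contain entries that underflow to
    zero, the regime whose convergence rate the paper analyzes."""
    u, v = np.ones_like(r), np.ones_like(c)
    for _ in range(max_iter):
        u_new = r / (K @ v)            # match row marginals
        v_new = c / (K.T @ u_new)      # match column marginals
        if (np.abs(u_new - u).max() < tol and np.abs(v_new - v).max() < tol):
            u, v = u_new, v_new
            break
        u, v = u_new, v_new
    return u[:, None] * K * v[None, :]   # the scaled (transport) matrix

rng = np.random.default_rng(5)
n = 50
C = rng.uniform(size=(n, n))            # toy cost matrix
K = np.exp(-C / 0.1)                    # entropic kernel
r = c = np.ones(n) / n
P = sinkhorn_knopp(K, r, c)
print("row-sum error:", np.abs(P.sum(axis=1) - r).max())
```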
 [141] arXiv:2406.10719 (replaced) [pdf, html, other]

Title: Trading Devil: Robust backdoor attack via Stochastic investment models and Bayesian approach
Comments: (Last update) Stochastic investment models and a Bayesian approach to better modeling of uncertainty: adversarial machine learning or stochastic market. arXiv admin note: substantial text overlap with arXiv:2402.05967
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
With the growing use of voice-activated systems and speech recognition technologies, the danger of backdoor attacks on audio data has grown significantly. This research looks at a specific type of attack, known as a stochastic investment-based backdoor attack (MarketBack), in which adversaries strategically manipulate the stylistic properties of audio to fool speech recognition systems. Backdoor attacks seriously threaten the security and integrity of machine learning models; to maintain the reliability of audio applications and systems, identifying such attacks is crucial in the context of audio data. Experimental results demonstrate that MarketBack can achieve an average attack success rate close to 100% across seven victim models when poisoning less than 1% of the training data.
 [142] arXiv:2406.11151 (replaced) [pdf, html, other]

Title: Recent and Upcoming Developments in Randomized Numerical Linear Algebra for Machine Learning
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Large matrices arise in many machine learning and data analysis applications, including as representations of datasets, graphs, model weights, and first- and second-order derivatives. Randomized Numerical Linear Algebra (RandNLA) is an area which uses randomness to develop improved algorithms for ubiquitous matrix problems. The area has reached a certain level of maturity; but recent hardware trends, efforts to incorporate RandNLA algorithms into core numerical libraries, and advances in machine learning, statistics, and random matrix theory have led to new theoretical and practical challenges. This article provides a self-contained overview of RandNLA, in light of these developments.
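A representative RandNLA primitive of the kind surveyed here is sketch-and-solve least squares: compress a tall system with a random projection and solve the small problem. The Gaussian sketch and problem sizes below are illustrative; practical codes often use structured or sparse sketches instead.

```python
import numpy as np

def sketched_least_squares(A, b, sketch_rows, seed=0):
    """Sketch-and-solve least squares: compress the tall system with a
    random Gaussian sketch S, then solve min ||S A x - S b||."""
    rng = np.random.default_rng(seed)
    S = rng.normal(size=(sketch_rows, A.shape[0])) / np.sqrt(sketch_rows)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x

rng = np.random.default_rng(6)
A = rng.normal(size=(20000, 50))
b = rng.normal(size=20000)
x_sketch = sketched_least_squares(A, b, sketch_rows=500)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print("relative residual gap:",
      (np.linalg.norm(A @ x_sketch - b) - np.linalg.norm(A @ x_exact - b))
      / np.linalg.norm(A @ x_exact - b))
```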
 [143] arXiv:2406.12649 (replaced) [pdf, html, other]

Title: Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models
Comments: Accepted at ICML 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Vision transformers (ViTs) have emerged as a significant area of focus, particularly for their capacity to be jointly trained with large language models and to serve as robust vision foundation models. Yet, the development of trustworthy explanation methods for ViTs has lagged, particularly in the context of post-hoc interpretations of ViT predictions. Existing sub-image selection approaches, such as feature-attribution and conceptual models, fall short in this regard. This paper proposes five desiderata for explaining ViTs (faithfulness, stability, sparsity, multi-level structure, and parsimony) and demonstrates the inadequacy of current methods in meeting these criteria comprehensively. We introduce a variational Bayesian explanation framework, dubbed ProbAbilistic Concept Explainers (PACE), which models the distributions of patch embeddings to provide trustworthy post-hoc conceptual explanations. Our qualitative analysis reveals the distributions of patch-level concepts, elucidating the effectiveness of ViTs by modeling the joint distribution of patch embeddings and ViT's predictions. Moreover, these patch-level explanations bridge the gap between image-level and dataset-level explanations, thus completing the multi-level structure of PACE. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that PACE surpasses state-of-the-art methods in terms of the defined desiderata.