**regression**

[1908.05355] The generalization error of random features regression: Precise asymptotics and double descent curve

yesterday by cshalizi

"Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise. Despite their huge complexity, the same architectures achieve small generalization error on real data.

"This phenomenon has been rationalized in terms of a so-called `double descent' curve. As the model complexity increases, the generalization error follows the usual U-shaped curve at the beginning, first decreasing and then peaking around the interpolation threshold (when the model achieves vanishing training error). However, it descends again as model complexity exceeds this threshold. The global minimum of the generalization error is found in this overparametrized regime, often when the number of parameters is much larger than the number of samples. Far from being a peculiar property of deep neural networks, elements of this behavior have been demonstrated in much simpler settings, including linear regression with random covariates.

"In this paper we consider the problem of learning an unknown function over the d-dimensional sphere 𝕊d−1, from n i.i.d. samples (xi,yi)∈𝕊d−1×ℝ, i≤n. We perform ridge regression on N random features of the form σ(w𝖳ax), a≤N. This can be equivalently described as a two-layers neural network with random first-layer weights. We compute the precise asymptotics of the generalization error, in the limit N,n,d→∞ with N/d and n/d fixed. This provides the first analytically tractable model that captures all the features of the double descent phenomenon."

to:NB
learning_theory
regression
random_projections
statistics
montanari.andrea
"This phenomenon has been rationalized in terms of a so-called `double descent' curve. As the model complexity increases, the generalization error follows the usual U-shaped curve at the beginning, first decreasing and then peaking around the interpolation threshold (when the model achieves vanishing training error). However, it descends again as model complexity exceeds this threshold. The global minimum of the generalization error is found in this overparametrized regime, often when the number of parameters is much larger than the number of samples. Far from being a peculiar property of deep neural networks, elements of this behavior have been demonstrated in much simpler settings, including linear regression with random covariates.

"In this paper we consider the problem of learning an unknown function over the d-dimensional sphere 𝕊d−1, from n i.i.d. samples (xi,yi)∈𝕊d−1×ℝ, i≤n. We perform ridge regression on N random features of the form σ(w𝖳ax), a≤N. This can be equivalently described as a two-layers neural network with random first-layer weights. We compute the precise asymptotics of the generalization error, in the limit N,n,d→∞ with N/d and n/d fixed. This provides the first analytically tractable model that captures all the features of the double descent phenomenon."


[1611.03015] Honest confidence sets in nonparametric IV regression and other ill-posed models

yesterday by cshalizi

"This paper develops inferential methods for a very general class of ill-posed models in econometrics encompassing the nonparametric instrumental regression, various functional regressions, and the density deconvolution. We focus on uniform confidence sets for the parameter of interest estimated with Tikhonov regularization, as in Darolles, Fan, Florens, and Renault (2011). Since it is impossible to have inferential methods based on the central limit theorem, we develop two alternative approaches relying on the concentration inequality and bootstrap approximations. We show that expected diameters and coverage properties of resulting sets have uniform validity over a large class of models, i.e., constructed confidence sets are honest. Monte Carlo experiments illustrate that introduced confidence sets have reasonable width and coverage properties. Using the U.S. data, we provide uniform confidence sets for Engel curves for various commodities."

to:NB
confidence_sets
nonparametrics
instrumental_variables
regression
causal_inference

[1908.04427] A Groupwise Approach for Inferring Heterogeneous Treatment Effects in Causal Inference

3 days ago by cshalizi

"There is a growing literature in nonparametric estimation of the conditional average treatment effect given a specific value of covariates. However, this estimate is often difficult to interpret if covariates are high dimensional and in practice, effect heterogeneity is discussed in terms of subgroups of individuals with similar attributes. The paper propose to study treatment heterogeneity under the groupwise framework. Our method is simple, only based on linear regression and sample splitting, and is semiparametrically efficient under assumptions. We also discuss ways to conduct multiple testing. We conclude by reanalyzing a get-out-the-vote experiment during the 2014 U.S. midterm elections."

to:NB
causal_inference
regression
statistics
nonparametrics

[1605.02214] On cross-validated Lasso

3 days ago by cshalizi

"In this paper, we derive non-asymptotic error bounds for the Lasso estimator when the penalty parameter for the estimator is chosen using K-fold cross-validation. Our bounds imply that the cross-validated Lasso estimator has nearly optimal rates of convergence in the prediction, L2, and L1 norms. For example, we show that in the model with the Gaussian noise and under fairly general assumptions on the candidate set of values of the penalty parameter, the estimation error of the cross-validated Lasso estimator converges to zero in the prediction norm with the slogp/n‾‾‾‾‾‾‾‾√×log(pn)‾‾‾‾‾‾‾√ rate, where n is the sample size of available data, p is the number of covariates, and s is the number of non-zero coefficients in the model. Thus, the cross-validated Lasso estimator achieves the fastest possible rate of convergence in the prediction norm up to a small logarithmic factor log(pn)‾‾‾‾‾‾‾√, and similar conclusions apply for the convergence rate both in L2 and in L1 norms. Importantly, our results cover the case when p is (potentially much) larger than n and also allow for the case of non-Gaussian noise. Our paper therefore serves as a justification for the widely spread practice of using cross-validation as a method to choose the penalty parameter for the Lasso estimator."

to:NB
cross-validation
lasso
regression
statistics

[1908.02399] Estimation of Conditional Average Treatment Effects with High-Dimensional Data

9 days ago by cshalizi

"Given the unconfoundedness assumption, we propose new nonparametric estimators for the reduced dimensional conditional average treatment effect (CATE) function. In the first stage, the nuisance functions necessary for identifying CATE are estimated by machine learning methods, allowing the number of covariates to be comparable to or larger than the sample size. This is a key feature since identification is generally more credible if the full vector of conditioning variables, including possible transformations, is high-dimensional. The second stage consists of a low-dimensional kernel regression, reducing CATE to a function of the covariate(s) of interest. We consider two variants of the estimator depending on whether the nuisance functions are estimated over the full sample or over a hold-out sample. Building on Belloni at al. (2017) and Chernozhukov et al. (2018), we derive functional limit theory for the estimators and provide an easy-to-implement procedure for uniform inference based on the multiplier bootstrap."

to:NB
causal_inference
regression
statistics
high-dimensional_statistics
nonparametrics
kernel_estimators

[1908.02718] A Characterization of Mean Squared Error for Estimator with Bagging

9 days ago by cshalizi

"Bagging can significantly improve the generalization performance of unstable machine learning algorithms such as trees or neural networks. Though bagging is now widely used in practice and many empirical studies have explored its behavior, we still know little about the theoretical properties of bagged predictions. In this paper, we theoretically investigate how the bagging method can reduce the Mean Squared Error (MSE) when applied on a statistical estimator. First, we prove that for any estimator, increasing the number of bagged estimators N in the average can only reduce the MSE. This intuitive result, observed empirically and discussed in the literature, has not yet been rigorously proved. Second, we focus on the standard estimator of variance called unbiased sample variance and we develop an exact analytical expression of the MSE for this estimator with bagging.

"This allows us to rigorously discuss the number of iterations N and the batch size m of the bagging method. From this expression, we state that only if the kurtosis of the distribution is greater than 32, the MSE of the variance estimator can be reduced with bagging. This result is important because it demonstrates that for distribution with low kurtosis, bagging can only deteriorate the performance of a statistical prediction. Finally, we propose a novel general-purpose algorithm to estimate with high precision the variance of a sample."

to:NB
ensemble_methods
prediction
regression
statistics
"This allows us to rigorously discuss the number of iterations N and the batch size m of the bagging method. From this expression, we state that only if the kurtosis of the distribution is greater than 32, the MSE of the variance estimator can be reduced with bagging. This result is important because it demonstrates that for distribution with low kurtosis, bagging can only deteriorate the performance of a statistical prediction. Finally, we propose a novel general-purpose algorithm to estimate with high precision the variance of a sample."


[1907.12732] Local Inference in Additive Models with Decorrelated Local Linear Estimator

10 days ago by cshalizi

"Additive models, as a natural generalization of linear regression, have played an important role in studying nonlinear relationships. Despite of a rich literature and many recent advances on the topic, the statistical inference problem in additive models is still relatively poorly understood. Motivated by the inference for the exposure effect and other applications, we tackle in this paper the statistical inference problem for f′1(x0) in additive models, where f1 denotes the univariate function of interest and f′1(x0) denotes its first order derivative evaluated at a specific point x0. The main challenge for this local inference problem is the understanding and control of the additional uncertainty due to the need of estimating other components in the additive model as nuisance functions. To address this, we propose a decorrelated local linear estimator, which is particularly useful in reducing the effect of the nuisance function estimation error on the estimation accuracy of f′1(x0). We establish the asymptotic limiting distribution for the proposed estimator and then construct confidence interval and hypothesis testing procedures for f′1(x0). The variance level of the proposed estimator is of the same order as that of the local least squares in nonparametric regression, or equivalently the additive model with one component, while the bias of the proposed estimator is jointly determined by the statistical accuracies in estimating the nuisance functions and the relationship between the variable of interest and the nuisance variables. The method is developed for general additive models and is demonstrated in the high-dimensional sparse setting."

to:NB
additive_models
regression
statistics

Robust Linear Regression with Student’s $t$-Distribution | A. Solomon Kurz

10 days ago by rmhogervorst

Nice overview of the effect of outliers on linear regression.

bayesian
howto
regression
tdistribution
r
stan

The standard errors of persistence | VOX, CEPR Policy Portal

10 days ago by yorksranter

Statistics is about extracting structure from data. The difficulty with spatial noise patterns, as the coloured simulations in Figure 1 illustrate, is that they contain a lot of apparent order, like faces in clouds. This structure makes it perilously easy to unearth spurious patterns and mistake them for convincing evidence of deep historical processes.

regression
statistics
history
economichistory
geography
reproducibilitycrisis
badstatistics

Tan, Zhang: Doubly penalized estimation in additive regression with high-dimensional data

14 days ago by cshalizi

"Additive regression provides an extension of linear regression by modeling the signal of a response as a sum of functions of covariates of relatively low complexity. We study penalized estimation in high-dimensional nonparametric additive regression where functional semi-norms are used to induce smoothness of component functions and the empirical L2L2 norm is used to induce sparsity. The functional semi-norms can be of Sobolev or bounded variation types and are allowed to be different amongst individual component functions. We establish oracle inequalities for the predictive performance of such methods under three simple technical conditions: a sub-Gaussian condition on the noise, a compatibility condition on the design and the functional classes under consideration and an entropy condition on the functional classes. For random designs, the sample compatibility condition can be replaced by its population version under an additional condition to ensure suitable convergence of empirical norms. In homogeneous settings where the complexities of the component functions are of the same order, our results provide a spectrum of minimax convergence rates, from the so-called slow rate without requiring the compatibility condition to the fast rate under the hard sparsity or certain LqLq sparsity to allow many small components in the true regression function. These results significantly broaden and sharpen existing ones in the literature."

to:NB
statistics
regression
additive_models
nonparametrics
empirical_processes

Morse-Smale Regression

15 days ago by tobym

Continuous (piecewise) regression. Good at dealing with clusters and outliers that would break other types of regression. Divides the topology into neighborhoods (k-nearest neighbor) and regresses each piece; keeps the best-fitting levels.

regression
ml
machinelearning

Jackknife+

17 days ago by csantos

This paper proposes the jackknife+, a modification of the jackknife method for providing prediction intervals based on any regression algorithm. The jackknife+ has theoretically guaranteed coverage for any data distribution and any algorithm. Here we give python code to reproduce the empirical results and plots in the paper.
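
As a rough illustration of the idea (not the authors' released code), the sketch below builds a jackknife+ interval from leave-one-out fits. The one-dimensional least-squares line used as the "regression algorithm", the default `alpha`, and the function names are assumptions chosen to keep the example self-contained.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def jackknife_plus_interval(xs, ys, x_new, alpha=0.1):
    """Jackknife+ prediction interval at x_new from n leave-one-out fits."""
    n = len(xs)
    lower, upper = [], []
    for i in range(n):
        # Refit with point i held out
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        resid = abs(ys[i] - (a + b * xs[i]))  # leave-one-out residual
        pred = a + b * x_new                  # leave-one-out prediction at x_new
        lower.append(pred - resid)
        upper.append(pred + resid)
    lower.sort()
    upper.sort()
    k_lo = math.floor(alpha * (n + 1))        # 1-indexed rank for the lower end
    k_hi = math.ceil((1 - alpha) * (n + 1))   # 1-indexed rank for the upper end
    lo = lower[k_lo - 1] if k_lo >= 1 else -math.inf
    hi = upper[k_hi - 1] if k_hi <= n else math.inf
    return lo, hi
```

Unlike the plain jackknife, each leave-one-out residual is paired with that same leave-one-out model's prediction at the new point, rather than being centred on the full-sample prediction. Any fitting routine could be swapped in for `fit_line`.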

regression
prediction
Statistics
by:RyanTibshirani
by:EmmanuelCandes