When I first learned data analysis, I always checked normality for each variable and made sure they were normally distributed before running any analyses, such as t-test, ANOVA, or linear regression. I thought normal distribution of variables was the important assumption to proceed to analyses. That’s why stats textbooks show you how to draw histograms and QQ-plots in the early chapters and see if your variables are normally distributed, isn’t it? There I was, drawing histograms, looking at the shape and thinking, “Oh, no, my data are not normal. I should transform them first or I can’t run any analyses.”

No, you don’t have to transform your observed variables just because they don’t follow a normal distribution. Linear regression analysis, which includes t-test and ANOVA, does not assume normality for either predictors (IV) or an outcome (DV).

No way! When I learned regression analysis, I remember my stats professor said we should check normality!

Yes, you should check normality of errors AFTER modeling. In linear regression, errors are assumed to follow a normal distribution with a mean of zero. Let’s do some simulations to see how normality influences analysis results and what the consequences of violating the assumption could be.

    # results1 and results2 hold the simulated coefficient estimates (column est)

    # The estimates are normally distributed in Case 1
    hist(results1$est, breaks=100, freq=FALSE,
         main='Case 1: Normal Errors', xlab='Coefficient Estimation')
    # Overlay a normal density with the same mean and sd
    # (add=TRUE is assumed; the rest of the original call is missing)
    curve(dnorm(x, mean=mean(results1$est), sd=sd(results1$est)),
          add=TRUE)

    # The estimates are NOT normally distributed in Case 2
    hist(results2$est, breaks=100, freq=FALSE,
         main='Case 2: Non-normal Errors', xlab='Coefficient Estimation')
    curve(dnorm(x, mean=mean(results2$est), sd=sd(results2$est)),
          add=TRUE)

The distribution of estimated coefficients follows a normal distribution in Case 1, but not in Case 2. That means that in Case 2 we cannot apply hypothesis testing, which is based on a normal distribution (or related distributions, such as a t-distribution). When errors are not normally distributed, the estimates are not normally distributed and we can no longer rely on p-values to decide if the coefficient is different from zero. In short, if the normality assumption of the errors is not met, we cannot draw a valid conclusion based on statistical inference in linear regression analysis.

Note that non-normal errors do not bias the coefficient estimates themselves; what is at risk is the standard errors and p-values. And even then those procedures are actually pretty robust to violations of normality. In our second example above, our simulated sample size was 30 (kind of small) and our errors were drawn from a chi-square distribution with 1 degree of freedom. (You can hardly get more non-normal than that!) And yet the histogram of the coefficient’s sampling distribution was not as far from normal as you might expect. Now if your sample is small (fewer than 30 observations) and you detect extremely non-normal errors, you might consider alternatives to the usual standard errors and p-values, such as bootstrapping. But otherwise you can probably rest easy if your errors seem “normal enough”.

Okay, I understand my variables don’t have to be normal. Why do we even bother checking histograms before analysis then?

Although your data don’t have to be normal, it’s still a good idea to check data distributions just to understand your data. Do they look reasonable? Your data might not be normal for a reason. Is it count data or reaction time? In such cases, you may want to transform it or use other analysis methods (e.g., generalized linear models or nonparametric methods). The relationship between two variables may also be non-linear (which you might detect with a scatterplot). In that case transforming one or both variables may be necessary.

None of your observed variables have to be normal in linear regression analysis, which includes t-test and ANOVA. The errors after modeling, however, should be normal to draw a valid conclusion by hypothesis testing.
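The code that generated `results1` and `results2` did not survive in this copy of the post, so here is a minimal sketch of how such a simulation might look. Everything in it is an assumption rather than the original code: the helper name `simulate_slopes`, the true slope of 0.5, the 2000 replications, and the centering of the chi-square errors so that they have mean zero. Only the sample size of 30 and the chi-square distribution with 1 degree of freedom come from the text.

```r
# Hypothetical reconstruction, not the original post's code.
set.seed(1)

# Fit y = 0.5 * x + error many times and keep the estimated slope each time.
simulate_slopes <- function(n_sims, n, error_fun) {
  est <- numeric(n_sims)
  for (i in seq_len(n_sims)) {
    x <- rnorm(n)
    y <- 0.5 * x + error_fun(n)          # true slope 0.5 (assumed)
    est[i] <- coef(lm(y ~ x))[["x"]]
  }
  data.frame(est = est)
}

# Case 1: normal errors. Case 2: chi-square(1) errors, shifted to mean zero.
results1 <- simulate_slopes(2000, 30, function(n) rnorm(n))
results2 <- simulate_slopes(2000, 30, function(n) rchisq(n, df = 1) - 1)

# Sample skewness of the slope estimates; values near 0 suggest symmetry.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
print(c(case1 = skewness(results1$est), case2 = skewness(results2$est)))
```

Both data frames have an `est` column, so they can be passed straight to `hist()` to compare the two sampling distributions.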
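To make “check normality of errors AFTER modeling” concrete, the following sketch fits a regression with a clearly non-normal predictor and then inspects the residuals rather than the raw variables. The data are simulated and all names (`x`, `y`, `fit`, `res`) are illustrative assumptions.

```r
# Illustrative data: a skewed predictor, but normally distributed errors.
set.seed(2)
n <- 200
x <- rexp(n)                    # exponential, so far from normal
y <- 1 + 2 * x + rnorm(n)       # errors are normal with mean zero

fit <- lm(y ~ x)
res <- residuals(fit)

# The predictor fails a normality test, which is fine for regression.
print(shapiro.test(x)$p.value)

# What matters is the residuals: test them and look at a QQ-plot.
print(shapiro.test(res)$p.value)
if (interactive()) {
  qqnorm(res)
  qqline(res)
}
```

A QQ-plot of the residuals is usually more informative than a formal test, because with large samples a test will flag deviations that are too small to matter.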
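For the small-sample, badly non-normal situation where the post suggests bootstrapping, a simple case-resampling percentile bootstrap for the slope could be sketched as follows. The data, the 2000 resamples, and the 95% level are assumptions chosen for illustration; packages such as `boot` offer more refined intervals.

```r
# Illustrative small sample with strongly skewed errors.
set.seed(3)
n <- 25
x <- rnorm(n)
y <- 0.5 * x + (rchisq(n, df = 1) - 1)
dat <- data.frame(x = x, y = y)

# Resample rows with replacement, refit, and keep the slope each time.
n_boot <- 2000
boot_slopes <- replicate(n_boot, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))[["x"]]
})

# Percentile interval: the middle 95% of the bootstrap slopes.
ci <- quantile(boot_slopes, c(0.025, 0.975))
print(ci)
```

If the interval excludes zero, that is evidence of a non-zero slope that does not lean on the normality of the errors.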