Linear regression
Many statistics used by psychologists and social scientists are special cases of the linear model. Generalizations of the linear model include an even wider range of statistical models.
Consider the following models:
- y~x or y~1+x are both examples of simple linear regression with an implicit or explicit intercept.
- y~0+x or y~ -1 +x or y~ x-1 linear regression through the origin
- y ~ A where A is a matrix of categorical factors is a classic ANOVA model.
- y ~ A + x is ANOVA with x as a covariate
- y ~A*B or y~ A + B + A*B ANOVA with interaction terms
- ...
These models can be fitted with the linear model function (lm) and then various summary statistics are available of the fit. The data set is our familiar set of Eysenck Personality Inventory and Big Five Inventory scales with Beck Depression and state and trait anxiety scales as well. The first analysis just regresses BDI on EPI Neuroticism, then we add in Trait Anxiety, and then the N * trait Anx interaction. Note that we need to 0 center the predictors when we have an interaction term if we expect to interpret the additive effects correctly. Centering is done with the scale () function. Graphical summaries of the regession show four plots: residuals as a function of the fitted values, standard errors of the residuals, a plot of the residuals versus a normal distribution, and finally, a plot of the leverage of subjects to determine outliers. Models 5 and 6 predict bdi using the BFI, and model 7 (for too much fitting) looks at the epi and bfi and the interactions of their scales. What follows are the commands for a number of demonstrations. Samples of the commands and the output may be found in the regression page.
Get the data, list the variables in the data set.
datafilename="http://personality-project.org/revelle/R/datasets/maps.mixx.epi.bfi.data" #where are the data data =read.table(datafilename,header=TRUE) #read the data file names(data) #what variables are in the data set? attach(data) #make the variables easier to use
Produces this output:
datafilename="http://personality-project.org/revelle/R/datasets/maps.mixx.epi.bfi.data" #where are the data > data =read.table(datafilename,header=TRUE) #read the data file > names(data) #what variables are in the data set? [1] "epiE" "epiS" "epiImp" "epilie" "epiNeur" "bfagree" [7] "bfcon" "bfext" "bfneur" "bfopen" "bdi" "traitanx" [13] "stateanx" attach(data) #make the variables easier to use
First show the zero order correlation scatter plot, then do the first regression and plot the diagnostics.
layout(1) #a one panel graph plot(epiNeur,bdi) #x, y plot lines(lowess(epiNeur,bdi)) #fit with lowess function
Do the regression
model1 = lm(bdi~epiNeur) #simple regression of beck depression on Neuroticism
summary(model1) # basic statistical summary
#pass parameters to the graphics device
par(mfrow = c(2, 2), # 2 x 2 pictures on one plot
pty = "s") # square plotting region,
# independent of device size
plot(model1) #diagnostic plots in the graphics window
> model1 = lm(bdi~epiNeur) #simple regression of beck depression on Neuroticism
> summary(model1) # basic statistical summary
Call:
lm(formula = bdi ~ epiNeur)
Residuals:
Min 1Q Median 3Q Max
-11.9548 -3.1577 -0.7707 2.0452 16.4092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.32129 0.73070 -0.44 0.66
epiNeur 0.68200 0.06353 10.74 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4.721 on 229 degrees of freedom
Multiple R-Squared: 0.3348, Adjusted R-squared: 0.3319
F-statistic: 115.3 on 1 and 229 DF, p-value: < 2.2e-16
>
> #pass parameters to the graphics device
> op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot
+ pty = "s") # square plotting region,
> # independent of device size
>
>
> plot(model1) #diagnostic plots in the graphics window
Model 2 adds in trait anxiety.
model2=lm(bdi~epiNeur+traitanx) #add in trait anxiety summary(model2) #basic output plot(model2) anova(model1,model2) #compare the difference between the two models
Gives us:
> model2=lm(bdi~epiNeur+traitanx) #add in trait anxiety
> summary(model2) #basic output
Call:
lm(formula = bdi ~ epiNeur + traitanx)
Residuals:
Min 1Q Median 3Q Max
-12.0288 -2.6848 -0.5121 1.9823 13.3081
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.63779 1.24729 -6.124 3.97e-09 ***
epiNeur 0.25506 0.08448 3.019 0.00282 **
traitanx 0.30151 0.04347 6.935 4.13e-11 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4.299 on 228 degrees of freedom
Multiple R-Squared: 0.4507, Adjusted R-squared: 0.4459
F-statistic: 93.53 on 2 and 228 DF, p-value: < 2.2e-16
> plot(model2) #show the diagnostics on the screen
> anova(model1,model2) #compare the difference between the two models
Analysis of Variance Table
Model 1: bdi ~ epiNeur
Model 2: bdi ~ epiNeur + traitanx
Res.Df RSS Df Sum of Sq F Pr(>F)
1 229 5103.3
2 228 4214.3 1 889.1 48.101 4.133e-11 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Model 2.5 specifies the additive and interactive effects of Neuroticsm and Trait Anxiety. However, because we have not 0 centered N and TA, the main effects are incorrect, although the interaction term and overall fit is correct.
By using the scale () function we are able to correctly estimate the main effect terms. Note that models 3 and 4 are the exact same model, but specied differently.
model2.5=lm(bdi~epiNeur*traitanx) #test for the interaction, note that the main effects are incorrect
summary(model2.5) #because we need to 0 center the data
anova(model2,model2.5) #compare the two models
#rescale the data to do the analysis
cneur=scale(epiNeur,scale=F) #0 center epiNeur
zneur=scale(epiNeur,scale=T) #standardize epiNeur
ctrait = scale(traitanx,scale=F) #0 center traitAnx
model3=lm(bdi~cneur+ctrait+cneur*ctrait)
summary(model3) #explicitly list the additive and interactive terms
plot(model3)
model4=lm(bdi~cneur*ctrait)
summary(model4) #note how this is exactly the same as the previous model
> model2.5=lm(bdi~epiNeur*traitanx) #test for the interaction, note that the main effects are incorrect
> summary(model2.5) #because we need to 0 center the data
Call:
lm(formula = bdi ~ epiNeur * traitanx)
Residuals:
Min 1Q Median 3Q Max
-12.353 -2.625 -0.663 1.930 13.496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.520422 2.751136 -1.643 0.1017
epiNeur -0.012794 0.227024 -0.056 0.9551
traitanx 0.210860 0.083505 2.525 0.0122 *
epiNeur:traitanx 0.007290 0.005736 1.271 0.2051
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4.293 on 227 degrees of freedom
Multiple R-Squared: 0.4546, Adjusted R-squared: 0.4474
F-statistic: 63.06 on 3 and 227 DF, p-value: < 2.2e-16
> anova(model2,model2.5) #compare the two models
Analysis of Variance Table
Model 1: bdi ~ epiNeur + traitanx
Model 2: bdi ~ epiNeur * traitanx
Res.Df RSS Df Sum of Sq F Pr(>F)
1 228 4214.3
2 227 4184.5 1 29.8 1.615 0.2051
>
> #rescale the data to do the analysis
> cneur=scale(epiNeur,scale=F) #0 center epiNeur
> zneur=scale(epiNeur,scale=T) #standardize epiNeur
> ctrait = scale(traitanx,scale=F) #0 center traitAnx
>
> model3=lm(bdi~cneur+ctrait+cneur*ctrait)
> summary(model3) #explicitly list the additive and interactive terms
Call:
lm(formula = bdi ~ cneur + ctrait + cneur * ctrait)
Residuals:
Min 1Q Median 3Q Max
-12.353 -2.625 -0.663 1.930 13.496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.532458 0.342787 19.057 < 2e-16 ***
cneur 0.271581 0.085363 3.181 0.00167 **
ctrait 0.286759 0.044940 6.381 9.78e-10 ***
cneur:ctrait 0.007290 0.005736 1.271 0.20509
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4.293 on 227 degrees of freedom
Multiple R-Squared: 0.4546, Adjusted R-squared: 0.4474
F-statistic: 63.06 on 3 and 227 DF, p-value: < 2.2e-16
> plot(model3)
> model4=lm(bdi~cneur*ctrait)
> summary(model4) #note how this is exactly the same as the previous model
Call:
lm(formula = bdi ~ cneur * ctrait)
Residuals:
Min 1Q Median 3Q Max
-12.353 -2.625 -0.663 1.930 13.496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.532458 0.342787 19.057 < 2e-16 ***
cneur 0.271581 0.085363 3.181 0.00167 **
ctrait 0.286759 0.044940 6.381 9.78e-10 ***
cneur:ctrait 0.007290 0.005736 1.271 0.20509
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4.293 on 227 degrees of freedom
Multiple R-Squared: 0.4546, Adjusted R-squared: 0.4474
F-statistic: 63.06 on 3 and 227 DF, p-value: < 2.2e-16
For the next set of fits, we standardize all of the EPI scales and the BFI scales, and then use the combined sets to fit the data. (This is probably overfitting, and is done as an example what can be done, not best practices.)
epi=cbind(epiS,epiImp,epilie,epiNeur) #form a new variable "epi" without overall extraversion
epi=as.matrix(epi) #actually, more useful to treat this variable as a matrix
bfi=as.matrix(cbind(bfext,bfneur,bfagree,bfcon,bfopen)) #create bfi as a matrix as well
epi=scale(epi,scale=T) #standardize the epi
bfi=scale(bfi,scale=T) #standardize the bfi
model5=lm(bdi~bfi) #model beck depression by the Big 5 inventory
summary(model5)
model6=lm(bdi~bfi+epi) #
summary(model6)
model7 = lm(bdi~bfi*epi) #additive model of epi and bfi as well as the interactions between the sets
summary(model7) #given as an example of overfitting
layout(1) #reset the plotting device to do single graphs
> epi=cbind(epiS,epiImp,epilie,epiNeur) #form a new variable "epi" without overall extraversion
> epi=as.matrix(epi) #actually, more useful to treat this variable as a matrix
>
> bfi=as.matrix(cbind(bfext,bfneur,bfagree,bfcon,bfopen)) #create bfi as a matrix as well
> epi=scale(epi,scale=T) #standardize the epi
> bfi=scale(bfi,scale=T) #standardize the bfi
>
> model5=lm(bdi~bfi) #model beck depression by the Big 5 inventory
> summary(model5)
Call:
lm(formula = bdi ~ bfi)
Residuals:
Min 1Q Median 3Q Max
-10.2635 -3.1809 -0.6355 2.1409 19.4566
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.7792 0.3239 20.933 < 2e-16 ***
bfibfext -0.3519 0.3929 -0.896 0.3713
bfibfneur 3.0576 0.3459 8.840 2.81e-16 ***
bfibfagree 0.2932 0.4094 0.716 0.4746
bfibfcon -0.8968 0.3680 -2.437 0.0156 *
bfibfopen -1.0197 0.4013 -2.541 0.0117 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4.922 on 225 degrees of freedom
Multiple R-Squared: 0.2894, Adjusted R-squared: 0.2736
F-statistic: 18.33 on 5 and 225 DF, p-value: 2.975e-15
> model6=lm(bdi~bfi+epi) #
> summary(model6)
Call:
lm(formula = bdi ~ bfi + epi)
Residuals:
Min 1Q Median 3Q Max
-11.7001 -2.7939 -0.6687 2.2363 16.3576
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.77922 0.30055 22.556 < 2e-16 ***
bfibfext 0.32262 0.45645 0.707 0.48044
bfibfneur 1.35905 0.41935 3.241 0.00138 **
bfibfagree 0.06383 0.38805 0.164 0.86950
bfibfcon -0.66832 0.37153 -1.799 0.07341 .
bfibfopen -1.00298 0.37404 -2.681 0.00788 **
epiepiS 0.09068 0.39339 0.231 0.81791
epiepiImp -0.62218 0.36787 -1.691 0.09219 .
epiepilie -0.26893 0.33323 -0.807 0.42050
epiepiNeur 2.45393 0.41509 5.912 1.27e-08 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4.568 on 221 degrees of freedom
Multiple R-Squared: 0.3989, Adjusted R-squared: 0.3744
F-statistic: 16.3 on 9 and 221 DF, p-value: < 2.2e-16
>
> model7 = lm(bdi~bfi*epi) #additive model of epi and bfi as well as the interactions between the sets
> summary(model7) #given as an example of overfitting
Call:
lm(formula = bdi ~ bfi * epi)
Residuals:
Min 1Q Median 3Q Max
-12.0191 -2.6525 -0.5497 1.9093 14.9430
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.279425 0.423041 14.844 < 2e-16 ***
bfibfext 0.206300 0.493871 0.418 0.6766
bfibfneur 1.158978 0.451906 2.565 0.0111 *
bfibfagree 0.002001 0.417169 0.005 0.9962
bfibfcon -0.675954 0.393181 -1.719 0.0871 .
bfibfopen -0.685177 0.416177 -1.646 0.1013
epiepiS 0.235251 0.430179 0.547 0.5851
epiepiImp -0.593336 0.396046 -1.498 0.1357
epiepilie -0.228319 0.365699 -0.624 0.5331
epiepiNeur 2.655196 0.432756 6.136 4.44e-09 ***
bfibfext:epiepiS 0.236116 0.458918 0.515 0.6075
bfibfneur:epiepiS 0.319599 0.403803 0.791 0.4296
bfibfagree:epiepiS -0.263338 0.484497 -0.544 0.5874
bfibfcon:epiepiS -0.055400 0.413468 -0.134 0.8935
bfibfopen:epiepiS 0.144438 0.496949 0.291 0.7716
bfibfext:epiepiImp -0.065013 0.463085 -0.140 0.8885
bfibfneur:epiepiImp -0.150381 0.393607 -0.382 0.7028
bfibfagree:epiepiImp 0.335234 0.478316 0.701 0.4842
bfibfcon:epiepiImp 0.021982 0.394558 0.056 0.9556
bfibfopen:epiepiImp 0.355919 0.539411 0.660 0.5101
bfibfext:epiepilie -0.053018 0.424651 -0.125 0.9008
bfibfneur:epiepilie -0.088985 0.420050 -0.212 0.8324
bfibfagree:epiepilie 0.474549 0.474784 1.000 0.3188
bfibfcon:epiepilie 0.026155 0.382927 0.068 0.9456
bfibfopen:epiepilie -0.086944 0.463418 -0.188 0.8514
bfibfext:epiepiNeur 0.700558 0.460097 1.523 0.1294
bfibfneur:epiepiNeur 0.579644 0.369567 1.568 0.1184
bfibfagree:epiepiNeur -0.475656 0.442734 -1.074 0.2839
bfibfcon:epiepiNeur -0.070141 0.460483 -0.152 0.8791
bfibfopen:epiepiNeur -0.213680 0.417856 -0.511 0.6097
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4.607 on 201 degrees of freedom
Multiple R-Squared: 0.444, Adjusted R-squared: 0.3638
F-statistic: 5.534 on 29 and 201 DF, p-value: 5.879e-14