## Linear regression

Many statistics used by psychologists and social scientists are special cases of the linear model. Generalizations of the linear model include an even wider range of statistical models.

Consider the following models:

• y~x or y~1+x are both examples of simple linear regression with an implicit or explicit intercept.
• y~0+x or y~ -1 +x or y~ x-1 linear regression through the origin
• y ~ A where A is a matrix of categorical factors is a classic ANOVA model.
• y ~ A + x is ANOVA with x as a covariate
• y ~A*B or y~ A + B + A*B ANOVA with interaction terms
• ...

These models can be fitted with the linear model function (lm) and then various summary statistics are available of the fit. The data set is our familiar set of Eysenck Personality Inventory and Big Five Inventory scales with Beck Depression and state and trait anxiety scales as well. The first analysis just regresses BDI on EPI Neuroticism, then we add in Trait Anxiety, and then the N * trait Anx interaction. Note that we need to 0 center the predictors when we have an interaction term if we expect to interpret the additive effects correctly. Centering is done with the scale () function. Graphical summaries of the regession show four plots: residuals as a function of the fitted values, standard errors of the residuals, a plot of the residuals versus a normal distribution, and finally, a plot of the leverage of subjects to determine outliers. Models 5 and 6 predict bdi using the BFI, and model 7 (for too much fitting) looks at the epi and bfi and the interactions of their scales. What follows are the commands for a number of demonstrations. Samples of the commands and the output may be found in the regression page.

Get the data, list the variables in the data set.

```datafilename="http://personality-project.org/revelle/R/datasets/maps.mixx.epi.bfi.data"  #where are the data
names(data)     #what variables are in the data set?
attach(data)    #make the variables easier to use
```

Produces this output:

```datafilename="http://personality-project.org/revelle/R/datasets/maps.mixx.epi.bfi.data"  #where are the data
> names(data)     #what variables are in the data set?
[1] "epiE"     "epiS"     "epiImp"   "epilie"   "epiNeur"  "bfagree"
[7] "bfcon"    "bfext"    "bfneur"   "bfopen"   "bdi"      "traitanx"
[13] "stateanx"

attach(data)    #make the variables easier to use
```

First show the zero order correlation scatter plot, then do the first regression and plot the diagnostics.

```layout(1)    #a one panel graph
plot(epiNeur,bdi)    #x, y plot
lines(lowess(epiNeur,bdi))   #fit with lowess function
```

Do the regression

```model1 = lm(bdi~epiNeur)  #simple regression of  beck depression on Neuroticism
summary(model1)   # basic statistical summary

#pass parameters to the graphics device
par(mfrow = c(2, 2), # 2 x 2 pictures on one plot
pty = "s")       # square plotting region,
# independent of device size

plot(model1)     #diagnostic plots in the graphics window

> model1 = lm(bdi~epiNeur)  #simple regression of  beck depression on Neuroticism
> summary(model1)   # basic statistical summary

Call:
lm(formula = bdi ~ epiNeur)

Residuals:
Min       1Q   Median       3Q      Max
-11.9548  -3.1577  -0.7707   2.0452  16.4092

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.32129    0.73070   -0.44     0.66
epiNeur      0.68200    0.06353   10.74   <2e-16 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 4.721 on 229 degrees of freedom
Multiple R-Squared: 0.3348,	Adjusted R-squared: 0.3319
F-statistic: 115.3 on 1 and 229 DF,  p-value: < 2.2e-16

>
>                   #pass parameters to the graphics device
> op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot
+                pty = "s")       # square plotting region,
>                                 # independent of device size
>
>
> plot(model1)     #diagnostic plots in the graphics window
```

Model 2 adds in trait anxiety.

```model2=lm(bdi~epiNeur+traitanx)    #add in trait anxiety
summary(model2)    #basic output
plot(model2)
anova(model1,model2)             #compare the difference between the two models
```

Gives us:

```> model2=lm(bdi~epiNeur+traitanx)    #add in trait anxiety
> summary(model2)    #basic output

Call:
lm(formula = bdi ~ epiNeur + traitanx)

Residuals:
Min       1Q   Median       3Q      Max
-12.0288  -2.6848  -0.5121   1.9823  13.3081

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.63779    1.24729  -6.124 3.97e-09 ***
epiNeur      0.25506    0.08448   3.019  0.00282 **
traitanx     0.30151    0.04347   6.935 4.13e-11 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 4.299 on 228 degrees of freedom
Multiple R-Squared: 0.4507,	Adjusted R-squared: 0.4459
F-statistic: 93.53 on 2 and 228 DF,  p-value: < 2.2e-16

> plot(model2)                    #show the diagnostics on the screen
> anova(model1,model2)             #compare the difference between the two models
Analysis of Variance Table

Model 1: bdi ~ epiNeur
Model 2: bdi ~ epiNeur + traitanx
Res.Df    RSS  Df Sum of Sq      F    Pr(>F)
1    229 5103.3
2    228 4214.3   1     889.1 48.101 4.133e-11 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
```

Model 2.5 specifies the additive and interactive effects of Neuroticsm and Trait Anxiety. However, because we have not 0 centered N and TA, the main effects are incorrect, although the interaction term and overall fit is correct.

By using the scale () function we are able to correctly estimate the main effect terms. Note that models 3 and 4 are the exact same model, but specied differently.

```model2.5=lm(bdi~epiNeur*traitanx)  #test for the interaction, note that the main effects are incorrect
summary(model2.5)               #because we need to 0 center the data
anova(model2,model2.5)          #compare the two models

#rescale the data to do the analysis
cneur=scale(epiNeur,scale=F)    #0 center epiNeur
zneur=scale(epiNeur,scale=T)     #standardize epiNeur
ctrait = scale(traitanx,scale=F) #0 center traitAnx

model3=lm(bdi~cneur+ctrait+cneur*ctrait)
summary(model3)                    #explicitly list the additive and interactive terms
plot(model3)
model4=lm(bdi~cneur*ctrait)
summary(model4)                    #note how this is exactly the same as the previous model
```

```> model2.5=lm(bdi~epiNeur*traitanx)  #test for the interaction, note that the main effects are incorrect
> summary(model2.5)               #because we need to 0 center the data

Call:
lm(formula = bdi ~ epiNeur * traitanx)

Residuals:
Min      1Q  Median      3Q     Max
-12.353  -2.625  -0.663   1.930  13.496

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)      -4.520422   2.751136  -1.643   0.1017
epiNeur          -0.012794   0.227024  -0.056   0.9551
traitanx          0.210860   0.083505   2.525   0.0122 *
epiNeur:traitanx  0.007290   0.005736   1.271   0.2051
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 4.293 on 227 degrees of freedom
Multiple R-Squared: 0.4546,	Adjusted R-squared: 0.4474
F-statistic: 63.06 on 3 and 227 DF,  p-value: < 2.2e-16

> anova(model2,model2.5)          #compare the two models
Analysis of Variance Table

Model 1: bdi ~ epiNeur + traitanx
Model 2: bdi ~ epiNeur * traitanx
Res.Df    RSS  Df Sum of Sq     F Pr(>F)
1    228 4214.3
2    227 4184.5   1      29.8 1.615 0.2051
>
>                                 #rescale the data to do the analysis
> cneur=scale(epiNeur,scale=F)    #0 center epiNeur
> zneur=scale(epiNeur,scale=T)     #standardize epiNeur
> ctrait = scale(traitanx,scale=F) #0 center traitAnx
>
> model3=lm(bdi~cneur+ctrait+cneur*ctrait)
> summary(model3)                    #explicitly list the additive and interactive terms

Call:
lm(formula = bdi ~ cneur + ctrait + cneur * ctrait)

Residuals:
Min      1Q  Median      3Q     Max
-12.353  -2.625  -0.663   1.930  13.496

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.532458   0.342787  19.057  < 2e-16 ***
cneur        0.271581   0.085363   3.181  0.00167 **
ctrait       0.286759   0.044940   6.381 9.78e-10 ***
cneur:ctrait 0.007290   0.005736   1.271  0.20509
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 4.293 on 227 degrees of freedom
Multiple R-Squared: 0.4546,	Adjusted R-squared: 0.4474
F-statistic: 63.06 on 3 and 227 DF,  p-value: < 2.2e-16

> plot(model3)
> model4=lm(bdi~cneur*ctrait)
> summary(model4)                    #note how this is exactly the same as the previous model

Call:
lm(formula = bdi ~ cneur * ctrait)

Residuals:
Min      1Q  Median      3Q     Max
-12.353  -2.625  -0.663   1.930  13.496

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.532458   0.342787  19.057  < 2e-16 ***
cneur        0.271581   0.085363   3.181  0.00167 **
ctrait       0.286759   0.044940   6.381 9.78e-10 ***
cneur:ctrait 0.007290   0.005736   1.271  0.20509
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 4.293 on 227 degrees of freedom
Multiple R-Squared: 0.4546,	Adjusted R-squared: 0.4474
F-statistic: 63.06 on 3 and 227 DF,  p-value: < 2.2e-16
```

For the next set of fits, we standardize all of the EPI scales and the BFI scales, and then use the combined sets to fit the data. (This is probably overfitting, and is done as an example what can be done, not best practices.)

```epi=cbind(epiS,epiImp,epilie,epiNeur)       #form a new variable "epi" without overall extraversion
epi=as.matrix(epi)      #actually, more useful to treat this variable as a matrix

bfi=as.matrix(cbind(bfext,bfneur,bfagree,bfcon,bfopen))    #create bfi as a matrix as well
epi=scale(epi,scale=T)   #standardize the epi
bfi=scale(bfi,scale=T)   #standardize the bfi

model5=lm(bdi~bfi)                 #model beck depression by the Big 5 inventory
summary(model5)
model6=lm(bdi~bfi+epi)  		#
summary(model6)

model7 = lm(bdi~bfi*epi)          #additive model of epi and bfi as well as the interactions between the sets
summary(model7)                   #given as an example of overfitting

layout(1)                 #reset the plotting device to do single graphs

> epi=cbind(epiS,epiImp,epilie,epiNeur)       #form a new variable "epi" without overall extraversion
> epi=as.matrix(epi)      #actually, more useful to treat this variable as a matrix
>
> bfi=as.matrix(cbind(bfext,bfneur,bfagree,bfcon,bfopen))    #create bfi as a matrix as well
> epi=scale(epi,scale=T)   #standardize the epi
> bfi=scale(bfi,scale=T)   #standardize the bfi
>
> model5=lm(bdi~bfi)                 #model beck depression by the Big 5 inventory
> summary(model5)

Call:
lm(formula = bdi ~ bfi)

Residuals:
Min       1Q   Median       3Q      Max
-10.2635  -3.1809  -0.6355   2.1409  19.4566

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.7792     0.3239  20.933  < 2e-16 ***
bfibfext     -0.3519     0.3929  -0.896   0.3713
bfibfneur     3.0576     0.3459   8.840 2.81e-16 ***
bfibfagree    0.2932     0.4094   0.716   0.4746
bfibfcon     -0.8968     0.3680  -2.437   0.0156 *
bfibfopen    -1.0197     0.4013  -2.541   0.0117 *
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 4.922 on 225 degrees of freedom
Multiple R-Squared: 0.2894,	Adjusted R-squared: 0.2736
F-statistic: 18.33 on 5 and 225 DF,  p-value: 2.975e-15

> model6=lm(bdi~bfi+epi)  		#
> summary(model6)

Call:
lm(formula = bdi ~ bfi + epi)

Residuals:
Min       1Q   Median       3Q      Max
-11.7001  -2.7939  -0.6687   2.2363  16.3576

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.77922    0.30055  22.556  < 2e-16 ***
bfibfext     0.32262    0.45645   0.707  0.48044
bfibfneur    1.35905    0.41935   3.241  0.00138 **
bfibfagree   0.06383    0.38805   0.164  0.86950
bfibfcon    -0.66832    0.37153  -1.799  0.07341 .
bfibfopen   -1.00298    0.37404  -2.681  0.00788 **
epiepiS      0.09068    0.39339   0.231  0.81791
epiepiImp   -0.62218    0.36787  -1.691  0.09219 .
epiepilie   -0.26893    0.33323  -0.807  0.42050
epiepiNeur   2.45393    0.41509   5.912 1.27e-08 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 4.568 on 221 degrees of freedom
Multiple R-Squared: 0.3989,	Adjusted R-squared: 0.3744
F-statistic:  16.3 on 9 and 221 DF,  p-value: < 2.2e-16

>
> model7 = lm(bdi~bfi*epi)          #additive model of epi and bfi as well as the interactions between the sets
> summary(model7)                   #given as an example of overfitting

Call:
lm(formula = bdi ~ bfi * epi)

Residuals:
Min       1Q   Median       3Q      Max
-12.0191  -2.6525  -0.5497   1.9093  14.9430

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)            6.279425   0.423041  14.844  < 2e-16 ***
bfibfext               0.206300   0.493871   0.418   0.6766
bfibfneur              1.158978   0.451906   2.565   0.0111 *
bfibfagree             0.002001   0.417169   0.005   0.9962
bfibfcon              -0.675954   0.393181  -1.719   0.0871 .
bfibfopen             -0.685177   0.416177  -1.646   0.1013
epiepiS                0.235251   0.430179   0.547   0.5851
epiepiImp             -0.593336   0.396046  -1.498   0.1357
epiepilie             -0.228319   0.365699  -0.624   0.5331
epiepiNeur             2.655196   0.432756   6.136 4.44e-09 ***
bfibfext:epiepiS       0.236116   0.458918   0.515   0.6075
bfibfneur:epiepiS      0.319599   0.403803   0.791   0.4296
bfibfagree:epiepiS    -0.263338   0.484497  -0.544   0.5874
bfibfcon:epiepiS      -0.055400   0.413468  -0.134   0.8935
bfibfopen:epiepiS      0.144438   0.496949   0.291   0.7716
bfibfext:epiepiImp    -0.065013   0.463085  -0.140   0.8885
bfibfneur:epiepiImp   -0.150381   0.393607  -0.382   0.7028
bfibfagree:epiepiImp   0.335234   0.478316   0.701   0.4842
bfibfcon:epiepiImp     0.021982   0.394558   0.056   0.9556
bfibfopen:epiepiImp    0.355919   0.539411   0.660   0.5101
bfibfext:epiepilie    -0.053018   0.424651  -0.125   0.9008
bfibfneur:epiepilie   -0.088985   0.420050  -0.212   0.8324
bfibfagree:epiepilie   0.474549   0.474784   1.000   0.3188
bfibfcon:epiepilie     0.026155   0.382927   0.068   0.9456
bfibfopen:epiepilie   -0.086944   0.463418  -0.188   0.8514
bfibfext:epiepiNeur    0.700558   0.460097   1.523   0.1294
bfibfneur:epiepiNeur   0.579644   0.369567   1.568   0.1184
bfibfagree:epiepiNeur -0.475656   0.442734  -1.074   0.2839
bfibfcon:epiepiNeur   -0.070141   0.460483  -0.152   0.8791
bfibfopen:epiepiNeur  -0.213680   0.417856  -0.511   0.6097
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 4.607 on 201 degrees of freedom
Multiple R-Squared: 0.444,	Adjusted R-squared: 0.3638
F-statistic: 5.534 on 29 and 201 DF,  p-value: 5.879e-14
```