Linear regression

Many statistics used by psychologists and social scientists are special cases of the linear model. Generalizations of the linear model include an even wider range of statistical models.

Consider the following models:

  • y~x or y~1+x are both examples of simple linear regression with an implicit or explicit intercept.
  • y~0+x or y~ -1 +x or y~ x-1 linear regression through the origin
  • y ~ A where A is a matrix of categorical factors is a classic ANOVA model.
  • y ~ A + x is ANOVA with x as a covariate
  • y ~A*B or y~ A + B + A*B ANOVA with interaction terms
  • ...

These models can be fitted with the linear model function (lm) and then various summary statistics are available of the fit. The data set is our familiar set of Eysenck Personality Inventory and Big Five Inventory scales with Beck Depression and state and trait anxiety scales as well. The first analysis just regresses BDI on EPI Neuroticism, then we add in Trait Anxiety, and then the N * trait Anx interaction. Note that we need to 0 center the predictors when we have an interaction term if we expect to interpret the additive effects correctly. Centering is done with the scale () function. Graphical summaries of the regession show four plots: residuals as a function of the fitted values, standard errors of the residuals, a plot of the residuals versus a normal distribution, and finally, a plot of the leverage of subjects to determine outliers. Models 5 and 6 predict bdi using the BFI, and model 7 (for too much fitting) looks at the epi and bfi and the interactions of their scales. What follows are the commands for a number of demonstrations. Samples of the commands and the output may be found in the regression page.

Get the data, list the variables in the data set.

datafilename="http://personality-project.org/revelle/R/datasets/maps.mixx.epi.bfi.data"  #where are the data
data =read.table(datafilename,header=TRUE)  #read the data file
names(data)     #what variables are in the data set?
attach(data)    #make the variables easier to use

Produces this output:

datafilename="http://personality-project.org/revelle/R/datasets/maps.mixx.epi.bfi.data"  #where are the data
> data =read.table(datafilename,header=TRUE)  #read the data file
> names(data)     #what variables are in the data set?
 [1] "epiE"     "epiS"     "epiImp"   "epilie"   "epiNeur"  "bfagree" 
 [7] "bfcon"    "bfext"    "bfneur"   "bfopen"   "bdi"      "traitanx"
[13] "stateanx"

attach(data)    #make the variables easier to use

First show the zero order correlation scatter plot, then do the first regression and plot the diagnostics.

layout(1)    #a one panel graph
plot(epiNeur,bdi)    #x, y plot
lines(lowess(epiNeur,bdi))   #fit with lowess function

Do the regression

model1 = lm(bdi~epiNeur)  #simple regression of  beck depression on Neuroticism
summary(model1)   # basic statistical summary

                  #pass parameters to the graphics device
par(mfrow = c(2, 2), # 2 x 2 pictures on one plot
               pty = "s")       # square plotting region,
                                # independent of device size
                   
    
plot(model1)     #diagnostic plots in the graphics window

> model1 = lm(bdi~epiNeur)  #simple regression of  beck depression on Neuroticism
> summary(model1)   # basic statistical summary

Call:
lm(formula = bdi ~ epiNeur)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.9548  -3.1577  -0.7707   2.0452  16.4092 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.32129    0.73070   -0.44     0.66    
epiNeur      0.68200    0.06353   10.74   <2e-16 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 4.721 on 229 degrees of freedom
Multiple R-Squared: 0.3348,	Adjusted R-squared: 0.3319 
F-statistic: 115.3 on 1 and 229 DF,  p-value: < 2.2e-16 

> 
>                   #pass parameters to the graphics device
> op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot
+                pty = "s")       # square plotting region,
>                                 # independent of device size
> 
>     
> plot(model1)     #diagnostic plots in the graphics window

Model 2 adds in trait anxiety.

model2=lm(bdi~epiNeur+traitanx)    #add in trait anxiety
summary(model2)    #basic output
plot(model2)
anova(model1,model2)             #compare the difference between the two models

Gives us:

> model2=lm(bdi~epiNeur+traitanx)    #add in trait anxiety
> summary(model2)    #basic output


Call:
lm(formula = bdi ~ epiNeur + traitanx)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0288  -2.6848  -0.5121   1.9823  13.3081 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.63779    1.24729  -6.124 3.97e-09 ***
epiNeur      0.25506    0.08448   3.019  0.00282 ** 
traitanx     0.30151    0.04347   6.935 4.13e-11 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 4.299 on 228 degrees of freedom
Multiple R-Squared: 0.4507,	Adjusted R-squared: 0.4459 
F-statistic: 93.53 on 2 and 228 DF,  p-value: < 2.2e-16 

> plot(model2)                    #show the diagnostics on the screen
> anova(model1,model2)             #compare the difference between the two models
Analysis of Variance Table

Model 1: bdi ~ epiNeur
Model 2: bdi ~ epiNeur + traitanx
  Res.Df    RSS  Df Sum of Sq      F    Pr(>F)    
1    229 5103.3                                   
2    228 4214.3   1     889.1 48.101 4.133e-11 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Model 2.5 specifies the additive and interactive effects of Neuroticsm and Trait Anxiety. However, because we have not 0 centered N and TA, the main effects are incorrect, although the interaction term and overall fit is correct.

By using the scale () function we are able to correctly estimate the main effect terms. Note that models 3 and 4 are the exact same model, but specied differently.

model2.5=lm(bdi~epiNeur*traitanx)  #test for the interaction, note that the main effects are incorrect
summary(model2.5)               #because we need to 0 center the data
anova(model2,model2.5)          #compare the two models 

                                #rescale the data to do the analysis  
cneur=scale(epiNeur,scale=F)    #0 center epiNeur
zneur=scale(epiNeur,scale=T)     #standardize epiNeur
ctrait = scale(traitanx,scale=F) #0 center traitAnx

model3=lm(bdi~cneur+ctrait+cneur*ctrait)
summary(model3)                    #explicitly list the additive and interactive terms
plot(model3)
model4=lm(bdi~cneur*ctrait)
summary(model4)                    #note how this is exactly the same as the previous model

> model2.5=lm(bdi~epiNeur*traitanx)  #test for the interaction, note that the main effects are incorrect
> summary(model2.5)               #because we need to 0 center the data

Call:
lm(formula = bdi ~ epiNeur * traitanx)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.353  -2.625  -0.663   1.930  13.496 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)  
(Intercept)      -4.520422   2.751136  -1.643   0.1017  
epiNeur          -0.012794   0.227024  -0.056   0.9551  
traitanx          0.210860   0.083505   2.525   0.0122 *
epiNeur:traitanx  0.007290   0.005736   1.271   0.2051  
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 4.293 on 227 degrees of freedom
Multiple R-Squared: 0.4546,	Adjusted R-squared: 0.4474 
F-statistic: 63.06 on 3 and 227 DF,  p-value: < 2.2e-16 

> anova(model2,model2.5)          #compare the two models 
Analysis of Variance Table

Model 1: bdi ~ epiNeur + traitanx
Model 2: bdi ~ epiNeur * traitanx
  Res.Df    RSS  Df Sum of Sq     F Pr(>F)
1    228 4214.3                           
2    227 4184.5   1      29.8 1.615 0.2051
> 
>                                 #rescale the data to do the analysis  
> cneur=scale(epiNeur,scale=F)    #0 center epiNeur
> zneur=scale(epiNeur,scale=T)     #standardize epiNeur
> ctrait = scale(traitanx,scale=F) #0 center traitAnx
> 
> model3=lm(bdi~cneur+ctrait+cneur*ctrait)
> summary(model3)                    #explicitly list the additive and interactive terms

Call:
lm(formula = bdi ~ cneur + ctrait + cneur * ctrait)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.353  -2.625  -0.663   1.930  13.496 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.532458   0.342787  19.057  < 2e-16 ***
cneur        0.271581   0.085363   3.181  0.00167 ** 
ctrait       0.286759   0.044940   6.381 9.78e-10 ***
cneur:ctrait 0.007290   0.005736   1.271  0.20509    
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 4.293 on 227 degrees of freedom
Multiple R-Squared: 0.4546,	Adjusted R-squared: 0.4474 
F-statistic: 63.06 on 3 and 227 DF,  p-value: < 2.2e-16 

> plot(model3)
> model4=lm(bdi~cneur*ctrait)
> summary(model4)                    #note how this is exactly the same as the previous model

Call:
lm(formula = bdi ~ cneur * ctrait)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.353  -2.625  -0.663   1.930  13.496 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.532458   0.342787  19.057  < 2e-16 ***
cneur        0.271581   0.085363   3.181  0.00167 ** 
ctrait       0.286759   0.044940   6.381 9.78e-10 ***
cneur:ctrait 0.007290   0.005736   1.271  0.20509    
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 4.293 on 227 degrees of freedom
Multiple R-Squared: 0.4546,	Adjusted R-squared: 0.4474 
F-statistic: 63.06 on 3 and 227 DF,  p-value: < 2.2e-16 

For the next set of fits, we standardize all of the EPI scales and the BFI scales, and then use the combined sets to fit the data. (This is probably overfitting, and is done as an example what can be done, not best practices.)

epi=cbind(epiS,epiImp,epilie,epiNeur)       #form a new variable "epi" without overall extraversion
epi=as.matrix(epi)      #actually, more useful to treat this variable as a matrix

bfi=as.matrix(cbind(bfext,bfneur,bfagree,bfcon,bfopen))    #create bfi as a matrix as well
epi=scale(epi,scale=T)   #standardize the epi
bfi=scale(bfi,scale=T)   #standardize the bfi

model5=lm(bdi~bfi)                 #model beck depression by the Big 5 inventory
summary(model5)
model6=lm(bdi~bfi+epi)  		#
summary(model6)

model7 = lm(bdi~bfi*epi)          #additive model of epi and bfi as well as the interactions between the sets
summary(model7)                   #given as an example of overfitting

layout(1)                 #reset the plotting device to do single graphs


> epi=cbind(epiS,epiImp,epilie,epiNeur)       #form a new variable "epi" without overall extraversion
> epi=as.matrix(epi)      #actually, more useful to treat this variable as a matrix
> 
> bfi=as.matrix(cbind(bfext,bfneur,bfagree,bfcon,bfopen))    #create bfi as a matrix as well
> epi=scale(epi,scale=T)   #standardize the epi
> bfi=scale(bfi,scale=T)   #standardize the bfi
> 
> model5=lm(bdi~bfi)                 #model beck depression by the Big 5 inventory
> summary(model5)

Call:
lm(formula = bdi ~ bfi)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.2635  -3.1809  -0.6355   2.1409  19.4566 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.7792     0.3239  20.933  < 2e-16 ***
bfibfext     -0.3519     0.3929  -0.896   0.3713    
bfibfneur     3.0576     0.3459   8.840 2.81e-16 ***
bfibfagree    0.2932     0.4094   0.716   0.4746    
bfibfcon     -0.8968     0.3680  -2.437   0.0156 *  
bfibfopen    -1.0197     0.4013  -2.541   0.0117 *  
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 4.922 on 225 degrees of freedom
Multiple R-Squared: 0.2894,	Adjusted R-squared: 0.2736 
F-statistic: 18.33 on 5 and 225 DF,  p-value: 2.975e-15 

> model6=lm(bdi~bfi+epi)  		#
> summary(model6)

Call:
lm(formula = bdi ~ bfi + epi)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.7001  -2.7939  -0.6687   2.2363  16.3576 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.77922    0.30055  22.556  < 2e-16 ***
bfibfext     0.32262    0.45645   0.707  0.48044    
bfibfneur    1.35905    0.41935   3.241  0.00138 ** 
bfibfagree   0.06383    0.38805   0.164  0.86950    
bfibfcon    -0.66832    0.37153  -1.799  0.07341 .  
bfibfopen   -1.00298    0.37404  -2.681  0.00788 ** 
epiepiS      0.09068    0.39339   0.231  0.81791    
epiepiImp   -0.62218    0.36787  -1.691  0.09219 .  
epiepilie   -0.26893    0.33323  -0.807  0.42050    
epiepiNeur   2.45393    0.41509   5.912 1.27e-08 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 4.568 on 221 degrees of freedom
Multiple R-Squared: 0.3989,	Adjusted R-squared: 0.3744 
F-statistic:  16.3 on 9 and 221 DF,  p-value: < 2.2e-16 

> 
> model7 = lm(bdi~bfi*epi)          #additive model of epi and bfi as well as the interactions between the sets
> summary(model7)                   #given as an example of overfitting

Call:
lm(formula = bdi ~ bfi * epi)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0191  -2.6525  -0.5497   1.9093  14.9430 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            6.279425   0.423041  14.844  < 2e-16 ***
bfibfext               0.206300   0.493871   0.418   0.6766    
bfibfneur              1.158978   0.451906   2.565   0.0111 *  
bfibfagree             0.002001   0.417169   0.005   0.9962    
bfibfcon              -0.675954   0.393181  -1.719   0.0871 .  
bfibfopen             -0.685177   0.416177  -1.646   0.1013    
epiepiS                0.235251   0.430179   0.547   0.5851    
epiepiImp             -0.593336   0.396046  -1.498   0.1357    
epiepilie             -0.228319   0.365699  -0.624   0.5331    
epiepiNeur             2.655196   0.432756   6.136 4.44e-09 ***
bfibfext:epiepiS       0.236116   0.458918   0.515   0.6075    
bfibfneur:epiepiS      0.319599   0.403803   0.791   0.4296    
bfibfagree:epiepiS    -0.263338   0.484497  -0.544   0.5874    
bfibfcon:epiepiS      -0.055400   0.413468  -0.134   0.8935    
bfibfopen:epiepiS      0.144438   0.496949   0.291   0.7716    
bfibfext:epiepiImp    -0.065013   0.463085  -0.140   0.8885    
bfibfneur:epiepiImp   -0.150381   0.393607  -0.382   0.7028    
bfibfagree:epiepiImp   0.335234   0.478316   0.701   0.4842    
bfibfcon:epiepiImp     0.021982   0.394558   0.056   0.9556    
bfibfopen:epiepiImp    0.355919   0.539411   0.660   0.5101    
bfibfext:epiepilie    -0.053018   0.424651  -0.125   0.9008    
bfibfneur:epiepilie   -0.088985   0.420050  -0.212   0.8324    
bfibfagree:epiepilie   0.474549   0.474784   1.000   0.3188    
bfibfcon:epiepilie     0.026155   0.382927   0.068   0.9456    
bfibfopen:epiepilie   -0.086944   0.463418  -0.188   0.8514    
bfibfext:epiepiNeur    0.700558   0.460097   1.523   0.1294    
bfibfneur:epiepiNeur   0.579644   0.369567   1.568   0.1184    
bfibfagree:epiepiNeur -0.475656   0.442734  -1.074   0.2839    
bfibfcon:epiepiNeur   -0.070141   0.460483  -0.152   0.8791    
bfibfopen:epiepiNeur  -0.213680   0.417856  -0.511   0.6097    
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 4.607 on 201 degrees of freedom
Multiple R-Squared: 0.444,	Adjusted R-squared: 0.3638 
F-statistic: 5.534 on 29 and 201 DF,  p-value: 5.879e-14