Linear regression
Many statistics used by psychologists and social scientists are special cases of the linear model. Generalizations of the linear model include an even wider range of statistical models.
Consider the following models:
- y~x or y~1+x are both examples of simple linear regression with an implicit or explicit intercept.
- y~0+x or y~ -1 +x or y~ x-1 linear regression through the origin
- y ~ A where A is a matrix of categorical factors is a classic ANOVA model.
- y ~ A + x is ANOVA with x as a covariate
- y ~A*B or y~ A + B + A*B ANOVA with interaction terms
- ...
These models can be fitted with the linear model function (lm) and then various summary statistics are available of the fit. The data set is our familiar set of Eysenck Personality Inventory and Big Five Inventory scales with Beck Depression and state and trait anxiety scales as well. The first analysis just regresses BDI on EPI Neuroticism, then we add in Trait Anxiety, and then the N * trait Anx interaction. Note that we need to 0 center the predictors when we have an interaction term if we expect to interpret the additive effects correctly. Centering is done with the scale () function. Graphical summaries of the regession show four plots: residuals as a function of the fitted values, standard errors of the residuals, a plot of the residuals versus a normal distribution, and finally, a plot of the leverage of subjects to determine outliers. Models 5 and 6 predict bdi using the BFI, and model 7 (for too much fitting) looks at the epi and bfi and the interactions of their scales. What follows are the commands for a number of demonstrations. Samples of the commands and the output may be found in the regression page.
Get the data, list the variables in the data set.
datafilename="http://personality-project.org/revelle/R/datasets/maps.mixx.epi.bfi.data" #where are the data data =read.table(datafilename,header=TRUE) #read the data file names(data) #what variables are in the data set? attach(data) #make the variables easier to use
Produces this output:
datafilename="http://personality-project.org/revelle/R/datasets/maps.mixx.epi.bfi.data" #where are the data > data =read.table(datafilename,header=TRUE) #read the data file > names(data) #what variables are in the data set? [1] "epiE" "epiS" "epiImp" "epilie" "epiNeur" "bfagree" [7] "bfcon" "bfext" "bfneur" "bfopen" "bdi" "traitanx" [13] "stateanx" attach(data) #make the variables easier to use
First show the zero order correlation scatter plot, then do the first regression and plot the diagnostics.
layout(1) #a one panel graph plot(epiNeur,bdi) #x, y plot lines(lowess(epiNeur,bdi)) #fit with lowess function

Do the regression
model1 = lm(bdi~epiNeur) #simple regression of beck depression on Neuroticism summary(model1) # basic statistical summary #pass parameters to the graphics device par(mfrow = c(2, 2), # 2 x 2 pictures on one plot pty = "s") # square plotting region, # independent of device size plot(model1) #diagnostic plots in the graphics window > model1 = lm(bdi~epiNeur) #simple regression of beck depression on Neuroticism > summary(model1) # basic statistical summary Call: lm(formula = bdi ~ epiNeur) Residuals: Min 1Q Median 3Q Max -11.9548 -3.1577 -0.7707 2.0452 16.4092 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.32129 0.73070 -0.44 0.66 epiNeur 0.68200 0.06353 10.74 <2e-16 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.721 on 229 degrees of freedom Multiple R-Squared: 0.3348, Adjusted R-squared: 0.3319 F-statistic: 115.3 on 1 and 229 DF, p-value: < 2.2e-16 > > #pass parameters to the graphics device > op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot + pty = "s") # square plotting region, > # independent of device size > > > plot(model1) #diagnostic plots in the graphics window

Model 2 adds in trait anxiety.
model2=lm(bdi~epiNeur+traitanx) #add in trait anxiety summary(model2) #basic output plot(model2) anova(model1,model2) #compare the difference between the two models
Gives us:
> model2=lm(bdi~epiNeur+traitanx) #add in trait anxiety > summary(model2) #basic output Call: lm(formula = bdi ~ epiNeur + traitanx) Residuals: Min 1Q Median 3Q Max -12.0288 -2.6848 -0.5121 1.9823 13.3081 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -7.63779 1.24729 -6.124 3.97e-09 *** epiNeur 0.25506 0.08448 3.019 0.00282 ** traitanx 0.30151 0.04347 6.935 4.13e-11 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.299 on 228 degrees of freedom Multiple R-Squared: 0.4507, Adjusted R-squared: 0.4459 F-statistic: 93.53 on 2 and 228 DF, p-value: < 2.2e-16 > plot(model2) #show the diagnostics on the screen > anova(model1,model2) #compare the difference between the two models Analysis of Variance Table Model 1: bdi ~ epiNeur Model 2: bdi ~ epiNeur + traitanx Res.Df RSS Df Sum of Sq F Pr(>F) 1 229 5103.3 2 228 4214.3 1 889.1 48.101 4.133e-11 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Model 2.5 specifies the additive and interactive effects of Neuroticsm and Trait Anxiety. However, because we have not 0 centered N and TA, the main effects are incorrect, although the interaction term and overall fit is correct.
By using the scale () function we are able to correctly estimate the main effect terms. Note that models 3 and 4 are the exact same model, but specied differently.
model2.5=lm(bdi~epiNeur*traitanx) #test for the interaction, note that the main effects are incorrect summary(model2.5) #because we need to 0 center the data anova(model2,model2.5) #compare the two models #rescale the data to do the analysis cneur=scale(epiNeur,scale=F) #0 center epiNeur zneur=scale(epiNeur,scale=T) #standardize epiNeur ctrait = scale(traitanx,scale=F) #0 center traitAnx model3=lm(bdi~cneur+ctrait+cneur*ctrait) summary(model3) #explicitly list the additive and interactive terms plot(model3) model4=lm(bdi~cneur*ctrait) summary(model4) #note how this is exactly the same as the previous model
> model2.5=lm(bdi~epiNeur*traitanx) #test for the interaction, note that the main effects are incorrect > summary(model2.5) #because we need to 0 center the data Call: lm(formula = bdi ~ epiNeur * traitanx) Residuals: Min 1Q Median 3Q Max -12.353 -2.625 -0.663 1.930 13.496 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -4.520422 2.751136 -1.643 0.1017 epiNeur -0.012794 0.227024 -0.056 0.9551 traitanx 0.210860 0.083505 2.525 0.0122 * epiNeur:traitanx 0.007290 0.005736 1.271 0.2051 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.293 on 227 degrees of freedom Multiple R-Squared: 0.4546, Adjusted R-squared: 0.4474 F-statistic: 63.06 on 3 and 227 DF, p-value: < 2.2e-16 > anova(model2,model2.5) #compare the two models Analysis of Variance Table Model 1: bdi ~ epiNeur + traitanx Model 2: bdi ~ epiNeur * traitanx Res.Df RSS Df Sum of Sq F Pr(>F) 1 228 4214.3 2 227 4184.5 1 29.8 1.615 0.2051 > > #rescale the data to do the analysis > cneur=scale(epiNeur,scale=F) #0 center epiNeur > zneur=scale(epiNeur,scale=T) #standardize epiNeur > ctrait = scale(traitanx,scale=F) #0 center traitAnx > > model3=lm(bdi~cneur+ctrait+cneur*ctrait) > summary(model3) #explicitly list the additive and interactive terms Call: lm(formula = bdi ~ cneur + ctrait + cneur * ctrait) Residuals: Min 1Q Median 3Q Max -12.353 -2.625 -0.663 1.930 13.496 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.532458 0.342787 19.057 < 2e-16 *** cneur 0.271581 0.085363 3.181 0.00167 ** ctrait 0.286759 0.044940 6.381 9.78e-10 *** cneur:ctrait 0.007290 0.005736 1.271 0.20509 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.293 on 227 degrees of freedom Multiple R-Squared: 0.4546, Adjusted R-squared: 0.4474 F-statistic: 63.06 on 3 and 227 DF, p-value: < 2.2e-16 > plot(model3) > model4=lm(bdi~cneur*ctrait) > summary(model4) #note how this is exactly the same as the previous model Call: lm(formula = bdi ~ cneur * ctrait) Residuals: Min 1Q Median 3Q Max -12.353 -2.625 -0.663 1.930 13.496 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.532458 0.342787 19.057 < 2e-16 *** cneur 0.271581 0.085363 3.181 0.00167 ** ctrait 0.286759 0.044940 6.381 9.78e-10 *** cneur:ctrait 0.007290 0.005736 1.271 0.20509 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.293 on 227 degrees of freedom Multiple R-Squared: 0.4546, Adjusted R-squared: 0.4474 F-statistic: 63.06 on 3 and 227 DF, p-value: < 2.2e-16

For the next set of fits, we standardize all of the EPI scales and the BFI scales, and then use the combined sets to fit the data. (This is probably overfitting, and is done as an example what can be done, not best practices.)
epi=cbind(epiS,epiImp,epilie,epiNeur) #form a new variable "epi" without overall extraversion epi=as.matrix(epi) #actually, more useful to treat this variable as a matrix bfi=as.matrix(cbind(bfext,bfneur,bfagree,bfcon,bfopen)) #create bfi as a matrix as well epi=scale(epi,scale=T) #standardize the epi bfi=scale(bfi,scale=T) #standardize the bfi model5=lm(bdi~bfi) #model beck depression by the Big 5 inventory summary(model5) model6=lm(bdi~bfi+epi) # summary(model6) model7 = lm(bdi~bfi*epi) #additive model of epi and bfi as well as the interactions between the sets summary(model7) #given as an example of overfitting layout(1) #reset the plotting device to do single graphs > epi=cbind(epiS,epiImp,epilie,epiNeur) #form a new variable "epi" without overall extraversion > epi=as.matrix(epi) #actually, more useful to treat this variable as a matrix > > bfi=as.matrix(cbind(bfext,bfneur,bfagree,bfcon,bfopen)) #create bfi as a matrix as well > epi=scale(epi,scale=T) #standardize the epi > bfi=scale(bfi,scale=T) #standardize the bfi > > model5=lm(bdi~bfi) #model beck depression by the Big 5 inventory > summary(model5) Call: lm(formula = bdi ~ bfi) Residuals: Min 1Q Median 3Q Max -10.2635 -3.1809 -0.6355 2.1409 19.4566 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.7792 0.3239 20.933 < 2e-16 *** bfibfext -0.3519 0.3929 -0.896 0.3713 bfibfneur 3.0576 0.3459 8.840 2.81e-16 *** bfibfagree 0.2932 0.4094 0.716 0.4746 bfibfcon -0.8968 0.3680 -2.437 0.0156 * bfibfopen -1.0197 0.4013 -2.541 0.0117 * --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.922 on 225 degrees of freedom Multiple R-Squared: 0.2894, Adjusted R-squared: 0.2736 F-statistic: 18.33 on 5 and 225 DF, p-value: 2.975e-15 > model6=lm(bdi~bfi+epi) # > summary(model6) Call: lm(formula = bdi ~ bfi + epi) Residuals: Min 1Q Median 3Q Max -11.7001 -2.7939 -0.6687 2.2363 16.3576 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.77922 0.30055 22.556 < 2e-16 *** bfibfext 0.32262 0.45645 0.707 0.48044 bfibfneur 1.35905 0.41935 3.241 0.00138 ** bfibfagree 0.06383 0.38805 0.164 0.86950 bfibfcon -0.66832 0.37153 -1.799 0.07341 . bfibfopen -1.00298 0.37404 -2.681 0.00788 ** epiepiS 0.09068 0.39339 0.231 0.81791 epiepiImp -0.62218 0.36787 -1.691 0.09219 . epiepilie -0.26893 0.33323 -0.807 0.42050 epiepiNeur 2.45393 0.41509 5.912 1.27e-08 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.568 on 221 degrees of freedom Multiple R-Squared: 0.3989, Adjusted R-squared: 0.3744 F-statistic: 16.3 on 9 and 221 DF, p-value: < 2.2e-16 > > model7 = lm(bdi~bfi*epi) #additive model of epi and bfi as well as the interactions between the sets > summary(model7) #given as an example of overfitting Call: lm(formula = bdi ~ bfi * epi) Residuals: Min 1Q Median 3Q Max -12.0191 -2.6525 -0.5497 1.9093 14.9430 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.279425 0.423041 14.844 < 2e-16 *** bfibfext 0.206300 0.493871 0.418 0.6766 bfibfneur 1.158978 0.451906 2.565 0.0111 * bfibfagree 0.002001 0.417169 0.005 0.9962 bfibfcon -0.675954 0.393181 -1.719 0.0871 . bfibfopen -0.685177 0.416177 -1.646 0.1013 epiepiS 0.235251 0.430179 0.547 0.5851 epiepiImp -0.593336 0.396046 -1.498 0.1357 epiepilie -0.228319 0.365699 -0.624 0.5331 epiepiNeur 2.655196 0.432756 6.136 4.44e-09 *** bfibfext:epiepiS 0.236116 0.458918 0.515 0.6075 bfibfneur:epiepiS 0.319599 0.403803 0.791 0.4296 bfibfagree:epiepiS -0.263338 0.484497 -0.544 0.5874 bfibfcon:epiepiS -0.055400 0.413468 -0.134 0.8935 bfibfopen:epiepiS 0.144438 0.496949 0.291 0.7716 bfibfext:epiepiImp -0.065013 0.463085 -0.140 0.8885 bfibfneur:epiepiImp -0.150381 0.393607 -0.382 0.7028 bfibfagree:epiepiImp 0.335234 0.478316 0.701 0.4842 bfibfcon:epiepiImp 0.021982 0.394558 0.056 0.9556 bfibfopen:epiepiImp 0.355919 0.539411 0.660 0.5101 bfibfext:epiepilie -0.053018 0.424651 -0.125 0.9008 bfibfneur:epiepilie -0.088985 0.420050 -0.212 0.8324 bfibfagree:epiepilie 0.474549 0.474784 1.000 0.3188 bfibfcon:epiepilie 0.026155 0.382927 0.068 0.9456 bfibfopen:epiepilie -0.086944 0.463418 -0.188 0.8514 bfibfext:epiepiNeur 0.700558 0.460097 1.523 0.1294 bfibfneur:epiepiNeur 0.579644 0.369567 1.568 0.1184 bfibfagree:epiepiNeur -0.475656 0.442734 -1.074 0.2839 bfibfcon:epiepiNeur -0.070141 0.460483 -0.152 0.8791 bfibfopen:epiepiNeur -0.213680 0.417856 -0.511 0.6097 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 4.607 on 201 degrees of freedom Multiple R-Squared: 0.444, Adjusted R-squared: 0.3638 F-statistic: 5.534 on 29 and 201 DF, p-value: 5.879e-14