R Guide -- Regressopm

Introduction
Getting the data
Regressions from the raw data
Regressions from the correlation matrix
Regressions from the raw data or correlation matrix using the setCor function

Introduction

A standard problem in psychology is to predict a dependent variable as a function of multiple independent variables. This is, of course, the problem of multiple regression. R does this as one case of the standard linear model. This little tutorial shows how to do multiple regression using classic R or some convenient functions in the psych package.

model=Y~X

Both Y and X can be matrices, in which case as many multiple regressions will be done as the number of variables in Y.

Consider the following (artificial) data set of ability tests, motivation tests, and performance measures. The Ability tests are (simulated) GRE Verbal, Quantitative, and Advanced, the motivation tests measure Need for Achievement and Anxiety, the Performance measures are graduate GPA, rated performance on the Prelims, and quality of the Masters Thesis.

Getting the data and finding the correlation matrix

Somehow we need to read the data and find the correlations.

datafilename="http://personality-project.org/R/datasets/psychometrics.prob2.txt" 
dataset =read.table(datafilename,header=TRUE)  #read the data file
#or
dataset <- read.file(datafilename)    #this uses the psych read.file function
names(dataset)                                 #what are the variables?
dataset=dataset[,-1]                           #get rid of the ID
names(dataset)                                 #check the names again
round(cor(dataset),2)                         #find the correlation matrix
dataset=scale(dataset)                        #convert to standardized scores
dataset=data.frame(dataset)

#gives this output

> datafilename="http://personality-project.org/R/datasets/psychometrics.prob2.txt" 
> dataset <- read.file(datafilename)    #this uses the psych read.file function
Data from the .txt file http://personality-project.org/R/datasets/psychometrics.prob2.txt has been loaded.
>  names(dataset)                                 #what are the variables?
[1] "ID"     "GREV"   "GREQ"   "GREA"   "Ach"    "Anx"    "Prelim" "GPA"    "MA"    
> dataset=dataset[,-1]                           #get rid of the ID
> names(dataset)                                 #check the names again
[1] "GREV"   "GREQ"   "GREA"   "Ach"    "Anx"    "Prelim" "GPA"    "MA"    
> lowerCor(dataset)     #lowerCor finds the correlations and prints it in a nice way
       ID    GREV  GREQ  GREA  Ach   Anx   Prelm GPA   MA   
ID      1.00                                                
GREV   -0.01  1.00                                          
GREQ    0.00  0.73  1.00                                    
GREA   -0.01  0.64  0.60  1.00                              
Ach     0.00  0.01  0.01  0.45  1.00                        
Anx    -0.01  0.01  0.01 -0.39 -0.56  1.00                  
Prelim  0.02  0.43  0.38  0.57  0.30 -0.23  1.00            
GPA     0.00  0.42  0.37  0.52  0.28 -0.22  0.42  1.00      
MA     -0.01  0.32  0.29  0.45  0.26 -0.22  0.36  0.31  1.00
> dataset=scale(dataset)                        #convert to standardized scores
> dataset=data.frame(dataset)

Regressions from the raw data

The typical multiple regression predicts a criterion in terms of a set of predictors. For this example, form ability, motivation and performance subsets and then predict the performance measures from the other two subsets. We use the 'with' function to specify the data set

with(dataset,{                         #we are doing several operations with this dataset
ability=cbind(GREV,GREQ,GREA)                  #form three different composites
motivation=cbind(Ach,Anx)
perform=cbind(Prelim,GPA,MA)
model.regress=lm(perform~ability+motivation)
model.regress                                  #show the resulting regression
print(model.regress,digits=3)                 #another way of showing the result
summary(model.regress)                          #yet another way of showing the output
#produces this output
})                      #this the end of the with statement


                                      #show the resulting regression

> with(dataset,{
+ ability=cbind(GREV,GREQ,GREA)                  #form three different composites
+ motivation=cbind(Ach,Anx)
+ perform=cbind(Prelim,GPA,MA)
+ model.regress=lm(perform~ability+motivation)
+ model.regress                                  #show the resulting regression
+ print(model.regress,digits=3)                 #another way of showing the result
+ summary(model.regress)                          #yet another way of showing the output
+ #produces this output
+ })

Call:
lm(formula = perform ~ ability + motivation)

Coefficients:
               Prelim     GPA        MA       
(Intercept)     6.438630   2.515062   1.803721
abilityGREV     0.001395   0.000940   0.000485
abilityGREQ     0.000402   0.000246   0.000122
abilityGREA     0.004257   0.001430   0.001531
motivationAch   0.012290   0.006068   0.004832
motivationAnx  -0.000908  -0.002383  -0.002280

Response Prelim :

Call:
lm(formula = Prelim ~ ability + motivation)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.02566 -0.58533 -0.00426  0.53273  2.87633 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    6.4386298  0.3270311  19.688  < 2e-16 ***
abilityGREV    0.0013946  0.0004247   3.283 0.001061 ** 
abilityGREQ    0.0004018  0.0004028   0.998 0.318722    
abilityGREA    0.0042566  0.0004781   8.903  < 2e-16 ***
motivationAch  0.0122898  0.0036989   3.323 0.000925 ***
motivationAnx -0.0009080  0.0034600  -0.262 0.793050    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8598 on 994 degrees of freedom
Multiple R-squared:  0.3429,	Adjusted R-squared:  0.3396 
F-statistic: 103.8 on 5 and 994 DF,  p-value: < 2.2e-16


Response GPA :

Call:
lm(formula = GPA ~ ability + motivation)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3069 -0.3080  0.0112  0.3032  1.2733 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2.5150625  0.1608958  15.632  < 2e-16 ***
abilityGREV    0.0009396  0.0002090   4.497 7.72e-06 ***
abilityGREQ    0.0002464  0.0001982   1.243 0.214094    
abilityGREA    0.0014297  0.0002352   6.078 1.73e-09 ***
motivationAch  0.0060682  0.0018198   3.334 0.000886 ***
motivationAnx -0.0023829  0.0017023  -1.400 0.161883    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.423 on 994 degrees of freedom
Multiple R-squared:  0.2931,	Adjusted R-squared:  0.2895 
F-statistic: 82.43 on 5 and 994 DF,  p-value: < 2.2e-16


Response MA :

Call:
lm(formula = MA ~ ability + motivation)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.31674 -0.30219  0.01197  0.30640  1.42213 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.8037210  0.1666887  10.821  < 2e-16 ***
abilityGREV    0.0004850  0.0002165   2.240   0.0253 *  
abilityGREQ    0.0001216  0.0002053   0.592   0.5537    
abilityGREA    0.0015309  0.0002437   6.282 4.98e-10 ***
motivationAch  0.0048316  0.0018854   2.563   0.0105 *  
motivationAnx -0.0022796  0.0017636  -1.293   0.1964    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4383 on 994 degrees of freedom
Multiple R-squared:  0.2177,	Adjusted R-squared:  0.2138 
F-statistic: 55.32 on 5 and 994 DF,  p-value: < 2.2e-16


Call:
lm(formula = perform ~ ability + motivation)

Coefficients:
               Prelim      GPA         MA        
(Intercept)    -5.627e-16   1.786e-15   3.019e-15
abilityGREV     1.399e-01   1.987e-01   1.041e-01
abilityGREQ     3.944e-02   5.098e-02   2.555e-02
abilityGREA     4.041e-01   2.861e-01   3.111e-01
motivationAch   1.143e-01   1.190e-01   9.621e-02
motivationAnx  -8.500e-03  -4.703e-02  -4.569e-02

> print(model,digits=3)                          #another way of showing the result

Call:
lm(formula = perform ~ ability + motivation)

Coefficients:
               Prelim     GPA        MA       
(Intercept)    -5.63e-16   1.79e-15   3.02e-15
abilityGREV     1.40e-01   1.99e-01   1.04e-01
abilityGREQ     3.94e-02   5.10e-02   2.56e-02
abilityGREA     4.04e-01   2.86e-01   3.11e-01
motivationAch   1.14e-01   1.19e-01   9.62e-02
motivationAnx  -8.50e-03  -4.70e-02  -4.57e-02

> summary(model)                                 #yet another way of showing the output
Response Prelim :

Call:
lm(formula = Prelim ~ ability + motivation)

Residuals:
      Min        1Q    Median        3Q       Max 
-2.859621 -0.553213 -0.004029  0.503496  2.718487 

Coefficients:
                Estimate Std. Error   t value Pr(>|t|)    
(Intercept)   -5.627e-16  2.570e-02 -2.19e-14 1.000000    
abilityGREV    1.399e-01  4.259e-02     3.283 0.001061 ** 
abilityGREQ    3.944e-02  3.954e-02     0.998 0.318722    
abilityGREA    4.041e-01  4.539e-02     8.903  < 2e-16 ***
motivationAch  1.143e-01  3.441e-02     3.323 0.000925 ***
motivationAnx -8.500e-03  3.239e-02    -0.262 0.793050    
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 0.8126 on 994 degrees of freedom
Multiple R-Squared: 0.3429,	Adjusted R-squared: 0.3396 
F-statistic: 103.8 on 5 and 994 DF,  p-value: < 2.2e-16 


Response GPA :

Call:
lm(formula = GPA ~ ability + motivation)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.60408 -0.61369  0.02232  0.60416  2.53718 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.786e-15  2.665e-02 6.7e-14 1.000000    
abilityGREV    1.987e-01  4.418e-02   4.497 7.72e-06 ***
abilityGREQ    5.098e-02  4.101e-02   1.243 0.214094    
abilityGREA    2.861e-01  4.708e-02   6.078 1.73e-09 ***
motivationAch  1.190e-01  3.569e-02   3.334 0.000886 ***
motivationAnx -4.703e-02  3.360e-02  -1.400 0.161883    
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 0.8429 on 994 degrees of freedom
Multiple R-Squared: 0.2931,	Adjusted R-squared: 0.2895 
F-statistic: 82.43 on 5 and 994 DF,  p-value: < 2.2e-16 


Response MA :

Call:
lm(formula = MA ~ ability + motivation)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.66414 -0.61142  0.02422  0.61993  2.87736 

Coefficients:
                Estimate Std. Error  t value Pr(>|t|)    
(Intercept)    3.019e-15  2.804e-02 1.08e-13   1.0000    
abilityGREV    1.041e-01  4.648e-02    2.240   0.0253 *  
abilityGREQ    2.555e-02  4.314e-02    0.592   0.5537    
abilityGREA    3.111e-01  4.952e-02    6.282 4.98e-10 ***
motivationAch  9.621e-02  3.754e-02    2.563   0.0105 *  
motivationAnx -4.569e-02  3.534e-02   -1.293   0.1964    
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 0.8867 on 994 degrees of freedom
Multiple R-Squared: 0.2177,	Adjusted R-squared: 0.2138 
F-statistic: 55.32 on 5 and 994 DF,  p-value: < 2.2e-16

Regression analysis based upon the correlation matrix -- using the solve(a,b) function.

These regressions were done with the raw data. This is very useful if we want to examine diagnostics for the regression, examine the residuals, look for badly fitted subjects, etc. But sometimes we want to do the regressions from the correlation matrix or covariance matrix without reverting to the raw data.

Using the same data set as before, we find the correlation matrix and then do some basic matrix algebra to do the multiple correlations.

r.matrix=cor(dataset)                    #find the correlations 
round(r.matrix,2)                        #show themm to 2 decimal placeds
a.matrix=r.matrix[1:5,1:5]               #the a.matrix are the predictors
b.matrix=r.matrix[1:5,6:8]               #the b.matrix has the criteria
a.matrix                                 #show the a.matrix
b.matrix                                 #show the b.matrix
model.mat=solve(a.matrix,b.matrix)       #solve the equation bY~aX
round(model.mat,2)                       #show the answer to 2 decimals
print(model.regression,digits=2)         #compare this to the multiple regression using the raw data



#produces this output

> r.matrix=cor(dataset)                    #find the correlations 
> round(r.matrix,2)                        #show themm to 2 decimal placeds
       GREV GREQ  GREA   Ach   Anx Prelim   GPA    MA
GREV   1.00 0.73  0.64  0.01  0.01   0.43  0.42  0.32
GREQ   0.73 1.00  0.60  0.01  0.01   0.38  0.37  0.29
GREA   0.64 0.60  1.00  0.45 -0.39   0.57  0.52  0.45
Ach    0.01 0.01  0.45  1.00 -0.56   0.30  0.28  0.26
Anx    0.01 0.01 -0.39 -0.56  1.00  -0.23 -0.22 -0.22
Prelim 0.43 0.38  0.57  0.30 -0.23   1.00  0.42  0.36
GPA    0.42 0.37  0.52  0.28 -0.22   0.42  1.00  0.31
MA     0.32 0.29  0.45  0.26 -0.22   0.36  0.31  1.00
> a.matrix=r.matrix[1:5,1:5]               #the a.matrix are the predictors
> b.matrix=r.matrix[1:5,6:8]               #the b.matrix has the criteria
> a.matrix                                 #show the a.matrix
            GREV        GREQ       GREA          Ach          Anx
GREV 1.000000000 0.728847119  0.6411517  0.005612846  0.010193145
GREQ 0.728847119 1.000000000  0.5963450  0.006846819  0.005469727
GREA 0.641151694 0.596345026  1.0000000  0.453464501 -0.389629680
Ach  0.005612846 0.006846819  0.4534645  1.000000000 -0.556186667
Anx  0.010193145 0.005469727 -0.3896297 -0.556186667  1.000000000
> b.matrix                                 #show the b.matrix
         Prelim        GPA         MA
GREV  0.4282415  0.4194686  0.3222975
GREQ  0.3830881  0.3669720  0.2873879
GREA  0.5724319  0.5162068  0.4545495
Ach   0.3033434  0.2763810  0.2634665
Anx  -0.2278877 -0.2224047 -0.2192203
> model.mat=solve(a.matrix,b.matrix)       #solve the equation bY~aX
> round(model.mat,2)                       #show the answer to 2 decimals
     Prelim   GPA    MA
GREV   0.14  0.20  0.10
GREQ   0.04  0.05  0.03
GREA   0.40  0.29  0.31
Ach    0.11  0.12  0.10
Anx   -0.01 -0.05 -0.05
> print(model.regression,digits=2)                    #compare this to the multiple regression using the raw data

Call:
lm(formula = perform ~ ability + motivation)

Coefficients:
               Prelim    GPA       MA      
(Intercept)    -5.6e-16   1.8e-15   3.0e-15
abilityGREV     1.4e-01   2.0e-01   1.0e-01
abilityGREQ     3.9e-02   5.1e-02   2.6e-02
abilityGREA     4.0e-01   2.9e-01   3.1e-01
motivationAch   1.1e-01   1.2e-01   9.6e-02
motivationAnx  -8.5e-03  -4.7e-02  -4.6e-02

Compare the results based upon the correlation matrix to that based upon the raw data. To find R^2, we need to multiply the beta weights times the zero order correlations

Doing this with the setCor function

The setCor function in the psych package will do these operations, either on the raw data or on the correlation matrix.

First do it on the raw data

> set.cor(y=6:8,x=1:5,data=dataset)
Call: set.cor(y = 6:8, x = 1:5, data = dataset)

Multiple Regression from raw data 

Beta weights 
     Prelim   GPA    MA
GREV   0.14  0.20  0.10
GREQ   0.04  0.05  0.03
GREA   0.40  0.29  0.31
Ach    0.11  0.12  0.10
Anx   -0.01 -0.05 -0.05

Multiple R 
Prelim    GPA     MA 
  0.59   0.54   0.47 

Multiple R2 
Prelim    GPA     MA 
  0.34   0.29   0.22 

 SE of Beta weights 
     Prelim  GPA   MA
GREV   0.04 0.04 0.05
GREQ   0.04 0.04 0.04
GREA   0.05 0.05 0.05
Ach    0.03 0.04 0.04
Anx    0.03 0.03 0.04

 t of Beta Weights 
     Prelim   GPA    MA
GREV   3.28  4.50  2.24
GREQ   1.00  1.24  0.59
GREA   8.90  6.08  6.28
Ach    3.32  3.33  2.56
Anx   -0.26 -1.40 -1.29

Probability of t < 
      Prelim     GPA      MA
GREV 0.00110 7.7e-06 2.5e-02
GREQ 0.32000 2.1e-01 5.5e-01
GREA 0.00000 1.7e-09 5.0e-10
Ach  0.00092 8.9e-04 1.1e-02
Anx  0.79000 1.6e-01 2.0e-01

 Shrunken R2 
Prelim    GPA     MA 
  0.34   0.29   0.21 

Standard Error of R2  
Prelim    GPA     MA 
 0.024  0.024  0.023 

F 
Prelim    GPA     MA 
103.76  82.43  55.32 

Probability of F < 
Prelim    GPA     MA 
     0      0      0 

 degrees of freedom of regression 
[1]   5 994

Various estimates of between set correlations
Squared Canonical Correlations 
[1] 0.4943 0.0036 0.0017
Chisq of canonical correlations 
[1] 678.1   3.6   1.7

 Average squared canonical correlation =  0.17
 Cohen's Set Correlation R2 =  0.5
 Shrunken Set Correlation R2 =  0.49
 F and df of Cohen's Set Correlation  51.35 15 2725.07

Now do this on the correlation matrix. First find the correlations using the lowerCor function:

 
 > r.matrix <- lowerCor(dataset)
       GREV  GREQ  GREA  Ach   Anx   Prelm GPA   MA   
GREV    1.00                                          
GREQ    0.73  1.00                                    
GREA    0.64  0.60  1.00                              
Ach     0.01  0.01  0.45  1.00                        
Anx     0.01  0.01 -0.39 -0.56  1.00                  
Prelim  0.43  0.38  0.57  0.30 -0.23  1.00            
GPA     0.42  0.37  0.52  0.28 -0.22  0.42  1.00      
MA      0.32  0.29  0.45  0.26 -0.22  0.36  0.31  1.00
> 
> set.cor(y=6:8,x=1:5,data=r.matrix)
Call: set.cor(y = 6:8, x = 1:5, data = r.matrix)

Multiple Regression from matrix input 

Beta weights 
     Prelim   GPA    MA
GREV   0.14  0.20  0.10
GREQ   0.04  0.05  0.03
GREA   0.40  0.29  0.31
Ach    0.11  0.12  0.10
Anx   -0.01 -0.05 -0.05

Multiple R 
Prelim    GPA     MA 
  0.59   0.54   0.47 

Multiple R2 
Prelim    GPA     MA 
  0.34   0.29   0.22 

Various estimates of between set correlations
Squared Canonical Correlations 
[1] 0.4943 0.0036 0.0017
Chisq of canonical correlations 
NULL

 Average squared canonical correlation =  0.17
 Cohen's Set Correlation R2 =  0.5
>

Specifying the sample size in the previous example will also give the standard errors and various goodness of fit statistics.

 
 > setCor(y=6:8,x=1:5,data=r.matrix,n.obs=1000)
Call: set.cor(y = 6:8, x = 1:5, data = r.matrix, n.obs = 1000)

Multiple Regression from matrix input 

Beta weights 
     Prelim   GPA    MA
GREV   0.14  0.20  0.10
GREQ   0.04  0.05  0.03
GREA   0.40  0.29  0.31
Ach    0.11  0.12  0.10
Anx   -0.01 -0.05 -0.05

Multiple R 
Prelim    GPA     MA 
  0.59   0.54   0.47 

Multiple R2 
Prelim    GPA     MA 
  0.34   0.29   0.22 

 SE of Beta weights 
     Prelim  GPA   MA
GREV   0.04 0.04 0.05
GREQ   0.04 0.04 0.04
GREA   0.05 0.05 0.05
Ach    0.03 0.04 0.04
Anx    0.03 0.03 0.04

 t of Beta Weights 
     Prelim   GPA    MA
GREV   3.28  4.50  2.24
GREQ   1.00  1.24  0.59
GREA   8.90  6.08  6.28
Ach    3.32  3.33  2.56
Anx   -0.26 -1.40 -1.29

Probability of t < 
      Prelim     GPA      MA
GREV 0.00110 7.7e-06 2.5e-02
GREQ 0.32000 2.1e-01 5.5e-01
GREA 0.00000 1.7e-09 5.0e-10
Ach  0.00092 8.9e-04 1.1e-02
Anx  0.79000 1.6e-01 2.0e-01

 Shrunken R2 
Prelim    GPA     MA 
  0.34   0.29   0.21 

Standard Error of R2  
Prelim    GPA     MA 
 0.024  0.024  0.023 

F 
Prelim    GPA     MA 
103.76  82.43  55.32 

Probability of F < 
Prelim    GPA     MA 
     0      0      0 

 degrees of freedom of regression 
[1]   5 994

Various estimates of between set correlations
Squared Canonical Correlations 
[1] 0.4943 0.0036 0.0017
Chisq of canonical correlations 
[1] 678.1   3.6   1.7

 Average squared canonical correlation =  0.17
 Cohen's Set Correlation R2 =  0.5
 Shrunken Set Correlation R2 =  0.49
 F and df of Cohen's Set Correlation  51.35 15 2725.07