psych has been updated to 2.4.4. Get the latest release from CRAN.
The next line (to install the latest release) has been commented out because I have already updated my version. However, you should do this once.
#install.packages("psych",repos="http://personality-project.org/r",type="source")
library(psych) #make it active
library(psychTools) #make this active as well
sessionInfo() #make sure it is there
## R version 4.4.0 beta (2024-04-12 r86412)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] psychTools_2.4.4 psych_2.4.4
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-164 cli_3.6.1 knitr_1.43 rlang_1.1.1 xfun_0.39
## [6] jsonlite_1.8.5 rtf_0.4-14.1 htmltools_0.5.8.1 sass_0.4.9 rmarkdown_2.26
## [11] grid_4.4.0 evaluate_0.21 jquerylib_0.1.4 fastmap_1.1.1 yaml_2.3.7
## [16] lifecycle_1.0.3 compiler_4.4.0 rstudioapi_0.16.0 R.oo_1.26.0 lattice_0.22-6
## [21] digest_0.6.31 R6_2.5.1 foreign_0.8-86 mnormt_2.1.1 parallel_4.4.0
## [26] bslib_0.7.0 R.methodsS3_1.8.2 tools_4.4.0 cachem_1.0.8
#news(Version > "2.2.4",package="psych") #read the news for this release
packageDate(pkg="psych") #should be April 23
## [1] "2024-04-23"
There are many forms of the linear model.
\[y = b_1x + e\] is the classic regression model. If x is a dichotomous variable, this is just a t-test. If x is categorical, we can do this as an analysis of variance. In this case, we think of x causing y. The slope (\(b_1\)) is the covariance of x and y divided by the variance of x:
\[b_1 = \frac{cov_{xy}}{var_x} = \frac{\sigma_{xy}}{\sigma_x^2} \]
If we instead think of y causing x, this becomes:
\[x = b_2y + e\] where the slope is now \(b_2 = \frac{\sigma_{xy}}{\sigma_y^2}\).
If we are unsure of the direction of causality, we can take the geometric mean of the two regression slopes, which gives the correlation:
\[r_{xy} = \frac{cov_{xy}}{\sqrt{\sigma^2_x \sigma^2_y}} = cov_{z_x z_y}\] which is the covariance of x and y divided by the square root of the product of the two variances, or equivalently, the covariance of the standard scores.
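To make these formulas concrete, here is a minimal numerical sketch (using simulated data, not anything from this tutorial) showing that the slope is the covariance divided by the variance of the predictor, and that the correlation is the covariance of the standard scores.
set.seed(17) #any seed will do; the data are purely illustrative
x <- rnorm(100)
y <- .5 * x + rnorm(100)
cov(x, y) / var(x) #b1 = cov(x,y)/var(x)
coef(lm(y ~ x))["x"] #matches the slope from lm
cov(scale(x), scale(y)) #covariance of the standard scores ...
cor(x, y) #... equals the correlation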
If we have more than one predictor, we have multiple regression:
\[y = b_1 x_1 + b_2 x_2 + e\] If \(x_1\) and \(x_2\) are categorical, this is also known as an analysis of variance.
First, just show the npk dataset from core R. Can you see any relationships here?
npk #The data come from Venables
## block N P K yield
## 1 1 0 1 1 49.5
## 2 1 1 1 0 62.8
## 3 1 0 0 0 46.8
## 4 1 1 0 1 57.0
## 5 2 1 0 0 59.8
## 6 2 1 1 1 58.5
## 7 2 0 0 1 55.5
## 8 2 0 1 0 56.0
## 9 3 0 1 0 62.8
## 10 3 1 1 1 55.8
## 11 3 1 0 0 69.5
## 12 3 0 0 1 55.0
## 13 4 1 0 0 62.0
## 14 4 1 1 1 48.8
## 15 4 0 0 1 45.5
## 16 4 0 1 0 44.2
## 17 5 1 1 0 52.0
## 18 5 0 0 0 51.5
## 19 5 1 0 1 49.8
## 20 5 0 1 1 48.8
## 21 6 1 0 1 57.2
## 22 6 1 1 0 59.0
## 23 6 0 1 1 53.2
## 24 6 0 0 0 56.0
As we do for any data set, we want to describe it and perhaps examine the pairwise relationships.
describe(npk)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## block* 1 24 3.50 1.74 3.50 3.50 2.22 1.0 6.0 5.0 0.00 -1.41 0.36
## N* 2 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## P* 3 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## K* 4 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## yield 5 24 54.88 6.17 55.65 54.75 6.15 44.2 69.5 25.3 0.24 -0.51 1.26
pairs.panels(npk)
When we describe the data, some variables are marked with an *, which indicates that they are not numeric (they are actually Factors with levels) but are treated by describe as numeric. We can examine the str (structure) of the data to see this.
The pairs.panels function automatically converts Factor or Character values to numeric. The older setCor function did not, but its replacement lmCor does. We used to need to do this conversion by hand, but no longer do.
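If you did want to do the conversion by hand, one minimal approach (just a sketch; the char2numeric function used later is easier) is to replace each factor by its integer codes:
npk.num <- data.frame(lapply(npk, function(f) if (is.factor(f)) as.numeric(f) else f)) #factors become 1, 2, ...
str(npk.num) #now everything is numeric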
str(npk)
## 'data.frame': 24 obs. of 5 variables:
## $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ N : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
## $ P : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
## $ K : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
## $ yield: num 49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
Examining the correlations of the npk data, we see that there are four experimental variables (IVs) and one dependent variable (DV). Because this is a balanced experiment, the correlations among the IVs are all exactly zero, and the IVs have different relationships with the DV.
Standardized regressions are just correlations. We can show the relationships as a path model using the lmCor function. By default, we show the standardized solution, which expresses everything as correlations (left panel). However, we may also request the unstandardized solution (right panel) to see the raw variance components.
par(mfrow=c(1,2)) #set up a two panel graph
mod0 <- lmCor(yield ~ N + P + K, data=npk,main="Standardized Regression")#show this as a figure
mod0 <- lmCor(yield ~ N + P + K, data=npk,std=FALSE,main="Unstandardized Regression") #do not standardize
par(mfrow=c(1,1)) #set it back to a one panel graph
If the IV is categorical and has just two levels, we can examine the effect of the IV on the DV by doing a t-test.
t.test(yield ~ K, data = npk)
##
## Welch Two Sample t-test
##
## data: yield by K
## t = 1.6374, df = 17.621, p-value = 0.1193
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.135414 9.102081
## sample estimates:
## mean in group 0 mean in group 1
## 56.86667 52.88333
The t-test is a special case of the linear model (in this case the X variable has just two values). We can do the same analysis as the t-test but use the linear model function lm.
model <- lm(yield ~ K, data = npk)
summary(model)
##
## Call:
## lm(formula = yield ~ K, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.667 -4.083 1.217 4.167 12.633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.867 1.720 33.059 <2e-16 ***
## K1 -3.983 2.433 -1.637 0.116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.959 on 22 degrees of freedom
## Multiple R-squared: 0.1086, Adjusted R-squared: 0.06812
## F-statistic: 2.681 on 1 and 22 DF, p-value: 0.1158
Or, we can use the lmCor function to do this.
model1 <- lmCor(yield ~ K, data = npk)
There are several ways of showing this result. We can draw error bars by group using error.bars.by.
error.bars.by(yield ~ K,data=npk)
Or we can use the lmCor function to do the linear model and draw the graph. Note that the t statistic given for the regression slope is the same as the t statistic when doing the normal t-test.
lmCor(yield ~ K, data = npk) #same syntax as lm
## Call: lmCor(y = yield ~ K, data = npk)
##
## Multiple Regression from raw data
##
## DV = yield
## slope se t p lower.ci upper.ci VIF Vy.x
## (Intercept) 0.00 0.2 0.00 1.00 -0.42 0.42 1 0.00
## K -0.33 0.2 -1.64 0.12 -0.75 0.09 1 0.11
##
## Residual Standard Error = 0.97 with 22 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## yield 0.33 0.11 0.33 0.11 0.07 0.1 2.68 1 22 0.116
Analysis of variance is a special case of the linear model. We can see this when we compare two anovas with the corresponding multiple regressions. The first uses just a single Independent Variable (IV) while the second uses all three predictors. The last analysis (using lmCor) does the same as lm but draws a path model as well.
mod1 <- aov(yield ~ N,data=npk)
lm1 <- lm(yield ~ N, data=npk)
summary(mod1)
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.061 0.0221 *
## Residuals 22 687.1 31.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm1) #identical to the anova (in terms of p values, ts are sqrts of Fs)
##
## Call:
## lm(formula = yield ~ N, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8833 -3.7667 0.1667 3.5583 11.8167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.067 1.613 32.274 <2e-16 ***
## N1 5.617 2.281 2.462 0.0221 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.588 on 22 degrees of freedom
## Multiple R-squared: 0.216, Adjusted R-squared: 0.1803
## F-statistic: 6.061 on 1 and 22 DF, p-value: 0.02213
mod2 <- aov(yield ~ N + P + K,data=npk)
lm2 <- lm(yield ~ N + P + K,data=npk)
summary(mod2)
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.488 0.0192 *
## P 1 8.4 8.40 0.288 0.5974
## K 1 95.2 95.20 3.263 0.0859 .
## Residuals 20 583.5 29.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm2) #also identical in terms of p values
##
## Call:
## lm(formula = yield ~ N + P + K, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2667 -3.6542 0.7083 3.4792 9.3333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.650 2.205 24.784 <2e-16 ***
## N1 5.617 2.205 2.547 0.0192 *
## P1 -1.183 2.205 -0.537 0.5974
## K1 -3.983 2.205 -1.806 0.0859 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.401 on 20 degrees of freedom
## Multiple R-squared: 0.3342, Adjusted R-squared: 0.2343
## F-statistic: 3.346 on 3 and 20 DF, p-value: 0.0397
#now do the analysis but draw a path model as well
lmCor(yield ~ N + P + K,data=npk,main = "Regression Model using lmCor")
## Call: lmCor(y = yield ~ N + P + K, data = npk, main = "Regression Model using lmCor")
##
## Multiple Regression from raw data
##
## DV = yield
## slope se t p lower.ci upper.ci VIF Vy.x
## (Intercept) 0.00 0.18 0.00 1.000 -0.38 0.38 1 0.00
## N 0.46 0.18 2.55 0.019 0.08 0.85 1 0.22
## P -0.10 0.18 -0.54 0.600 -0.48 0.28 1 0.01
## K -0.33 0.18 -1.81 0.086 -0.71 0.05 1 0.11
##
## Residual Standard Error = 0.88 with 20 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## yield 0.58 0.33 0.52 0.27 0.23 0.12 3.35 3 20 0.0397
Note that although the output looks very different, the t values of the path coefficients are the square roots of the F values in the aov model.
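As a quick check (this is not part of the original output, and assumes lm2 and mod2 from the chunk above are still in the workspace), we can square the t values from the lm fit and compare them with the F values from the aov table:
coef(summary(lm2))[-1, "t value"]^2 #squared t values, dropping the intercept row
summary(mod2)[[1]][["F value"]] #F values from the anova table (the last entry, for Residuals, is NA)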
Regression is most appropriate with continuous predictors. We do this for one predictor (SATV) and one DV (ACT). We plot the data points, add the best fitting regression line, and then summarize the model.
plot(ACT ~ SATV, data= sat.act) #show the points
abline(lm(ACT ~ SATV, data=sat.act)) #show the best fitting line
summary(lm(ACT ~ SATV, data=sat.act)) #summarize the results
##
## Call:
## lm(formula = ACT ~ SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.666 -2.454 -0.031 2.548 16.334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.871813 0.833334 16.65 <2e-16 ***
## SATV 0.023970 0.001339 17.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.996 on 698 degrees of freedom
## Multiple R-squared: 0.3148, Adjusted R-squared: 0.3138
## F-statistic: 320.7 on 1 and 698 DF, p-value: < 2.2e-16
But typically we have more predictors. Although showing the individual regressions is possible (and we do so as a two panel graph), we can also summarize them graphically with a path model, as we did earlier. When doing regressions, the slopes are in the units of the predictor (e.g., SATV) and are sometimes hard to interpret. Thus, we can express the slopes as standardized values to make them more understandable.
par(mfrow=c(1,2)) #create a two panel graph
plot(ACT ~ SATV, data= sat.act) #show the points
abline(lm(ACT ~ SATV, data=sat.act))
plot(ACT ~ SATQ, data= sat.act) #show the points
abline(lm(ACT ~ SATQ, data=sat.act))
#do another 2 panel graph; we suppress the actual output
mod3 <- lmCor(ACT ~ SATV + SATQ, data=sat.act,main="Standardized (Correlations)")
mod4 <- lmCor(ACT ~ SATV + SATQ, data=sat.act, std=FALSE,main="unstandardized")
par(mfrow=c(1,1)) #convert back to the normal 1 panel graph
summary(lm(ACT ~ SATV + SATQ, data=sat.act))
##
## Call:
## lm(formula = ACT ~ SATV + SATQ, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6795 -2.3076 0.0831 2.1924 18.5489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.565946 0.854058 12.371 < 2e-16 ***
## SATV 0.013284 0.001649 8.054 3.56e-15 ***
## SATQ 0.016142 0.001616 9.990 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.743 on 684 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.4015, Adjusted R-squared: 0.3997
## F-statistic: 229.4 on 2 and 684 DF, p-value: < 2.2e-16
An important question is whether one variable affects the relationship between another IV and DV. In analysis of variance terms, this is asking if there is an interaction. The standard anova approach just does this by including the interaction term (shown as a product in the formula). But when we try this with regression, things get more complicated.
mod2 <- aov(yield ~ N * P * K,data = npk)
summary(mod2)
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.161 0.0245 *
## P 1 8.4 8.40 0.273 0.6082
## K 1 95.2 95.20 3.099 0.0975 .
## N:P 1 21.3 21.28 0.693 0.4175
## N:K 1 33.1 33.14 1.078 0.3145
## P:K 1 0.5 0.48 0.016 0.9019
## N:P:K 1 37.0 37.00 1.204 0.2887
## Residuals 16 491.6 30.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But when we do this as a linear model, it doesn't match! Compare the p values. Only the 3-way interaction is correct.
summary(lm(yield ~ N * P * K,data = npk))
##
## Call:
## lm(formula = yield ~ N * P * K, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.133 -4.133 1.250 3.125 8.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.4333 3.2002 16.072 2.7e-11 ***
## N1 12.3333 4.5258 2.725 0.015 *
## P1 2.9000 4.5258 0.641 0.531
## K1 0.5667 4.5258 0.125 0.902
## N1:P1 -8.7333 6.4004 -1.365 0.191
## N1:K1 -9.6667 6.4004 -1.510 0.150
## P1:K1 -4.4000 6.4004 -0.687 0.502
## N1:P1:K1 9.9333 9.0515 1.097 0.289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.543 on 16 degrees of freedom
## Multiple R-squared: 0.4391, Adjusted R-squared: 0.1937
## F-statistic: 1.789 on 7 and 16 DF, p-value: 0.1586
The problem is that the product terms are correlated with the main effects. We can show this by forming the product terms. Although tedious, it is possible to find the 4 product terms. We do this, and then put the results into a data.frame, which we then plot. Note how the product terms are correlated with their elements. This makes sense, but is a problem if we want to do the modeling.
First we create a new data.frame (NPK) which is made up of the numeric equivalents of what was in npk.
NPK <- char2numeric(npk,flag=FALSE)
describe(npk) #the original data are not technically numeric, they are factors
## vars n mean sd median trimmed mad min max range skew kurtosis se
## block* 1 24 3.50 1.74 3.50 3.50 2.22 1.0 6.0 5.0 0.00 -1.41 0.36
## N* 2 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## P* 3 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## K* 4 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## yield 5 24 54.88 6.17 55.65 54.75 6.15 44.2 69.5 25.3 0.24 -0.51 1.26
describe(NPK) #we have converted them to numeric
## vars n mean sd median trimmed mad min max range skew kurtosis se
## block 1 24 3.50 1.74 3.50 3.50 2.22 1.0 6.0 5.0 0.00 -1.41 0.36
## N 2 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## P 3 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## K 4 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## yield 5 24 54.88 6.17 55.65 54.75 6.15 44.2 69.5 25.3 0.24 -0.51 1.26
Now show the correlations of the product terms and the main effects. We use the gap parameter in pairs.panels to show a slightly more compact graphic.
NP <- NPK$N * NPK$P
NK <- NPK$N * NPK$K
PK <- NPK$P * NPK$K
PKN <- PK * NPK$N
NPK.prods <- data.frame(NPK,NP,NK,PK,PKN)
pairs.panels(NPK.prods,gap=0) #show the correlations, tighten up the figure
The simple products are correlated with each other and with the basic factors. To get around this problem, we center the data using the scale function. scale will center the data, and if scale=TRUE it will also standardize (divide by the standard deviation). Because the default is to standardize, we specify scale=FALSE. We also need to convert the output of the scale function from a matrix to a data.frame.
centered.NPK <- scale(NPK,scale=FALSE)
centered.NPK <- data.frame(scale(NPK,scale=FALSE))
c.NP <- centered.NPK$N * centered.NPK$P
c.NK <- centered.NPK$N * centered.NPK$K
c.PK <- centered.NPK$P * centered.NPK$K
c.PKN <- c.PK * centered.NPK$N
center.prod <- data.frame(centered.NPK,c.NP,c.NK,c.PK,c.PKN)
pairs.panels(center.prod) #show the results grapically
lowerCor(center.prod) #show it a correlation matrix as well
## block N P K yield c.NP c.NK c.PK c.PKN
## block 1.00
## N 0.00 1.00
## P 0.00 0.00 1.00
## K 0.00 0.00 0.00 1.00
## yield -0.16 0.46 -0.10 -0.33 1.00
## c.NP 0.00 0.00 0.00 0.00 -0.16 1.00
## c.NK 0.00 0.00 0.00 0.00 -0.19 0.00 1.00
## c.PK 0.00 0.00 0.00 0.00 0.02 0.00 0.00 1.00
## c.PKN -0.29 0.00 0.00 0.00 0.21 0.00 0.00 0.00 1.00
Note the products are now uncorrelated. If we now use the linear model function, it agrees with our anova output.
lm.mod <- lm(yield ~ N * P * K, data = centered.NPK) #do the analysis on the centered data
summary(lm.mod) #show the moderated multiple regression
##
## Call:
## lm(formula = yield ~ N * P * K, data = centered.NPK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.133 -4.133 1.250 3.125 8.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.137e-15 1.131e+00 0.000 1.0000
## N 5.617e+00 2.263e+00 2.482 0.0245 *
## P -1.183e+00 2.263e+00 -0.523 0.6082
## K -3.983e+00 2.263e+00 -1.760 0.0975 .
## N:P -3.767e+00 4.526e+00 -0.832 0.4175
## N:K -4.700e+00 4.526e+00 -1.038 0.3145
## P:K 5.667e-01 4.526e+00 0.125 0.9019
## N:P:K 9.933e+00 9.052e+00 1.097 0.2887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.543 on 16 degrees of freedom
## Multiple R-squared: 0.4391, Adjusted R-squared: 0.1937
## F-statistic: 1.789 on 7 and 16 DF, p-value: 0.1586
summary(mod2) #compare with the original anova output
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.161 0.0245 *
## P 1 8.4 8.40 0.273 0.6082
## K 1 95.2 95.20 3.099 0.0975 .
## N:P 1 21.3 21.28 0.693 0.4175
## N:K 1 33.1 33.14 1.078 0.3145
## P:K 1 0.5 0.48 0.016 0.9019
## N:P:K 1 37.0 37.00 1.204 0.2887
## Residuals 16 491.6 30.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmCor does the centering automatically when finding product terms.
mod1 <- lmCor(yield ~ N * P * K, data = npk)
mod1 #show the results
## Call: lmCor(y = yield ~ N * P * K, data = npk)
##
## Multiple Regression from raw data
##
## DV = yield
## slope se t p lower.ci upper.ci VIF Vy.x
## (Intercept) 0.00 0.19 0.00 1.000 -0.40 0.40 1 0.00
## N 0.46 0.19 2.48 0.025 0.07 0.86 1 0.22
## P -0.10 0.19 -0.52 0.610 -0.49 0.30 1 0.01
## K -0.33 0.19 -1.76 0.097 -0.73 0.07 1 0.11
## N*P -0.16 0.19 -0.83 0.420 -0.55 0.24 1 0.02
## N*K -0.19 0.19 -1.04 0.310 -0.59 0.20 1 0.04
## P*K 0.02 0.19 0.13 0.900 -0.37 0.42 1 0.00
## N*P*K 0.21 0.19 1.10 0.290 -0.19 0.60 1 0.04
##
## Residual Standard Error = 0.9 with 16 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## yield 0.66 0.44 0.56 0.31 0.19 0.1 1.79 7 16 0.159
Once again, to do the analysis correctly, we need to center the data. Here we use the sat.act data and ask whether gender moderates the relationship between SATV and ACT.
cen.sat.act <- data.frame(scale(sat.act,scale=FALSE))
lm.mod2 <- lm(ACT ~ gender * SATV , data= cen.sat.act)
summary(lm.mod2)
##
## Call:
## lm(formula = ACT ~ gender * SATV, data = cen.sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4788 -2.3916 -0.0332 2.4952 15.6586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.005219 0.150824 -0.035 0.9724
## gender -0.255094 0.315645 -0.808 0.4193
## SATV 0.023913 0.001337 17.886 <2e-16 ***
## gender:SATV -0.005137 0.002785 -1.844 0.0655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.99 on 696 degrees of freedom
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.3159
## F-statistic: 108.6 on 3 and 696 DF, p-value: < 2.2e-16
There is a suggestion of an interaction between gender and SATV. To see this graphically, we draw the data (using a different plot character for males and females) and then draw two lines, one for males, and one for females. The interaction is a difference in the slopes of these regressions.
plot(ACT ~ SATV, pch= 21-gender, col=c("blue","red")[gender],data=sat.act,main="ACT scores differ by SATV and gender")
abline(lm(ACT ~ SATV,data = sat.act[sat.act$gender==1,]))
abline(lm(ACT ~ SATV,data = sat.act[sat.act$gender==2,]),lty="dashed")
# or another way of doing this is with the by statement
plot(ACT ~ SATV, pch= 21-gender, col=c("blue","red")[gender],data=sat.act,main="ACT scores differ by SATV and gender")
by(sat.act,sat.act$gender, function(x) abline(lm(ACT ~ SATV, data=x),lty=c("solid","dashed")[x$gender]))
## sat.act$gender: 1
## NULL
## ---------------------------------------------------------------------------
## sat.act$gender: 2
## NULL
by(sat.act,sat.act$gender,function(x) lm(ACT ~ SATV, data=x)) #this tells us which is which
## sat.act$gender: 1
##
## Call:
## lm(formula = ACT ~ SATV, data = x)
##
## Coefficients:
## (Intercept) SATV
## 12.03133 0.02724
##
## ---------------------------------------------------------------------------
## sat.act$gender: 2
##
## Call:
## lm(formula = ACT ~ SATV, data = x)
##
## Coefficients:
## (Intercept) SATV
## 14.9214 0.0221
text(220,17,"males")
text(220,22,"females")
Sometimes we want to take a categorical variable in which the categories are arbitrary (e.g., college major, ways of coping) and create a dummy code that has the value of 1 for people in that category and 0 for everyone else. Clearly we need at least 3 categories for this to make sense.
In the following, we will take the education variable from the spi data set, convert the education levels into 0/1 dummy codes, and then find those items that most correlate with each level of education.
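Before looking at the internals, here is a tiny illustration of dummy.code on a made-up vector (the categories are purely hypothetical, just to show the form of the output):
dummy.code(c("math", "psych", "math", "biology", "psych")) #returns a 0/1 matrix with one column per category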
We can look at the source code of the dummy.code function.
dummy.code #show the code for the function
## function (x, group = NULL, na.rm = TRUE, top = NULL, min = NULL)
## {
## t <- table(x)
## t <- sort(t, decreasing = TRUE)
## if (!is.null(min)) {
## top <- sum(t >= min)
## }
## if (is.null(top))
## top <- length(t)
## t <- t[1:top]
## lt <- length(t)
## n.obs <- length(x)
## if (is.null(group)) {
## new <- matrix(0, nrow = n.obs, ncol = lt)
## if (na.rm) {
## new[is.na(x), ] <- NA
## }
## xlev <- factor(x, levels = names(t))
## for (i in 1:n.obs) {
## new[i, xlev[i]] <- 1
## }
## colnames(new) <- names(t)
## }
## else {
## new <- rep(0, n.obs)
## xlev <- as.factor(x)
## if (na.rm) {
## new[is.na(x)] <- NA
## }
## for (i in 1:n.obs) {
## new[i] <- xlev[i] %in% group
## }
## }
## return(new)
## }
## <bytecode: 0x10bc3d0a8>
## <environment: namespace:psych>
We can examine each line of this code by trying it.
x <- spi$education
na.rm <- TRUE
t <- table(x) # Call the table function on x
t # we do not do this in the function, but we do it here to examine it
## x
## 1 2 3 4 5 6 7 8
## 394 369 1060 260 123 567 165 392
lt <- length(t) # how many different categories
Then we create a matrix in which to put the results.
n.obs <- length(x)
new <- matrix(0,nrow=n.obs,ncol=lt)
if(na.rm) {new[is.na(x),] <- NA } #added 10/20/17
headTail(new) #show the first and last 4 rows
## X1 X2 X3 X4 X5 X6 X7 X8
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## ... ... ... ... ... ... ... ... ...
## 3997 0 0 0 0 0 0 0 0
## 3998 0 0 0 0 0 0 0 0
## 3999 0 0 0 0 0 0 0 0
## 4000 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
xlev <- as.factor(x) #show the str of this
str(xlev) #because we will be using these to index the matrix
## Factor w/ 8 levels "1","2","3","4",..: NA 6 4 NA 1 4 7 2 8 8 ...
for (i in 1:n.obs) {new[i,xlev[i]] <- 1} #Do the work
headTail(new) #show the first and last 4 rows
## X1 X2 X3 X4 X5 X6 X7 X8
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 0 0 0 0 0 1 0 0
## 3 0 0 0 1 0 0 0 0
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## ... ... ... ... ... ... ... ... ...
## 3997 0 0 1 0 0 0 0 0
## 3998 0 1 0 0 0 0 0 0
## 3999 0 0 0 0 0 1 0 0
## 4000 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
colnames(new) <- names(t) #if we have names for the categories, they appear here
Now try this function on spi$education.
What are the units of education? We can see them by looking at the help file for the spi data set: respondents' education is coded as less than 12 years, HS grad, currently in university, some university, associate degree, college degree, in graduate/professional school, and graduate/professional degree.
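One way to make the numeric codes more readable (assuming the order of the labels in the help file matches the codes 1 through 8) is to attach them as factor labels:
edu.labels <- c("<12 yrs","HS grad","current univ","some univ","assoc degree","college degree","in grad/prof","grad/prof degree") #labels paraphrased from the help file
table(factor(spi$education, levels=1:8, labels=edu.labels)) #same counts as table(spi$education), now labeled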
edu.codes <- dummy.code(spi$education)
dim(edu.codes) #we have created the new variables
## [1] 4000 8
describe(edu.codes) #the means should add up to 1 across all the variables
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 3 1 3330 0.32 0.47 0 0.27 0 0 1 1 0.78 -1.39 0.01
## 6 2 3330 0.17 0.38 0 0.09 0 0 1 1 1.75 1.08 0.01
## 1 3 3330 0.12 0.32 0 0.02 0 0 1 1 2.36 3.58 0.01
## 8 4 3330 0.12 0.32 0 0.02 0 0 1 1 2.37 3.62 0.01
## 2 5 3330 0.11 0.31 0 0.01 0 0 1 1 2.48 4.14 0.01
## 4 6 3330 0.08 0.27 0 0.00 0 0 1 1 3.14 7.89 0.00
## 7 7 3330 0.05 0.22 0 0.00 0 0 1 1 4.15 15.22 0.00
## 5 8 3330 0.04 0.19 0 0.00 0 0 1 1 4.91 22.10 0.00
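As a quick sanity check, the column means of the dummy codes are the proportions of respondents in each category and therefore should sum to 1:
sum(colMeans(edu.codes, na.rm = TRUE)) #should be (very close to) 1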
Another function, bestScales, can use those dummy codes as criteria to find the items that most correlate with each category.
bs.edu <- bestScales(spi[-6],edu.codes,dictionary=spi.dictionary) #don't use education as a predictor
bs.edu
##
## Call = bestScales(x = spi[-6], criteria = edu.codes, dictionary = spi.dictionary)
## The items most correlated with the criteria yield r's of
## correlation n.items
## 3 0.28 1
## 6 0.22 1
## 1 0.35 10
## 8 0.42 6
## 2 0.13 3
## 4 0.16 4
## 7 NA 0
## 5 0.11 1
##
## The best items, their correlations and content are
## $`3`
## 3 item_id item item_scale resp_type B5 L27
## age -0.28 age age in years NA NA NA NA
##
## $`6`
## 6 item_id item item_scale resp_type B5 L27
## age 0.22 age age in years NA NA NA NA
##
## $`1`
## 1 item_id item item_scale resp_type B5
## age -0.33 age age in years NA NA NA
## smoke -0.14 smoke smoking (never ... > 20/day) NA NA NA
## q_578 0.13 q_578 Dislike myself. IPIP reg Neuro
## q_1505 0.13 q_1505 Panic easily. IPIP reg Neuro
## q_4296 0.13 q_4296 Tell a lot of lies. EPQ:P reg Agree
## q_952 0.13 q_952 Get angry easily. IPIP reg NA
## q_4249 0.13 q_4249 Would call myself a nervous person. EPQ:N reg Neuro
## q_501 0.11 q_501 Cheat to get ahead. IPIP reg Agree
## q_35 0.11 q_35 Act without thinking. IPIP reg NA
## q_811 0.11 q_811 Feel a sense of worthlessness or hopelessness. IPIP reg Neuro
## L27
## age NA
## smoke NA
## q_578 WellBeing
## q_1505 Anxiety
## q_4296 Honesty
## q_952 Irritability
## q_4249 Anxiety
## q_501 Honesty
## q_35 Impulsivity
## q_811 WellBeing
##
## $`8`
## 8 item_id item item_scale resp_type B5
## age 0.41 age age in years NA NA NA
## q_1132 0.15 q_1132 Have read the great literary classics. IPIP reg NA
## q_1024 -0.13 q_1024 Hang around doing nothing. IPIP reg NA
## q_422 0.12 q_422 Can handle a lot of information. IPIP reg Open
## q_4249 -0.11 q_4249 Would call myself a nervous person. EPQ:N reg Neuro
## q_1452 -0.10 q_1452 Neglect my duties. IPIP reg Consc
## L27
## age NA
## q_1132 ArtAppreciation
## q_1024 EasyGoingness
## q_422 Intellect
## q_4249 Anxiety
## q_1452 Industry
##
## $`2`
## 2 item_id item item_scale resp_type B5
## age -0.12 age age in years NA NA NA
## q_1694 -0.11 q_1694 Set high standards for myself and others. IPIP reg Consc
## p1edu -0.10 p1edu Parent 1 education NA NA NA
## L27
## age NA
## q_1694 Perfectionism
## p1edu NA
##
## $`4`
## 4 item_id item item_scale resp_type B5 L27
## smoke 0.16 smoke smoking (never ... > 20/day) NA NA NA NA
## age 0.12 age age in years NA NA NA NA
## q_2765 -0.11 q_2765 Am happy with my life. IPIP reg NA WellBeing
## health -0.11 health self rated health (poor ... excellent) NA NA NA NA
##
## $`7`
## [1] 7 y
## <0 rows> (or 0-length row.names)
##
## $`5`
## 5 item_id item item_scale resp_type B5 L27
## age 0.11 age age in years NA NA NA NA
The solution just listed is sensitive to sample variations. We can test how sensitive it is by using the bootstrap. Bootstrapping basically treats the observed sample as the population and then samples from that population multiple times (with replacement). This leads to a number of samples from the 'population' that differ somewhat from each other. By examining how much the results vary, we can get a degree of confidence in our results. We will do the bootstrap 100 times and then examine the pooled results.
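To see the logic of the bootstrap itself (this sketch is separate from bestScales, which handles its own resampling internally), we can resample the rows of a data set with replacement and watch how a simple statistic, such as a correlation, varies across resamples:
set.seed(42) #any seed will do; this is just an illustration
n.iter <- 100
boot.r <- rep(NA, n.iter)
for (i in 1:n.iter) {
  rows <- sample(nrow(sat.act), replace = TRUE) #a bootstrap sample of row indices
  boot.r[i] <- cor(sat.act$SATV[rows], sat.act$ACT[rows], use = "pairwise")
}
quantile(boot.r, c(.025, .975)) #an empirical 95 percent confidence interval for the correlation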
bs.edu <- bestScales(spi[-6],edu.codes,dictionary=spi.dictionary[1:2], n.iter=100)
bs.edu
##
## Call = bestScales(x = spi[-6], criteria = edu.codes, n.iter = 100, dictionary = spi.dictionary[1:2])
## derivation.mean derivation.sd validation.m validation.sd final.valid final.wtd N.wtd
## 3 0.29 0.0162 0.282 0.015 0.28 0.30 10
## 6 0.24 0.0187 0.224 0.021 0.22 0.21 10
## 1 0.36 0.0116 0.340 0.018 0.33 0.31 10
## 8 0.41 0.0168 0.402 0.026 0.42 0.36 10
## 2 0.15 0.0289 0.131 0.022 0.12 0.20 10
## 4 0.17 0.0324 0.149 0.027 0.16 0.21 10
## 7 0.11 0.0085 0.029 0.023 NA 0.14 10
## 5 0.12 0.0120 0.103 0.020 NA 0.14 10
##
## Best items on each scale with counts of replications
##
## Criterion = 3
## Freq mean.r sd.r item_id item
## age 100 -0.27 0.01 age age in years
##
## Criterion = 6
## Freq mean.r sd.r item_id item
## age 100 0.22 0.02 age age in years
##
## Criterion = 1
## Freq mean.r sd.r item_id item
## age 100 -0.33 0.01 age age in years
##
## Criterion = 8
## Freq mean.r sd.r item_id item
## age 100 0.41 0.02 age age in years
## q_1132 100 0.15 0.02 q_1132 Have read the great literary classics.
## q_1024 90 -0.12 0.02 q_1024 Hang around doing nothing.
##
## Criterion = 2
## Freq mean.r sd.r item_id item
## age 95 -0.12 0.01 age age in years
##
## Criterion = 4
## Freq mean.r sd.r item_id item
## smoke 100 0.16 0.02 smoke smoking (never ... > 20/day)
##
## Criterion = 7
## Freq mean.r sd.r item_id item
## q_871 8 -0.08 0.02 q_871 Feel that most people cant be trusted.
##
## Criterion = 5
## Freq mean.r sd.r item_id item
## age 67 0.11 0.02 age age in years