Recent upgrade to psych

psych has been updated to 2.0.4. Get the latest release from the pmc server. (It has been submitted to CRAN but is not there yet.)

The next line (to install the latest release) has been commented out because I have already updated my version. However, you should do this once.

#install.packages("psych",repos="http://personality-project.org/r",type="source")
library(psych) #make it active
library(psychTools)  #this one as well
sessionInfo()   #make sure it is there
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] psychTools_2.0.1 psych_2.0.4     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1      lattice_0.20-38 digest_0.6.15   rprojroot_1.3-2 grid_3.6.3      nlme_3.1-144   
##  [7] backports_1.1.3 magrittr_1.5    evaluate_0.10.1 stringi_1.3.1   rmarkdown_1.9   tools_3.6.3    
## [13] foreign_0.8-75  stringr_1.4.0   yaml_2.1.18     parallel_3.6.3  compiler_3.6.3  mnormt_1.5-5   
## [19] htmltools_0.3.6 knitr_1.20
news(Version >  "2.0.0",package="psych") #read the news for this release
##                  Changes in version 2.0.3 (2020-04-30)                  
## 
## Additions
## 
##     o   Added crossValidation to be compatible with setCor and
##  bestItems
## 
##     o   Added the weights option to statsBy (requested by Joshua
##  Tuttle) - still needs to be worked on for weights within groups
## 
##     o   Minor fix to error.dots to allow more control on the figures
##  and to make sure that we are calling psych::describe
## 
##     o   Now reports the keys used in scoreItems
## 
##     o   Improved faLookup to handle bassAckward output
## 
##     o   Improved the print and summary from mediate to include the
##  intercept term.
## 
##     o   Improved bassAckward.diagram to allow labeling of higher level
##  factors.
## 
## Bugs fixed
## 
##     o   Fixed summary for setCor to properly print betas (coefficients)
##  Error introduced in 1.9.12.31 (reported by Hans Marius Norbom)
## 
##     o   Fixed bug in bassAckward for orthogonal rotations (reported by
##  Eva Billen )
## 
##     o   Fixed bug in fa.congruence for case of factor loadings provided
##  by user rather than a psych function (reported by Lisa
##  Debruine)
## 
##     o   Changed the way to find the pooled correlation in statsBy by
##  finding the average fisherZ transformed correlation (back
##  transformed to r) rather than just using the average weighted
##  r.
## 
##     o   error.bars (which was calling error.bar.by) was not passing all
##  the parameters to error.bars.by.  (Reported by Lorien Elleman).
## 
##     o   errorCircles was not plotting error circles if there were NAs
##  in any of the descriptive statistics
## 
##     o   Fixed the equamax call to cfT to include kappa in the call
##  (reported by David Condon and Gerard Saucier)
## 
##                  Changes in version 2.0.1 (2020-01-13)                  
## 
## Introduction
## 
##     o   Version 1.9.12.xx is the development release of the psych
##  package. It is available as a source file for Macs or PCs in
##  the repository at <URL: http://personality-project.org/r>. The
##  released version on CRAN is 1.9.12.  The second digit reflects
##  the year (i.e., 2018), the third set the month (i.e., 1.8.3 was
##  released in March of 2018, the last two digits of development
##  versions reflect either an minor change or the day of any
##  modifications, e.g. 1.8.3.3 was the third attempt to get 1.8.3
##  released.  1.7.8 was released in August, 2017.
## 
##     o   To install this development version, use the command:
##  install.packages("psych",
##  repos="http://personality-project.org/r", type="source").
##  Remember to restart R and library(psych) to make the new
##  version active.
## 
##     o   The psych package includes functions and data sets to do
##  classic and modern psychometrics and to analyze personality and
##  experimental psychological data sets. The psych package has
##  been developed as a supplement to courses in research methods
##  in psychology, personality research, and graduate level
##  psychometric theory. The functions are a supplement to the text
##  (in progress): An introduction to psychometric theory with
##  applications in R.
## 
##     o   Additional functions are added sporadically.
## 
##     o   This news file reports changes that have been made as the
##  package has been developed.
## 
##     o   To report bugs, send email to <URL:
##  mailto:revelle@northwestern.edu> using bug.report.
## 
## To do
## 
##     o   Suggestions are welcome, but the current list includes the
##  following(and has included for a long time, so lets be
##  patient):
## 
##     o   Add confirmatory clustering to ICLUST
## 
##     o   Get cluster scores in ICLUST - analogous to factor scores
##  (requested by Ben Shalet)
## 
##     o   Add the ability to create multiple groups in sim.irt functions
## 
##     o   Find canonical loadings in set.cor
## 
##     o   Add omega factor extension figure option (requested by Sylia
##  Wilson)
## 
##     o   Add option to do subject density and item density plot to IRT
##  plot.
## 
##     o   add the ability to scale radar plots of raw data from min to
##  max, and add a scale to radar and spider plots (e.g. circular
##  histograms)
## 
##     o   add the ability to add labels to lavaan.diagram
## 
##     o   clean up lavaan diagram so that the output is more readable
## 
##     o   Add the ability to return the true scores for subjects when
##  simulating structures.  This will help fitting reliability
##  models but will require not using mvrnorm
## 
##     o   To help those who want to simulate a bifactor model, make it
##  explicit
## 
##     o   Check the bias in bootstrap resampling using cor.ci
## 
##     o   Start to use the drop=FALSE when doing subsetting (doing this
##  more and more)
## 
##     o   add symmetric=TRUE to eigen calls, think about only.values=TRUE
##  for some cases
## 
##     o   Add an analysis if DIF to the irt functions (requested by David
##  Condon)
## 
##     o   Add some power functions
## 
##     o   Add CIs for means and skews as an option (for describe as well
##  as describeBy) Requested by Aaron Wichman
## 
##     o   Fix statsBy to handle the case of NULL groups.  (particularly
##  for the cors=TRUE option).
## 
##     o   Add a function for the Meng, Rosenthal, Rubin tests for
##  multiple comparisons of correlations.
## 
##     o   Clean up the various sim functions so that they are better
##  documented
## 
##     o   Clean up documentation of score.irt etc.  (partly done)
## 
##     o   Add complexity to omega solution
## 
##     o   Add statistic to fa to report max fa. (partly done with fm =
##  minrank)
## 
##     o   Parallize iterations for confidence intervals in omega
## 
##     o   Probably should drop much of the stats when doing iterations
##  with fa or omega
## 
##     o   Find unbiased cis when doing bootstrap
## 
##     o   Allow the choice between regression vs. correlation in esem
##  between X and Y sets.
## 
##     o   Allow for different rotation options in X and Y sets in esem,
##  allow omega style hierarchical structure for x (and y)
## 
##     o   Fix bug in esem.diagram for ability- bifactor 3,3
##  simple=FALSE,cut=.2
## 
##     o   Improve documentation for esem and cosinor
## 
##     o   Add interbattery diagram to the structure.
## 
##     o   fix bug in lavaan.diagram for lr=FALSE
## 
##     o   Add Krippendorf's alpha?
## 
##     o   Add the ability to weight factor residuals to allow WLMS
## 
##     o   Add the ability to get factor scores for higher order factors
##  (requested by Brandon Vaughn)
## 
##     o   Create a reliabilities function to combine omega and alpha and
##  splitHalf
## 
##     o   Add the Hull method for number of factors
## 
##     o   Add an option for the correlation type in bestScales (requested
##  by Lorien Elleman).
## 
##     o   Change help file examples so they are cleaner for people who
##  example(fn)
## 
##     o   Make the main functions more "tidy" in terms of their output
## 
##     o   Add testthat functionality
## 
##     o   Add parse capabilities to error.bars, error.dots, spider, etc.
##  so that grouping variables can be specified in formula mode.
## 
##     o   Consider creating a new package (psych.tools) to add some of
##  the useful utilities.
## 
##     o   Possible bug in cohen.kappa confidence intervals (reported by
##  Marco Fornili)
## 
##     o   Need to add documentation to mediate/moderate/setCor for what
##  dfs are used and why.
## 
##     o   Examine the stats::anova, stats::summary.lm, stats::anova.lm
##  for hints on how to do these anova comparisons
## 
##     o   Add an n.iter option to pca (requested by Michael Wood)
## 
##     o   There is a bug in bestItems for the case of no iterations and
##  dictionary = something
## 
##     o   Fix bug in fa.lookup for omega g values that are negative
## 
##     o   Fix bug in bestScales that does not report items if not
##  iterating
## 
##     o   Adjust stats in fa.extend to report the raw and then the
##  extended values
## 
##     o   Improve documentation for extension diagram to clarify that it
##  works with fa.results
## 
##     o   Add s.e. to intercept in set cor (requested by Eric Youngstrom)
## 
##     o   Add just check variables specified for being numeric (instead
##  of entire data.frame) (suggested by Fransisco Wilheim)
## 
##     o   Pretty up the output from mediate (perhaps make a mediate2latex
##  function
## 
##     o   There is a bug in setCor for finding the intercepts
## 
##     o   Add a anova.test for two setCor objects
## 
##     o   Add formula input to densityBy
## 
##     o   Improve the print function for bassAckwards
## 
##     o   Cross reference fa.multi with other functions
## 
##     o   Problem with scoreIrt.1pl with keys of length 2
## 
##     o   Change examples to dontrun rather than commented out
## 
##     o   Minor change in msqR help file
## 
##     o   Minor change to summary for testRetest
## 
##     o   add formula input to densityBy and violinBy
## 
##     o   Check the raw df in esem
## 
##     o   Add RMSEA, TFI, etc. to esem (requested by Gabe Orona )
## 
##     o   Add formula mode to violinBy and the errorbarsBy, etc.
## 
##     o   Add formula mode to lowerCor and lowerMat (to be able to choose
##  a few variables in a data set)
## 
##     o   Need to fix bug in alpha where it finds R from cov2cor instead
##  of directly (reported by Karl Ove Hufthammer )
## 
##     o   Look at ICC3 in responde to query from Fred Oswald
## 
##     o   Add a random restart option for various rotation solutions
## 
##     o   Fix bug in fa or fa.stats reported by Kyle Husman (something is
##  wrong with promax solutions)
## 
##     o   Fix passing parameter to Equamax (reported by Erich Studerus)
## 
##     o   Add a warning when doing omegaSem with one factor (or at least
##  make it work - reported by Erich Studerus)
## 
##     o   Fix a bug in testRetest (reported by Miquel Biron)
## 
##     o   Fix a bug in setCor summary function (reported by Hans Marius
##  Norbom)
## 
##     o   Fix a bug in alpha when finding averager R (need to use cor
##  rather than cov2cor when treating missing data)
## 
##     o   Fix alpha and scoreItems to think more carefully about what to
##  do if an item has no variance.  Score it anyway, just don't use
##  it for item statistics. (raised by Eric Green).
## 
##     o   Fix a bug in polychoric (reported with suggested patch by Ben
##  Mainwaring.)
## 
##     o   Fix a bug in testRetest (reported by Miguel Biron-Lattes )
## 
##     o   Improve the help file for omegaFromSem and the error handling
##  to quit nicely when asked for omega of a single factor.
## 
##     o   Add weight option to statsBy (but the question is how to handle
##  missing data)

The linear model

There are many forms of the linear model.

\[\hat{y} = b_1x + e\] is the classic regression model. If x is a dichotomous variable, this is just a t-test. If x is categorical, then we can do this as an analysis of variance. In this case, we think of x causing y. The slope (\(b_1\)) is the covariance of x and y divided by the variance of x.

\[b_1 = \frac{cov_{xy}}{var_x}\]

\[= \frac{\sigma_{xy}}{\sigma_x^2} \]
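To make the formula concrete, here is a minimal numeric sketch using the sat.act data set from the psych package (which we return to below): the slope from lm should match the covariance divided by the variance.

xy <- na.omit(sat.act[,c("SATV","ACT")])  #complete cases of one predictor and one DV
cov(xy$SATV, xy$ACT)/var(xy$SATV)         #b1 = cov_xy / var_x
coef(lm(ACT ~ SATV, data=xy))["SATV"]     #the same slope from lm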

If we think of y causing x, this becomes:

\[\hat{x} = b_1y + e\] That is, y causes x.

If we are unsure of the direction of causality, we can take the geometric mean of the two regression slopes, which gives

\[r_{xy} = \frac{cov_{xy}}{\sqrt{\sigma^2_x \sigma^2_y}} = cov_{z_x z_y}\] which is the ratio of the covariance between x and y to the square root of the product of the two variances or, alternatively, the covariance of the standard scores.
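Again as a small sketch (using the same sat.act variables), the correlation can be computed either way:

xy <- na.omit(sat.act[,c("SATV","ACT")])              #complete cases
cov(xy$SATV, xy$ACT)/sqrt(var(xy$SATV)*var(xy$ACT))   #cov / sqrt of the product of the variances
cov(scale(xy$SATV), scale(xy$ACT))                    #the covariance of the standard (z) scores
cor(xy$SATV, xy$ACT)                                  #the same value, directly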

If we have more than one predictor, we have multiple regression:

\[\hat{y} = b_1 x_1 + b_2 x_2 + e\] If \(x_1\) and \(x_2\) are categorical, this is also known as an analysis of variance.

A path model representation of regression and multiple regression

First, just show the npk dataset from core R. Can you see any relationships here?

npk   #The data come from Venables
##    block N P K yield
## 1      1 0 1 1  49.5
## 2      1 1 1 0  62.8
## 3      1 0 0 0  46.8
## 4      1 1 0 1  57.0
## 5      2 1 0 0  59.8
## 6      2 1 1 1  58.5
## 7      2 0 0 1  55.5
## 8      2 0 1 0  56.0
## 9      3 0 1 0  62.8
## 10     3 1 1 1  55.8
## 11     3 1 0 0  69.5
## 12     3 0 0 1  55.0
## 13     4 1 0 0  62.0
## 14     4 1 1 1  48.8
## 15     4 0 0 1  45.5
## 16     4 0 1 0  44.2
## 17     5 1 1 0  52.0
## 18     5 0 0 0  51.5
## 19     5 1 0 1  49.8
## 20     5 0 1 1  48.8
## 21     6 1 0 1  57.2
## 22     6 1 1 0  59.0
## 23     6 0 1 1  53.2
## 24     6 0 0 0  56.0

As we do for any data set, we want to describe it and perhaps examine the pairwise relationships.

describe(npk) 
##        vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## block*    1 24  3.50 1.74   3.50    3.50 2.22  1.0  6.0   5.0 0.00    -1.41 0.36
## N*        2 24  1.50 0.51   1.50    1.50 0.74  1.0  2.0   1.0 0.00    -2.08 0.10
## P*        3 24  1.50 0.51   1.50    1.50 0.74  1.0  2.0   1.0 0.00    -2.08 0.10
## K*        4 24  1.50 0.51   1.50    1.50 0.74  1.0  2.0   1.0 0.00    -2.08 0.10
## yield     5 24 54.88 6.17  55.65   54.75 6.15 44.2 69.5  25.3 0.24    -0.51 1.26
pairs.panels(npk)

When we describe the data, the variables marked with an * are not numeric (they are actually Factors with levels) but are treated by describe as if they were numeric. We can examine the str (structure) of the data to see this.

The pairs.panels function automatically converts Factor or Character values to numeric.

str(npk)
## 'data.frame':    24 obs. of  5 variables:
##  $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
##  $ P    : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
##  $ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
##  $ yield: num  49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...

Examining the correlations of the npk data, we see that there are four experimental variables (IVs) and one dependent variable (DV). Because this is a balanced experiment, the correlations among the IVs are all exactly zero, while each IV has a different relationship with the DV.

Standardized regressions are just correlations. We can show the relationships as a path model using the setCor function. By default, we show the standardized solution, which expresses everything as correlations (left panel). However, we may also specify that we do not want standardized variables (right panel) in order to see the raw variance components.

par(mfrow=c(1,2))  #set up a two panel graph
mod0 <- setCor(yield ~ N + P + K, data=npk,main="Standardized Regression")#show this as a figure
mod0 <- setCor(yield ~ N + P + K, data=npk,std=FALSE,main="Unstandardized Regression") #do not standardize

par(mfrow=c(1,1)) #set it back to a one panel graph

The t-test examines one variable at a time

If the IV is categorical and has just two levels, we can examine the effect of the IV on the DV by doing a t-test (using t.test).

t.test(yield ~ K, data = npk)
## 
##  Welch Two Sample t-test
## 
## data:  yield by K
## t = 1.6374, df = 17.621, p-value = 0.1193
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.135414  9.102081
## sample estimates:
## mean in group 0 mean in group 1 
##        56.86667        52.88333

The t-test is a special case of the linear model (in this case the X variable has just two values). We can do the same analysis as the t-test but use the linear model function lm.

model <- lm(yield ~ K, data = npk)
summary(model)
## 
## Call:
## lm(formula = yield ~ K, data = npk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.667  -4.083   1.217   4.167  12.633 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   56.867      1.720  33.059   <2e-16 ***
## K1            -3.983      2.433  -1.637    0.116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.959 on 22 degrees of freedom
## Multiple R-squared:  0.1086, Adjusted R-squared:  0.06812 
## F-statistic: 2.681 on 1 and 22 DF,  p-value: 0.1158

Graphic displays

There are several ways of showing this result. We can draw error bars by group using error.bars.by.

error.bars.by(yield ~ K,data=npk)

Or we can use the setCor function to do the linear model and draw the graph. Unfortunately, there is a bug (just discovered) in setCor for the case of a single IV, so we need to do a workaround. Note that the t statistic given for the regression slope is the same as the t statistic from the ordinary t-test.

tempR <- lowerCor(levels2numeric(npk))  #convert to numeric
##       block N     P     K     yield
## block  1.00                        
## N      0.00  1.00                  
## P      0.00  0.00  1.00            
## K      0.00  0.00  0.00  1.00      
## yield -0.16  0.46 -0.10 -0.33  1.00
setCor(yield ~ K, data = tempR, n.obs=24)  #do the regression from the correlations

## Call: setCor(y = yield ~ K, data = tempR, n.obs = 24)
## 
## Multiple Regression from matrix input 
## 
##  DV =  yield 
##   slope  se     t    p lower.ci upper.ci VIF
## K -0.33 0.2 -1.64 0.12    -0.75     0.09   1
## 
## Residual Standard Error =  0.97  with  22  degrees of freedom
## 
##  Multiple Regression
##          R   R2  Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2     p
## yield 0.33 0.11 0.33 0.11        0.07      0.1      2.68   1  22 0.116

Analysis of variance and the linear model

Analysis of variance is a special case of the linear model. We can see this when we compare two anovas with two multiple regressions. The first uses just a single Independent Variable (IV), while the second uses all three predictors. The last analysis (using setCor) does the same as lm but draws a path model as well.

mod1 <- aov(yield ~ N,data=npk)
lm1 <- lm(yield ~ N, data=npk)
summary(mod1)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## N            1  189.3  189.28   6.061 0.0221 *
## Residuals   22  687.1   31.23                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm1) #identical to the anova (in terms of p values, ts are sqrts of Fs)
## 
## Call:
## lm(formula = yield ~ N, data = npk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8833 -3.7667  0.1667  3.5583 11.8167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   52.067      1.613  32.274   <2e-16 ***
## N1             5.617      2.281   2.462   0.0221 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.588 on 22 degrees of freedom
## Multiple R-squared:  0.216,  Adjusted R-squared:  0.1803 
## F-statistic: 6.061 on 1 and 22 DF,  p-value: 0.02213
mod2 <- aov(yield ~ N + P + K,data=npk)
lm2 <- lm(yield ~ N + P + K,data=npk)
summary(mod2)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## N            1  189.3  189.28   6.488 0.0192 *
## P            1    8.4    8.40   0.288 0.5974  
## K            1   95.2   95.20   3.263 0.0859 .
## Residuals   20  583.5   29.17                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm2) #also identical in terms of p values
## 
## Call:
## lm(formula = yield ~ N + P + K, data = npk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2667 -3.6542  0.7083  3.4792  9.3333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   54.650      2.205  24.784   <2e-16 ***
## N1             5.617      2.205   2.547   0.0192 *  
## P1            -1.183      2.205  -0.537   0.5974    
## K1            -3.983      2.205  -1.806   0.0859 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.401 on 20 degrees of freedom
## Multiple R-squared:  0.3342, Adjusted R-squared:  0.2343 
## F-statistic: 3.346 on 3 and 20 DF,  p-value: 0.0397
#now do the analysis but draw a path model as well
setCor(yield ~ N + P + K,data=npk,main = "Regression Model using setCor")

## Call: setCor(y = yield ~ N + P + K, data = npk, main = "Regression Model using setCor")
## 
## Multiple Regression from raw data 
## 
##  DV =  yield 
##             slope   se     t     p lower.ci upper.ci VIF
## (Intercept)  0.00 0.18  0.00 1.000    -0.38     0.38   1
## N            0.46 0.18  2.55 0.019     0.08     0.85   1
## P           -0.10 0.18 -0.54 0.600    -0.48     0.28   1
## K           -0.33 0.18 -1.81 0.086    -0.71     0.05   1
## 
## Residual Standard Error =  0.88  with  20  degrees of freedom
## 
##  Multiple Regression
##          R   R2  Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2      p
## yield 0.58 0.33 0.52 0.27        0.23     0.12      3.35   3  20 0.0397

Note that although the output looks very different, the t values of the path coefficients are the same as the square roots of the F values in the aov model.
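As a quick sketch of this claim, we can square the t values from the lm fit and compare them with the F column of the aov table above (they agree here because the design is balanced):

round(summary(lm2)$coefficients[-1,"t value"]^2, 2)  #squared t values for N, P, K; compare with summary(mod2)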

Regression with continuous variables

Regression is most appropriate with continuous predictors. We do this for one predictor (SATV) and one DV (ACT). We show the data points, fit the best fitting linear model, and then summarize the model.

plot(ACT ~ SATV, data= sat.act)  #show the points
abline(lm(ACT ~ SATV, data=sat.act)) #show the best fitting line

summary(lm(ACT ~ SATV, data=sat.act)) #summarize the results
## 
## Call:
## lm(formula = ACT ~ SATV, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.666  -2.454  -0.031   2.548  16.334 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.871813   0.833334   16.65   <2e-16 ***
## SATV         0.023970   0.001339   17.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.996 on 698 degrees of freedom
## Multiple R-squared:  0.3148, Adjusted R-squared:  0.3138 
## F-statistic: 320.7 on 1 and 698 DF,  p-value: < 2.2e-16

But typically we have more predictors. Although showing the individual regressions is possible (and we do so as a two panel graph), we can also summarize them graphically with a path model, as we did earlier. When doing regressions, the slopes are in the units of the predictor (e.g., SATV) and are sometimes hard to interpret. Thus, we can express the slopes as standardized values to make them more understandable.

par(mfrow=c(1,2))  #create a two panel graph
plot(ACT ~ SATV, data= sat.act)  #show the points
abline(lm(ACT ~ SATV, data=sat.act)) 
plot(ACT ~ SATQ, data= sat.act)  #show the points
abline(lm(ACT ~ SATQ, data=sat.act))

#do another 2 panel graph.  We suppress the actual output
mod3 <- setCor(ACT ~ SATV + SATQ, data=sat.act,main="Standardized (Correlations)")
mod4 <- setCor(ACT ~ SATV + SATQ, data=sat.act, std=FALSE,main="unstandardized")

par(mfrow=c(1,1)) #convert back to the normal 1 panel graph
summary(lm(ACT ~ SATV + SATQ, data=sat.act))
## 
## Call:
## lm(formula = ACT ~ SATV + SATQ, data = sat.act)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6795  -2.3076   0.0831   2.1924  18.5489 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.565946   0.854058  12.371  < 2e-16 ***
## SATV         0.013284   0.001649   8.054 3.56e-15 ***
## SATQ         0.016142   0.001616   9.990  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.743 on 684 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.4015, Adjusted R-squared:  0.3997 
## F-statistic: 229.4 on 2 and 684 DF,  p-value: < 2.2e-16
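As a cross-check of the standardized solution that setCor drew above (mod3), here is a minimal sketch: if we standardize the variables first, the raw lm slopes become the standardized weights. The values should agree closely; small differences can arise from how missing data are handled.

z.sat <- data.frame(scale(sat.act[,c("ACT","SATV","SATQ")]))  #z score the three variables
round(coef(lm(ACT ~ SATV + SATQ, data=z.sat)), 2)             #standardized slopes; compare with mod3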

Moderated multiple regression

An important question is whether one variable affects the relationship between another IV and the DV. In analysis of variance terms, this is asking whether there is an interaction. The standard anova approach does this simply by including the interaction term (shown as a product in the formula). But when we try this with regression, things get more complicated.

The standard anova with an interaction term

mod2 <- aov(yield ~ N * P * K,data = npk)
summary(mod2)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## N            1  189.3  189.28   6.161 0.0245 *
## P            1    8.4    8.40   0.273 0.6082  
## K            1   95.2   95.20   3.099 0.0975 .
## N:P          1   21.3   21.28   0.693 0.4175  
## N:K          1   33.1   33.14   1.078 0.3145  
## P:K          1    0.5    0.48   0.016 0.9019  
## N:P:K        1   37.0   37.00   1.204 0.2887  
## Residuals   16  491.6   30.72                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But when we do this as a linear model, it doesn’t match! Compare the p values. Only the 3-way interaction is correct.

summary(lm(yield ~ N * P * K,data = npk))
## 
## Call:
## lm(formula = yield ~ N * P * K, data = npk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.133  -4.133   1.250   3.125   8.467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51.4333     3.2002  16.072  2.7e-11 ***
## N1           12.3333     4.5258   2.725    0.015 *  
## P1            2.9000     4.5258   0.641    0.531    
## K1            0.5667     4.5258   0.125    0.902    
## N1:P1        -8.7333     6.4004  -1.365    0.191    
## N1:K1        -9.6667     6.4004  -1.510    0.150    
## P1:K1        -4.4000     6.4004  -0.687    0.502    
## N1:P1:K1      9.9333     9.0515   1.097    0.289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.543 on 16 degrees of freedom
## Multiple R-squared:  0.4391, Adjusted R-squared:  0.1937 
## F-statistic: 1.789 on 7 and 16 DF,  p-value: 0.1586

Show the problem in terms of product terms

The problem is that the product terms are correlated with the main effects. We can show this by forming the product terms. Although tedious, it is possible to find the 4 product terms. We do this, and then put the results into a data.frame which we then plot. Note how the product terms are correlated with their elements. This makes sense, but is wrong if we want to do the modeling.

First we create a new data.frame (NPK) which is made up of the numeric equivalents of what was in npk.

NPK <- char2numeric(npk)
describe(npk)  #the original data are not technically numeric, they are factors
##        vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## block*    1 24  3.50 1.74   3.50    3.50 2.22  1.0  6.0   5.0 0.00    -1.41 0.36
## N*        2 24  1.50 0.51   1.50    1.50 0.74  1.0  2.0   1.0 0.00    -2.08 0.10
## P*        3 24  1.50 0.51   1.50    1.50 0.74  1.0  2.0   1.0 0.00    -2.08 0.10
## K*        4 24  1.50 0.51   1.50    1.50 0.74  1.0  2.0   1.0 0.00    -2.08 0.10
## yield     5 24 54.88 6.17  55.65   54.75 6.15 44.2 69.5  25.3 0.24    -0.51 1.26
describe(NPK)  #we have converted them to numeric
##       vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## block    1 24  3.50 1.74   3.50    3.50 2.22  1.0  6.0   5.0 0.00    -1.41 0.36
## N        2 24  1.50 0.51   1.50    1.50 0.74  1.0  2.0   1.0 0.00    -2.08 0.10
## P        3 24  1.50 0.51   1.50    1.50 0.74  1.0  2.0   1.0 0.00    -2.08 0.10
## K        4 24  1.50 0.51   1.50    1.50 0.74  1.0  2.0   1.0 0.00    -2.08 0.10
## yield    5 24 54.88 6.17  55.65   54.75 6.15 44.2 69.5  25.3 0.24    -0.51 1.26

Now show the correlations of the product terms and the main effects. We use the gap parameter in pairs.panels to show a slightly more compact graphic.

 NP <- NPK$N * NPK$P
 NK <- NPK$N * NPK$K
 PK <- NPK$P * NPK$K
 PKN <- PK * NPK$N
 NPK.prods <- data.frame(NPK,NP,NK,PK,PKN)
pairs.panels(NPK.prods,gap=0)  #show the correlations, tighten up the figure

The simple products are correlated with each other and with the basic factors. To get around this problem, we center the data using the scale function. scale will center the data and, if scale=TRUE (the default), will also standardize it (divide by the standard deviation). Since we only want to center, we set scale=FALSE. We also need to convert the output of the scale function from a matrix to a data.frame.

 centered.NPK <- scale(NPK,scale=FALSE)              #center, but do not standardize (this returns a matrix)
 centered.NPK <- data.frame(scale(NPK,scale=FALSE))  #do it again, converting the result to a data.frame
 c.NP <- centered.NPK$N * centered.NPK$P
 c.NK <- centered.NPK$N * centered.NPK$K
 c.PK <- centered.NPK$P * centered.NPK$K
 c.PKN <- c.PK * centered.NPK$N
 center.prod <- data.frame(centered.NPK,c.NP,c.NK,c.PK,c.PKN)
 pairs.panels(center.prod) #show the results grapically

 lowerCor(center.prod)  #show it a correlation matrix as well
##       block N     P     K     yield c.NP  c.NK  c.PK  c.PKN
## block  1.00                                                
## N      0.00  1.00                                          
## P      0.00  0.00  1.00                                    
## K      0.00  0.00  0.00  1.00                              
## yield -0.16  0.46 -0.10 -0.33  1.00                        
## c.NP   0.00  0.00  0.00  0.00 -0.16  1.00                  
## c.NK   0.00  0.00  0.00  0.00 -0.19  0.00  1.00            
## c.PK   0.00  0.00  0.00  0.00  0.02  0.00  0.00  1.00      
## c.PKN -0.29  0.00  0.00  0.00  0.21  0.00  0.00  0.00  1.00

Note the products are now uncorrelated. If we now use the linear model function, it agrees with our anova output.

lm.mod <- lm(yield ~ N * P * K, data = centered.NPK) #do the analysis on the centered data
summary(lm.mod)  #show the moderated multiple regression
## 
## Call:
## lm(formula = yield ~ N * P * K, data = centered.NPK)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.133  -4.133   1.250   3.125   8.467 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.269e-15  1.131e+00   0.000   1.0000  
## N            5.617e+00  2.263e+00   2.482   0.0245 *
## P           -1.183e+00  2.263e+00  -0.523   0.6082  
## K           -3.983e+00  2.263e+00  -1.760   0.0975 .
## N:P         -3.767e+00  4.526e+00  -0.832   0.4175  
## N:K         -4.700e+00  4.526e+00  -1.038   0.3145  
## P:K          5.667e-01  4.526e+00   0.125   0.9019  
## N:P:K        9.933e+00  9.052e+00   1.097   0.2887  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.543 on 16 degrees of freedom
## Multiple R-squared:  0.4391, Adjusted R-squared:  0.1937 
## F-statistic: 1.789 on 7 and 16 DF,  p-value: 0.1586
summary(mod2) #compare with the original anova output
##             Df Sum Sq Mean Sq F value Pr(>F)  
## N            1  189.3  189.28   6.161 0.0245 *
## P            1    8.4    8.40   0.273 0.6082  
## K            1   95.2   95.20   3.099 0.0975 .
## N:P          1   21.3   21.28   0.693 0.4175  
## N:K          1   33.1   33.14   1.078 0.3145  
## P:K          1    0.5    0.48   0.016 0.9019  
## N:P:K        1   37.0   37.00   1.204 0.2887  
## Residuals   16  491.6   30.72                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now do this with continuous x1 and a dichotomous x2 and graph the effect

Once again, to do the analysis correctly, we need to center the data.

cen.sat.act <- data.frame(scale(sat.act,scale=FALSE))
lm.mod2 <- lm(ACT ~ gender * SATV , data= cen.sat.act)
summary(lm.mod2)
## 
## Call:
## lm(formula = ACT ~ gender * SATV, data = cen.sat.act)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4788  -2.3916  -0.0332   2.4952  15.6586 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.005219   0.150824  -0.035   0.9724    
## gender      -0.255094   0.315645  -0.808   0.4193    
## SATV         0.023913   0.001337  17.886   <2e-16 ***
## gender:SATV -0.005137   0.002785  -1.844   0.0655 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.99 on 696 degrees of freedom
## Multiple R-squared:  0.3188, Adjusted R-squared:  0.3159 
## F-statistic: 108.6 on 3 and 696 DF,  p-value: < 2.2e-16

There is a suggestion of an interaction between gender and SATV. To see this graphically, we draw the data (using a different plot character for males and females) and then draw two lines, one for males, and one for females. The interaction is a difference in the slopes of these regressions.

plot(ACT ~ SATV, pch= 21-gender, col=c("blue","red")[gender],data=sat.act,main="ACT scores differ by SATV and gender")
abline(lm(ACT ~ SATV,data = sat.act[sat.act$gender==1,]))
abline(lm(ACT ~ SATV,data = sat.act[sat.act$gender==2,]),lty="dashed")

# or another way of doing this is with the by statement
plot(ACT ~ SATV, pch= 21-gender, col=c("blue","red")[gender],data=sat.act,main="ACT scores differ by SATV and gender")
by(sat.act,sat.act$gender, function(x) abline(lm(ACT ~ SATV, data=x),lty=c("solid","dashed")[x$gender]))
## sat.act$gender: 1
## NULL
## --------------------------------------------------------------------------- 
## sat.act$gender: 2
## NULL
by(sat.act,sat.act$gender,function(x) lm(ACT ~ SATV, data=x))  #this tells us which is which
## sat.act$gender: 1
## 
## Call:
## lm(formula = ACT ~ SATV, data = x)
## 
## Coefficients:
## (Intercept)         SATV  
##    12.03133      0.02724  
## 
## --------------------------------------------------------------------------- 
## sat.act$gender: 2
## 
## Call:
## lm(formula = ACT ~ SATV, data = x)
## 
## Coefficients:
## (Intercept)         SATV  
##     14.9214       0.0221
text(220,17,"males")
text(220,22,"females")

Some adventures in coding – dummy.code

Sometimes we want to take a categorical variable in which the categories are arbitrary (e.g., college major, ways of coping) and create a dummy code that has the value of 1 for people in that category and 0 for everyone else. Clearly we need at least 3 categories for this to make sense.
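Before turning to real data, a toy sketch may help. The 'major' vector below is hypothetical and is used only to show what dummy.code returns.

major <- c("psych","econ","bio","psych","bio")  #a made-up categorical variable, for illustration only
dummy.code(major)                               #one 0/1 column per category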

For the following, we will take the education variable from the spi data set, convert the education levels into 0/1 dummy codes, and then find those items that most correlate with each level of education.

But first, let's examine the dummy.code function.

dummy.code  #show the code for the function
## function (x, group = NULL, na.rm = TRUE, top = NULL, min = NULL) 
## {
##     t <- table(x)
##     t <- sort(t, decreasing = TRUE)
##     if (!is.null(min)) {
##         top <- sum(t >= min)
##     }
##     if (is.null(top)) 
##         top <- length(t)
##     t <- t[1:top]
##     lt <- length(t)
##     n.obs <- length(x)
##     if (is.null(group)) {
##         new <- matrix(0, nrow = n.obs, ncol = lt)
##         if (na.rm) {
##             new[is.na(x), ] <- NA
##         }
##         xlev <- factor(x, levels = names(t))
##         for (i in 1:n.obs) {
##             new[i, xlev[i]] <- 1
##         }
##         colnames(new) <- names(t)
##     }
##     else {
##         new <- rep(0, n.obs)
##         xlev <- as.factor(x)
##         if (na.rm) {
##             new[is.na(x)] <- NA
##         }
##         for (i in 1:n.obs) {
##             new[i] <- xlev[i] %in% group
##         }
##     }
##     return(new)
## }
## <bytecode: 0x7fa37df6ef48>
## <environment: namespace:psych>

We can examine each line of this code by trying it.

x <- spi$education
na.rm <- TRUE
t <- table(x)  # Call the table function on x
t   # we do not do this in the function, but we do it here to examine it
## x
##    1    2    3    4    5    6    7    8 
##  394  369 1060  260  123  567  165  392
lt <- length(t)   # how many different categories

Then we create the matrix where we want to put the results.

 n.obs <- length(x)
 new <- matrix(0,nrow=n.obs,ncol=lt)
  if(na.rm) {new[is.na(x),] <- NA } #added 10/20/17
 headTail(new) #show the first  and last 4 rows
##        X1   X2   X3   X4   X5   X6   X7   X8
## 1    <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2       0    0    0    0    0    0    0    0
## 3       0    0    0    0    0    0    0    0
## 4    <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## ...   ...  ...  ...  ...  ...  ...  ...  ...
## 3997    0    0    0    0    0    0    0    0
## 3998    0    0    0    0    0    0    0    0
## 3999    0    0    0    0    0    0    0    0
## 4000 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
 xlev <- as.factor(x)   #show the str of this
 str(xlev)  #because we will be using these to index the matrix
##  Factor w/ 8 levels "1","2","3","4",..: NA 6 4 NA 1 4 7 2 8 8 ...
 for (i in 1:n.obs) {new[i,xlev[i]] <- 1}    #Do the work
  headTail(new) #show the first  and last 4 rows
##        X1   X2   X3   X4   X5   X6   X7   X8
## 1    <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2       0    0    0    0    0    1    0    0
## 3       0    0    0    1    0    0    0    0
## 4    <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## ...   ...  ...  ...  ...  ...  ...  ...  ...
## 3997    0    0    1    0    0    0    0    0
## 3998    0    1    0    0    0    0    0    0
## 3999    0    0    0    0    0    1    0    0
## 4000 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
  colnames(new) <- names(t)    #if we have names for the categories, they appear here

Now try this function on spi$education.

What are the units of education? We can see them by looking at the help file for the spi data set. Respondent's education is coded in eight levels: less than 12, HS grad, current univ, some univ, associate degree, college degree, in grad/prof, grad/prof degree.
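As a convenience (and a sketch only), we can attach labels to the numeric codes, assuming the eight levels listed above map to codes 1 through 8 in that order (check ?spi to confirm the mapping).

edu.labels <- c("less than 12","HS grad","current univ","some univ",
      "associate degree","college degree","in grad/prof","grad/prof degree")  #assumed order
table(factor(spi$education, levels=1:8, labels=edu.labels))  #counts for each labeled level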

edu.codes <- dummy.code(spi$education)
dim(edu.codes)  #we have created the new variables
## [1] 4000    8
describe(edu.codes)   #the means should add up to 1 across all the variables
##   vars    n mean   sd median trimmed mad min max range skew kurtosis   se
## 3    1 3330 0.32 0.47      0    0.27   0   0   1     1 0.78    -1.39 0.01
## 6    2 3330 0.17 0.38      0    0.09   0   0   1     1 1.75     1.08 0.01
## 1    3 3330 0.12 0.32      0    0.02   0   0   1     1 2.36     3.58 0.01
## 8    4 3330 0.12 0.32      0    0.02   0   0   1     1 2.37     3.62 0.01
## 2    5 3330 0.11 0.31      0    0.01   0   0   1     1 2.48     4.14 0.01
## 4    6 3330 0.08 0.27      0    0.00   0   0   1     1 3.14     7.89 0.00
## 7    7 3330 0.05 0.22      0    0.00   0   0   1     1 4.15    15.22 0.00
## 5    8 3330 0.04 0.19      0    0.00   0   0   1     1 4.91    22.10 0.00

Empirical scale construction

Another function, bestScales, can use those codes as criteria to find the items that most correlate with each category.

bs.edu <- bestScales(spi[-6],edu.codes,dictionary=spi.dictionary)  #don't use education as a predictor
bs.edu
## 
## Call = bestScales(x = spi[-6], criteria = edu.codes, dictionary = spi.dictionary)
## The items most correlated with the criteria yield r's of 
##   correlation n.items
## 3        0.28       1
## 6        0.22       1
## 1        0.35      10
## 8        0.42       6
## 2        0.13       3
## 4        0.16       4
## 7          NA       0
## 5        0.11       1
## 
## The best items, their correlations and content  are 
## $`3`
##         3 item_id         item item_scale resp_type
## age -0.28     age age in years       <NA>      <NA>
## 
## $`6`
##        6 item_id         item item_scale resp_type
## age 0.22     age age in years       <NA>      <NA>
## 
## $`1`
##            1 item_id                                           item item_scale resp_type
## age    -0.33     age                                   age in years       <NA>      <NA>
## smoke  -0.14   smoke                   smoking (never ... > 20/day)       <NA>      <NA>
## q_578   0.13   q_578                                Dislike myself.       IPIP       reg
## q_1505  0.13  q_1505                                  Panic easily.       IPIP       reg
## q_4296  0.13  q_4296                            Tell a lot of lies.      EPQ:P       reg
## q_952   0.13   q_952                              Get angry easily.       IPIP       reg
## q_4249  0.13  q_4249            Would call myself a nervous person.      EPQ:N       reg
## q_501   0.11   q_501                            Cheat to get ahead.       IPIP       reg
## q_35    0.11    q_35                          Act without thinking.       IPIP       reg
## q_811   0.11   q_811 Feel a sense of worthlessness or hopelessness.       IPIP       reg
## 
## $`8`
##            8 item_id                                   item item_scale resp_type
## age     0.41     age                           age in years       <NA>      <NA>
## q_1132  0.15  q_1132 Have read the great literary classics.       IPIP       reg
## q_1024 -0.13  q_1024             Hang around doing nothing.       IPIP       reg
## q_422   0.12   q_422       Can handle a lot of information.       IPIP       reg
## q_4249 -0.11  q_4249    Would call myself a nervous person.      EPQ:N       reg
## q_1452 -0.10  q_1452                     Neglect my duties.       IPIP       reg
## 
## $`2`
##            2 item_id                                      item item_scale resp_type
## age    -0.12     age                              age in years       <NA>      <NA>
## q_1694 -0.11  q_1694 Set high standards for myself and others.       IPIP       reg
## p1edu  -0.10   p1edu                        Parent 1 education       <NA>      <NA>
## 
## $`4`
##            4 item_id                                    item item_scale resp_type
## smoke   0.16   smoke            smoking (never ... > 20/day)       <NA>      <NA>
## age     0.12     age                            age in years       <NA>      <NA>
## q_2765 -0.11  q_2765                  Am happy with my life.       IPIP       reg
## health -0.11  health self rated health (poor ... excellent)        <NA>      <NA>
## 
## $`7`
## [1] 7 y
## <0 rows> (or 0-length row.names)
## 
## $`5`
##        5 item_id         item item_scale resp_type
## age 0.11     age age in years       <NA>      <NA>

The need to bootstrap the solution

The solution just listed is sensitive to sample variation. We can test how sensitive it is by using the bootstrap. Bootstrapping basically treats the observed sample as the population and then samples from that population multiple times (with replacement). This leads to a number of samples from the ‘population’ that differ somewhat from one another. By examining how much they vary, we can get a degree of confidence in our results. We will do the bootstrap 100 times and then examine the pooled results.
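For intuition, here is a minimal sketch of what a single bootstrap resample looks like (bestScales does this internally n.iter times); the seed is arbitrary and only makes the resample reproducible.

set.seed(17)                                  #arbitrary seed for reproducibility
boot.rows <- sample(nrow(spi), replace=TRUE)  #draw row indices with replacement
boot.spi <- spi[boot.rows,]                   #a resample of the same size as the original data
dim(boot.spi)                                 #same dimensions as spi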

bs.edu <- bestScales(spi[-6],edu.codes,dictionary=spi.dictionary[1:2], n.iter=100)
bs.edu
## 
## Call = bestScales(x = spi[-6], criteria = edu.codes, n.iter = 100, dictionary = spi.dictionary[1:2])
##   derivation.mean derivation.sd validation.m validation.sd final.valid final.wtd N.wtd
## 3            0.28         0.014        0.280         0.018        0.28      0.30    10
## 6            0.24         0.024        0.229         0.024        0.22      0.21    10
## 1            0.36         0.012        0.341         0.018        0.33      0.32    10
## 8            0.42         0.018        0.396         0.030        0.42      0.36    10
## 2            0.15         0.026        0.130         0.025        0.12      0.21    10
## 4            0.17         0.032        0.144         0.028        0.16      0.22    10
## 7            0.11         0.012        0.030         0.024          NA      0.14    10
## 5            0.12         0.015        0.098         0.023          NA      0.15    10
## 
##  Best items on each scale with counts of replications
## 
##  Criterion =  3 
##     Freq mean.r sd.r item_id         item
## age  100  -0.28 0.01     age age in years
## 
##  Criterion =  6 
##     Freq mean.r sd.r item_id         item
## age  100   0.22 0.02     age age in years
## 
##  Criterion =  1 
##     Freq mean.r sd.r item_id         item
## age  100  -0.33 0.01     age age in years
## 
##  Criterion =  8 
##        Freq mean.r sd.r item_id                                   item
## age     100   0.41 0.02     age                           age in years
## q_1132  100   0.16 0.02  q_1132 Have read the great literary classics.
## q_1024   95  -0.13 0.02  q_1024             Hang around doing nothing.
## 
##  Criterion =  2 
##     Freq mean.r sd.r item_id         item
## age   95  -0.12 0.01     age age in years
## 
##  Criterion =  4 
##       Freq mean.r sd.r item_id                         item
## smoke  100   0.16 0.02   smoke smoking (never ... > 20/day)
## 
##  Criterion =  7 
##       Freq mean.r sd.r item_id                                   item
## q_871    9  -0.08 0.02   q_871 Feel that most people cant be trusted.
## 
##  Criterion =  5 
##     Freq mean.r sd.r item_id         item
## age   73   0.11 0.02     age age in years