The psych package has been updated to 2.0.4. Get the latest release from the pmc server. (It has been submitted to CRAN but is not there yet.)
The next line (to install the latest release) has been commented out because I have already updated my version. However, you should do this once.
#install.packages("psych",repos="http://personality-project.org/r",type="source")
library(psych) #make it active
library(psychTools) #this one as well
sessionInfo() #make sure it is there
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] psychTools_2.0.1 psych_2.0.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 lattice_0.20-38 digest_0.6.15 rprojroot_1.3-2 grid_3.6.3 nlme_3.1-144
## [7] backports_1.1.3 magrittr_1.5 evaluate_0.10.1 stringi_1.3.1 rmarkdown_1.9 tools_3.6.3
## [13] foreign_0.8-75 stringr_1.4.0 yaml_2.1.18 parallel_3.6.3 compiler_3.6.3 mnormt_1.5-5
## [19] htmltools_0.3.6 knitr_1.20
news(Version > "2.0.0",package="psych") #read the news for this release
## Changes in version 2.0.3 (2020-04-30)
##
## Additions
##
## o Added crossValidation to be compatible with setCor and
## bestItems
##
## o Added the weights option to statsBy (requested by Joshua
## Tuttle) - still needs to be worked on for weights within groups
##
## o Minor fix to error.dots to allow more control on the figures
## and to make sure that we are calling psych::describe
##
## o Now reports the keys used in scoreItems
##
## o Improved faLookup to handle bassAckward output
##
## o Improved the print and summary from mediate to include the
## intercept term.
##
## o Improved bassAckward.diagram to allow labeling of higher level
## factors.
##
## Bugs fixed
##
## o Fixed summary for setCor to properly print betas (coefficients)
## Error introduced in 1.9.12.31 (reported by Hans Marius Norbom)
##
## o Fixed bug in bassAckward for orthogonal rotations (reported by
## Eva Billen )
##
## o Fixed bug in fa.congruence for case of factor loadings provided
## by user rather than a psych function (reported by Lisa
## Debruine)
##
## o Changed the way to find the pooled correlation in statsBy by
## finding the average fisherZ transformed correlation (back
## transformed to r) rather than just using the average weighted
## r.
##
## o error.bars (which was calling error.bar.by) was not passing all
## the parameters to error.bars.by. (Reported by Lorien Elleman).
##
## o errorCircles was not plotting error circles if there were NAs
## in any of the descriptive statistics
##
## o Fixed the equamax call to cfT to include kappa in the call
## (reported by David Condon and Gerard Saucier)
##
## Changes in version 2.0.1 (2020-01-13)
##
## Introduction
##
## o Version 1.9.12.xx is the development release of the psych
## package. It is available as a source file for Macs or PCs in
## the repository at <URL: http://personality-project.org/r>. The
## released version on CRAN is 1.9.12. The second digit reflects
## the year (i.e., 2018), the third set the month (i.e., 1.8.3 was
## released in March of 2018, the last two digits of development
## versions reflect either an minor change or the day of any
## modifications, e.g. 1.8.3.3 was the third attempt to get 1.8.3
## released. 1.7.8 was released in August, 2017.
##
## o To install this development version, use the command:
## install.packages("psych",
## repos="http://personality-project.org/r", type="source").
## Remember to restart R and library(psych) to make the new
## version active.
##
## o The psych package includes functions and data sets to do
## classic and modern psychometrics and to analyze personality and
## experimental psychological data sets. The psych package has
## been developed as a supplement to courses in research methods
## in psychology, personality research, and graduate level
## psychometric theory. The functions are a supplement to the text
## (in progress): An introduction to psychometric theory with
## applications in R.
##
## o Additional functions are added sporadically.
##
## o This news file reports changes that have been made as the
## package has been developed.
##
## o To report bugs, send email to <URL:
## mailto:revelle@northwestern.edu> using bug.report.
##
## To do
##
## o Suggestions are welcome, but the current list includes the
## following(and has included for a long time, so lets be
## patient):
##
## o Add confirmatory clustering to ICLUST
##
## o Get cluster scores in ICLUST - analogous to factor scores
## (requested by Ben Shalet)
##
## o Add the ability to create multiple groups in sim.irt functions
##
## o Find canonical loadings in set.cor
##
## o Add omega factor extension figure option (requested by Sylia
## Wilson)
##
## o Add option to do subject density and item density plot to IRT
## plot.
##
## o add the ability to scale radar plots of raw data from min to
## max, and add a scale to radar and spider plots (e.g. circular
## histograms)
##
## o add the ability to add labels to lavaan.diagram
##
## o clean up lavaan diagram so that the output is more readable
##
## o Add the ability to return the true scores for subjects when
## simulating structures. This will help fitting reliability
## models but will require not using mvrnorm
##
## o To help those who want to simulate a bifactor model, make it
## explicit
##
## o Check the bias in bootstrap resampling using cor.ci
##
## o Start to use the drop=FALSE when doing subsetting (doing this
## more and more)
##
## o add symmetric=TRUE to eigen calls, think about only.values=TRUE
## for some cases
##
## o Add an analysis if DIF to the irt functions (requested by David
## Condon)
##
## o Add some power functions
##
## o Add CIs for means and skews as an option (for describe as well
## as describeBy) Requested by Aaron Wichman
##
## o Fix statsBy to handle the case of NULL groups. (particularly
## for the cors=TRUE option).
##
## o Add a function for the Meng, Rosenthal, Rubin tests for
## multiple comparisons of correlations.
##
## o Clean up the various sim functions so that they are better
## documented
##
## o Clean up documentation of score.irt etc. (partly done)
##
## o Add complexity to omega solution
##
## o Add statistic to fa to report max fa. (partly done with fm =
## minrank)
##
## o Parallize iterations for confidence intervals in omega
##
## o Probably should drop much of the stats when doing iterations
## with fa or omega
##
## o Find unbiased cis when doing bootstrap
##
## o Allow the choice between regression vs. correlation in esem
## between X and Y sets.
##
## o Allow for different rotation options in X and Y sets in esem,
## allow omega style hierarchical structure for x (and y)
##
## o Fix bug in esem.diagram for ability- bifactor 3,3
## simple=FALSE,cut=.2
##
## o Improve documentation for esem and cosinor
##
## o Add interbattery diagram to the structure.
##
## o fix bug in lavaan.diagram for lr=FALSE
##
## o Add Krippendorf's alpha?
##
## o Add the ability to weight factor residuals to allow WLMS
##
## o Add the ability to get factor scores for higher order factors
## (requested by Brandon Vaughn)
##
## o Create a reliabilities function to combine omega and alpha and
## splitHalf
##
## o Add the Hull method for number of factors
##
## o Add an option for the correlation type in bestScales (requested
## by Lorien Elleman).
##
## o Change help file examples so they are cleaner for people who
## example(fn)
##
## o Make the main functions more "tidy" in terms of their output
##
## o Add testthat functionality
##
## o Add parse capabilities to error.bars, error.dots, spider, etc.
## so that grouping variables can be specified in formula mode.
##
## o Consider creating a new package (psych.tools) to add some of
## the useful utilities.
##
## o Possible bug in cohen.kappa confidence intervals (reported by
## Marco Fornili)
##
## o Need to add documentation to mediate/moderate/setCor for what
## dfs are used and why.
##
## o Examine the stats::anova, stats::summary.lm, stats::anova.lm
## for hints on how to do these anova comparisons
##
## o Add an n.iter option to pca (requested by Michael Wood)
##
## o There is a bug in bestItems for the case of no iterations and
## dictionary = something
##
## o Fix bug in fa.lookup for omega g values that are negative
##
## o Fix bug in bestScales that does not report items if not
## iterating
##
## o Adjust stats in fa.extend to report the raw and then the
## extended values
##
## o Improve documentation for extension diagram to clarify that it
## works with fa.results
##
## o Add s.e. to intercept in set cor (requested by Eric Youngstrom)
##
## o Add just check variables specified for being numeric (instead
## of entire data.frame) (suggested by Fransisco Wilheim)
##
## o Pretty up the output from mediate (perhaps make a mediate2latex
## function
##
## o There is a bug in setCor for finding the intercepts
##
## o Add a anova.test for two setCor objects
##
## o Add formula input to densityBy
##
## o Improve the print function for bassAckwards
##
## o Cross reference fa.multi with other functions
##
## o Problem with scoreIrt.1pl with keys of length 2
##
## o Change examples to dontrun rather than commented out
##
## o Minor change in msqR help file
##
## o Minor change to summary for testRetest
##
## o add formula input to densityBy and violinBy
##
## o Check the raw df in esem
##
## o Add RMSEA, TFI, etc. to esem (requested by Gabe Orona )
##
## o Add formula mode to violinBy and the errorbarsBy, etc.
##
## o Add formula mode to lowerCor and lowerMat (to be able to choose
## a few variables in a data set)
##
## o Need to fix bug in alpha where it finds R from cov2cor instead
## of directly (reported by Karl Ove Hufthammer )
##
## o Look at ICC3 in responde to query from Fred Oswald
##
## o Add a random restart option for various rotation solutions
##
## o Fix bug in fa or fa.stats reported by Kyle Husman (something is
## wrong with promax solutions)
##
## o Fix passing parameter to Equamax (reported by Erich Studerus)
##
## o Add a warning when doing omegaSem with one factor (or at least
## make it work - reported by Erich Studerus)
##
## o Fix a bug in testRetest (reported by Miquel Biron)
##
## o Fix a bug in setCor summary function (reported by Hans Marius
## Norbom)
##
## o Fix a bug in alpha when finding averager R (need to use cor
## rather than cov2cor when treating missing data)
##
## o Fix alpha and scoreItems to think more carefully about what to
## do if an item has no variance. Score it anyway, just don't use
## it for item statistics. (raised by Eric Green).
##
## o Fix a bug in polychoric (reported with suggested patch by Ben
## Mainwaring.)
##
## o Fix a bug in testRetest (reported by Miguel Biron-Lattes )
##
## o Improve the help file for omegaFromSem and the error handling
## to quit nicely when asked for omega of a single factor.
##
## o Add weight option to statsBy (but the question is how to handle
## missing data)
There are many forms of the linear model.
\[\hat{y} = b_1x + e\] is the classic regression model. If x is a dichotomous variable, this is just a t-test. If x is categorical, we can treat this as an analysis of variance. In this case, we think of x as causing y. The slope (\(b_1\)) is the covariance of x and y divided by the variance of x:
\[b_1 = \frac{cov_{xy}}{var_x}\]
\[= \frac{\sigma_{xy}}{\sigma_x^2} \]
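As a quick numerical check (a sketch using the sat.act data set that comes with psych; SATV and ACT have no missing values), the covariance-over-variance formula reproduces the slope found by lm:
b1 <- cov(sat.act$SATV, sat.act$ACT)/var(sat.act$SATV) #covariance divided by variance
b1
coef(lm(ACT ~ SATV, data=sat.act))["SATV"] #the same slope from the linear model function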
If instead we think of y causing x, this becomes
\[\hat{x} = b_1y + e\] with the slope now being the covariance of x and y divided by the variance of y.
If we are unsure of the direction of causality, we can take the geometric average of the two regressions to find
\[r_{xy} = \frac{cov_{xy}}{\sqrt{\sigma^2_x \sigma^2_y}} = cov_{z_x z_y}\] which is the ratio of the covariance between x and y to the square root of the product of the two variances, or alternatively, the covariance of the standardized (z) scores.
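Again as a sketch with the sat.act data, the correlation is the geometric mean of the two slopes and is also the covariance of the z scores:
b.y.x <- cov(sat.act$SATV, sat.act$ACT)/var(sat.act$SATV) #slope predicting ACT from SATV
b.x.y <- cov(sat.act$SATV, sat.act$ACT)/var(sat.act$ACT)  #slope predicting SATV from ACT
sqrt(b.y.x * b.x.y) #the geometric mean of the two slopes
cor(sat.act$SATV, sat.act$ACT) #the correlation
cov(scale(sat.act$SATV), scale(sat.act$ACT)) #the covariance of the z scores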
If we have more than one predictor, we have multiple regression:
\[\hat{y} = b_1 x_1 + b_2 x_2 + e\] If \(x_1\) and \(x_2\) are categorical, this is also known as an analysis of variance.
First, just show the npk dataset from core R. Can you see any relationships here?
npk #The data come from Venables
## block N P K yield
## 1 1 0 1 1 49.5
## 2 1 1 1 0 62.8
## 3 1 0 0 0 46.8
## 4 1 1 0 1 57.0
## 5 2 1 0 0 59.8
## 6 2 1 1 1 58.5
## 7 2 0 0 1 55.5
## 8 2 0 1 0 56.0
## 9 3 0 1 0 62.8
## 10 3 1 1 1 55.8
## 11 3 1 0 0 69.5
## 12 3 0 0 1 55.0
## 13 4 1 0 0 62.0
## 14 4 1 1 1 48.8
## 15 4 0 0 1 45.5
## 16 4 0 1 0 44.2
## 17 5 1 1 0 52.0
## 18 5 0 0 0 51.5
## 19 5 1 0 1 49.8
## 20 5 0 1 1 48.8
## 21 6 1 0 1 57.2
## 22 6 1 1 0 59.0
## 23 6 0 1 1 53.2
## 24 6 0 0 0 56.0
As we do for any data set, we want to describe it and perhaps examine the pairwise relationships.
describe(npk)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## block* 1 24 3.50 1.74 3.50 3.50 2.22 1.0 6.0 5.0 0.00 -1.41 0.36
## N* 2 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## P* 3 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## K* 4 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## yield 5 24 54.88 6.17 55.65 54.75 6.15 44.2 69.5 25.3 0.24 -0.51 1.26
pairs.panels(npk)
When we describe the data, the variables are marked by an * which indicates that they are not numeric (they are actually Factors with levels), but are treated by describe as numeric. We can examine the str (structure) of the data to see this.
The pairs.panels function automatically converts Factor or Character values to numeric.
str(npk)
## 'data.frame': 24 obs. of 5 variables:
## $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ N : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
## $ P : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
## $ K : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
## $ yield: num 49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
Examining the correlations of the npk data we see that there are four experimental variables (IVs) and one dependent variable (DV). Because this is a balanced experiment, the correlations among the IVs are all exactly zero, and the IVs have different relationships with the DV.
Standardized regressions are just correlations. We can show the relationships as a path model using the setCor function. By default, we show the standardized solution, which expresses everything as correlations (left panel). However, we may also ask for unstandardized variables (right panel) to see the raw variance components.
par(mfrow=c(1,2)) #set up a two panel graph
mod0 <- setCor(yield ~ N + P + K, data=npk,main="Standardized Regression")#show this as a figure
mod0 <- setCor(yield ~ N + P + K, data=npk,std=FALSE,main="Unstandardized Regression") #do not standardize
par(mfrow=c(1,1)) #set it back to a one panel graph
If the IV is categorical and has just two levels, we can examine the effect of the IV on the DV by doing a t-test with the t.test function.
t.test(yield ~ K, data = npk)
##
## Welch Two Sample t-test
##
## data: yield by K
## t = 1.6374, df = 17.621, p-value = 0.1193
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.135414 9.102081
## sample estimates:
## mean in group 0 mean in group 1
## 56.86667 52.88333
The t-test is a special case of the linear model (in this case the X variable has just two values). We can do the same analysis as the t-test using the linear model function lm.
model <- lm(yield ~ K, data = npk)
summary(model)
##
## Call:
## lm(formula = yield ~ K, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.667 -4.083 1.217 4.167 12.633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.867 1.720 33.059 <2e-16 ***
## K1 -3.983 2.433 -1.637 0.116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.959 on 22 degrees of freedom
## Multiple R-squared: 0.1086, Adjusted R-squared: 0.06812
## F-statistic: 2.681 on 1 and 22 DF, p-value: 0.1158
There are several ways of showing this result. We can draw error bars for each group using error.bars.by.
error.bars.by(yield ~ K,data=npk)
Or we can use the setCor function to do the linear model and draw the graph. Unfortunately, there is a bug (just discovered) for the case of a single IV in setCor and we need to do a workaround. Note that the t statistic given for the regression slope is the same as the t statistic when doing the normal t-test.
tempR <- lowerCor(levels2numeric(npk)) #convert to numeric
## block N P K yield
## block 1.00
## N 0.00 1.00
## P 0.00 0.00 1.00
## K 0.00 0.00 0.00 1.00
## yield -0.16 0.46 -0.10 -0.33 1.00
setCor(yield ~ K, data = tempR, n.obs=24) #do the regression from the correlations
## Call: setCor(y = yield ~ K, data = tempR, n.obs = 24)
##
## Multiple Regression from matrix input
##
## DV = yield
## slope se t p lower.ci upper.ci VIF
## K -0.33 0.2 -1.64 0.12 -0.75 0.09 1
##
## Residual Standard Error = 0.97 with 22 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## yield 0.33 0.11 0.33 0.11 0.07 0.1 2.68 1 22 0.116
Analysis of variance is a special case of the linear model. We can see this by comparing two anovas with two multiple regressions. The first uses just a single independent variable (IV) while the second uses all three predictors. The last analysis (using setCor) does the same thing as lm but draws a path model as well.
mod1 <- aov(yield ~ N,data=npk)
lm1 <- lm(yield ~ N, data=npk)
summary(mod1)
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.061 0.0221 *
## Residuals 22 687.1 31.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm1) #identical to the anova (in terms of p values, ts are sqrts of Fs)
##
## Call:
## lm(formula = yield ~ N, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8833 -3.7667 0.1667 3.5583 11.8167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.067 1.613 32.274 <2e-16 ***
## N1 5.617 2.281 2.462 0.0221 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.588 on 22 degrees of freedom
## Multiple R-squared: 0.216, Adjusted R-squared: 0.1803
## F-statistic: 6.061 on 1 and 22 DF, p-value: 0.02213
mod2 <- aov(yield ~ N + P + K,data=npk)
lm2 <- lm(yield ~ N + P + K,data=npk)
summary(mod2)
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.488 0.0192 *
## P 1 8.4 8.40 0.288 0.5974
## K 1 95.2 95.20 3.263 0.0859 .
## Residuals 20 583.5 29.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm2) #also identical in terms of p values
##
## Call:
## lm(formula = yield ~ N + P + K, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2667 -3.6542 0.7083 3.4792 9.3333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.650 2.205 24.784 <2e-16 ***
## N1 5.617 2.205 2.547 0.0192 *
## P1 -1.183 2.205 -0.537 0.5974
## K1 -3.983 2.205 -1.806 0.0859 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.401 on 20 degrees of freedom
## Multiple R-squared: 0.3342, Adjusted R-squared: 0.2343
## F-statistic: 3.346 on 3 and 20 DF, p-value: 0.0397
#now do the analysis but draw a path model as well
setCor(yield ~ N + P + K,data=npk,main = "Regression Model using setCor")
## Call: setCor(y = yield ~ N + P + K, data = npk, main = "Regression Model using setCor")
##
## Multiple Regression from raw data
##
## DV = yield
## slope se t p lower.ci upper.ci VIF
## (Intercept) 0.00 0.18 0.00 1.000 -0.38 0.38 1
## N 0.46 0.18 2.55 0.019 0.08 0.85 1
## P -0.10 0.18 -0.54 0.600 -0.48 0.28 1
## K -0.33 0.18 -1.81 0.086 -0.71 0.05 1
##
## Residual Standard Error = 0.88 with 20 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## yield 0.58 0.33 0.52 0.27 0.23 0.12 3.35 3 20 0.0397
Note that although the output looks very different, the t values of the path coefficients are the same as the square roots of the F values in the aov model.
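We can verify this correspondence directly (a minimal check using the lm2 and mod2 objects created above; the sign of each slope is lost when squaring):
summary(lm2)$coefficients[-1, "t value"] #the t values for N, P, and K from the linear model
sqrt(summary(mod2)[[1]][["F value"]][1:3]) #the square roots of the F values from the anova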
Regression is most appropriate with continuous predictors. We do this for one predictor (SATV) and one DV (ACT). We show the data points, add the best fitting regression line, and then summarize the model.
plot(ACT ~ SATV, data= sat.act) #show the points
abline(lm(ACT ~ SATV, data=sat.act)) #show the best fitting line
summary(lm(ACT ~ SATV, data=sat.act)) #summarize the results
##
## Call:
## lm(formula = ACT ~ SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.666 -2.454 -0.031 2.548 16.334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.871813 0.833334 16.65 <2e-16 ***
## SATV 0.023970 0.001339 17.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.996 on 698 degrees of freedom
## Multiple R-squared: 0.3148, Adjusted R-squared: 0.3138
## F-statistic: 320.7 on 1 and 698 DF, p-value: < 2.2e-16
But typically, we have more predictors. Although showing the individual regressions is possible (and we do so in a two panel graph), we can also summarize them graphically with a path model (as we did earlier). When doing regressions, the slopes are in the units of the predictor (e.g., SATV) and are sometimes hard to interpret. Thus, we can express the slopes as standardized values to make them more understandable.
par(mfrow=c(1,2)) #create a two panel graph
plot(ACT ~ SATV, data= sat.act) #show the points
abline(lm(ACT ~ SATV, data=sat.act))
plot(ACT ~ SATQ, data= sat.act) #show the points
abline(lm(ACT ~ SATQ, data=sat.act))
#do another 2 panel graph. We suppress the actual output
mod3 <- setCor(ACT ~ SATV + SATQ, data=sat.act,main="Standardized (Correlations)")
mod4 <- setCor(ACT ~ SATV + SATQ, data=sat.act, std=FALSE,main="unstandardized")
par(mfrow=c(1,1)) #convert back to the normal 1 panel graph
summary(lm(ACT ~ SATV + SATQ, data=sat.act))
##
## Call:
## lm(formula = ACT ~ SATV + SATQ, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6795 -2.3076 0.0831 2.1924 18.5489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.565946 0.854058 12.371 < 2e-16 ***
## SATV 0.013284 0.001649 8.054 3.56e-15 ***
## SATQ 0.016142 0.001616 9.990 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.743 on 684 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.4015, Adjusted R-squared: 0.3997
## F-statistic: 229.4 on 2 and 684 DF, p-value: < 2.2e-16
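The standardized slopes that setCor reported above (mod3) can also be found directly from the correlation matrix as \(\beta = R_{xx}^{-1}r_{xy}\). A minimal sketch (using pairwise complete correlations, so the values agree with mod3 up to how the missing SATQ scores are handled):
R <- cor(sat.act[c("SATV","SATQ","ACT")], use="pairwise")
Rxx <- R[c("SATV","SATQ"), c("SATV","SATQ")] #correlations among the predictors
rxy <- R[c("SATV","SATQ"), "ACT"] #correlations of the predictors with the criterion
solve(Rxx, rxy) #the standardized regression weights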
An important question is whether one variable affects the relationship between another IV and the DV. In analysis of variance terms, this is asking whether there is an interaction. The standard anova approach handles this by including the interaction term (shown as a product in the formula). But when we try this with regression, things get more complicated.
mod2 <- aov(yield ~ N * P * K,data = npk)
summary(mod2)
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.161 0.0245 *
## P 1 8.4 8.40 0.273 0.6082
## K 1 95.2 95.20 3.099 0.0975 .
## N:P 1 21.3 21.28 0.693 0.4175
## N:K 1 33.1 33.14 1.078 0.3145
## P:K 1 0.5 0.48 0.016 0.9019
## N:P:K 1 37.0 37.00 1.204 0.2887
## Residuals 16 491.6 30.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But when we do this as a linear model, it doesn't match! Compare the p values. Only the 3 way interaction is correct.
summary(lm(yield ~ N * P * K,data = npk))
##
## Call:
## lm(formula = yield ~ N * P * K, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.133 -4.133 1.250 3.125 8.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.4333 3.2002 16.072 2.7e-11 ***
## N1 12.3333 4.5258 2.725 0.015 *
## P1 2.9000 4.5258 0.641 0.531
## K1 0.5667 4.5258 0.125 0.902
## N1:P1 -8.7333 6.4004 -1.365 0.191
## N1:K1 -9.6667 6.4004 -1.510 0.150
## P1:K1 -4.4000 6.4004 -0.687 0.502
## N1:P1:K1 9.9333 9.0515 1.097 0.289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.543 on 16 degrees of freedom
## Multiple R-squared: 0.4391, Adjusted R-squared: 0.1937
## F-statistic: 1.789 on 7 and 16 DF, p-value: 0.1586
The problem is that the product terms are correlated with the main effects. We can show this by forming the four product terms (tedious, but straightforward), putting the results into a data.frame, and then plotting it. Note how the product terms are correlated with their elements. This makes sense, but it causes problems if we want to do the modeling.
First we create a new data.frame (NPK) which is made up of the numeric equivalents of what was in npk.
NPK <- char2numeric(npk)
describe(npk) #the original data are not technically numeric, they are factors
## vars n mean sd median trimmed mad min max range skew kurtosis se
## block* 1 24 3.50 1.74 3.50 3.50 2.22 1.0 6.0 5.0 0.00 -1.41 0.36
## N* 2 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## P* 3 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## K* 4 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## yield 5 24 54.88 6.17 55.65 54.75 6.15 44.2 69.5 25.3 0.24 -0.51 1.26
describe(NPK) #we have converted them to numeric
## vars n mean sd median trimmed mad min max range skew kurtosis se
## block 1 24 3.50 1.74 3.50 3.50 2.22 1.0 6.0 5.0 0.00 -1.41 0.36
## N 2 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## P 3 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## K 4 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## yield 5 24 54.88 6.17 55.65 54.75 6.15 44.2 69.5 25.3 0.24 -0.51 1.26
Now show the correlations of the product terms and the main effects. We use the gap parameter in pairs.panels to show a slightly more compact graphic.
NP <- NPK$N * NPK$P
NK <- NPK$N * NPK$K
PK <- NPK$P * NPK$K
PKN <- PK * NPK$N
NPK.prods <- data.frame(NPK,NP,NK,PK,PKN)
pairs.panels(NPK.prods,gap=0) #show the correlations, tighten up the figure
The simple products are correlated with each other and with the basic factors. To get around this problem, we center the data using the scale function. scale will center the data, and if scale=TRUE it will also standardize (divide by the standard deviation). Because the default is to standardize, we specify scale=FALSE. We also need to convert the output of scale from a matrix to a data.frame.
centered.NPK <- scale(NPK,scale=FALSE) #this centers the data but returns a matrix
centered.NPK <- data.frame(scale(NPK,scale=FALSE)) #so we do it again and convert the result to a data.frame
c.NP <- centered.NPK$N * centered.NPK$P
c.NK <- centered.NPK$N * centered.NPK$K
c.PK <- centered.NPK$P * centered.NPK$K
c.PKN <- c.PK * centered.NPK$N
center.prod <- data.frame(centered.NPK,c.NP,c.NK,c.PK,c.PKN)
pairs.panels(center.prod) #show the results graphically
lowerCor(center.prod) #show it as a correlation matrix as well
## block N P K yield c.NP c.NK c.PK c.PKN
## block 1.00
## N 0.00 1.00
## P 0.00 0.00 1.00
## K 0.00 0.00 0.00 1.00
## yield -0.16 0.46 -0.10 -0.33 1.00
## c.NP 0.00 0.00 0.00 0.00 -0.16 1.00
## c.NK 0.00 0.00 0.00 0.00 -0.19 0.00 1.00
## c.PK 0.00 0.00 0.00 0.00 0.02 0.00 0.00 1.00
## c.PKN -0.29 0.00 0.00 0.00 0.21 0.00 0.00 0.00 1.00
Note the products are now uncorrelated. If we now use the linear model function, it agrees with our anova output.
lm.mod <- lm(yield ~ N * P * K, data = centered.NPK) #do the analysis on the centered data
summary(lm.mod) #show the moderated multiple regression
##
## Call:
## lm(formula = yield ~ N * P * K, data = centered.NPK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.133 -4.133 1.250 3.125 8.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.269e-15 1.131e+00 0.000 1.0000
## N 5.617e+00 2.263e+00 2.482 0.0245 *
## P -1.183e+00 2.263e+00 -0.523 0.6082
## K -3.983e+00 2.263e+00 -1.760 0.0975 .
## N:P -3.767e+00 4.526e+00 -0.832 0.4175
## N:K -4.700e+00 4.526e+00 -1.038 0.3145
## P:K 5.667e-01 4.526e+00 0.125 0.9019
## N:P:K 9.933e+00 9.052e+00 1.097 0.2887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.543 on 16 degrees of freedom
## Multiple R-squared: 0.4391, Adjusted R-squared: 0.1937
## F-statistic: 1.789 on 7 and 16 DF, p-value: 0.1586
summary(mod2) #compare with the original anova output
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.161 0.0245 *
## P 1 8.4 8.40 0.273 0.6082
## K 1 95.2 95.20 3.099 0.0975 .
## N:P 1 21.3 21.28 0.693 0.4175
## N:K 1 33.1 33.14 1.078 0.3145
## P:K 1 0.5 0.48 0.016 0.9019
## N:P:K 1 37.0 37.00 1.204 0.2887
## Residuals 16 491.6 30.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Once again, to do the analysis correctly, we need to center the data.
cen.sat.act <- data.frame(scale(sat.act,scale=FALSE))
lm.mod2 <- lm(ACT ~ gender * SATV , data= cen.sat.act)
summary(lm.mod2)
##
## Call:
## lm(formula = ACT ~ gender * SATV, data = cen.sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4788 -2.3916 -0.0332 2.4952 15.6586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.005219 0.150824 -0.035 0.9724
## gender -0.255094 0.315645 -0.808 0.4193
## SATV 0.023913 0.001337 17.886 <2e-16 ***
## gender:SATV -0.005137 0.002785 -1.844 0.0655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.99 on 696 degrees of freedom
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.3159
## F-statistic: 108.6 on 3 and 696 DF, p-value: < 2.2e-16
There is a suggestion of an interaction between gender and SATV. To see this graphically, we draw the data (using a different plot character for males and females) and then draw two lines, one for males, and one for females. The interaction is a difference in the slopes of these regressions.
plot(ACT ~ SATV, pch= 21-gender, col=c("blue","red")[gender],data=sat.act,main="ACT scores differ by SATV and gender")
abline(lm(ACT ~ SATV,data = sat.act[sat.act$gender==1,]))
abline(lm(ACT ~ SATV,data = sat.act[sat.act$gender==2,]),lty="dashed")
# or another way of doing this is with the by statement
plot(ACT ~ SATV, pch= 21-gender, col=c("blue","red")[gender],data=sat.act,main="ACT scores differ by SATV and gender")
by(sat.act,sat.act$gender, function(x) abline(lm(ACT ~ SATV, data=x),lty=c("solid","dashed")[x$gender]))
## sat.act$gender: 1
## NULL
## ---------------------------------------------------------------------------
## sat.act$gender: 2
## NULL
by(sat.act,sat.act$gender,function(x) lm(ACT ~ SATV, data=x)) #this tells us which is which
## sat.act$gender: 1
##
## Call:
## lm(formula = ACT ~ SATV, data = x)
##
## Coefficients:
## (Intercept) SATV
## 12.03133 0.02724
##
## ---------------------------------------------------------------------------
## sat.act$gender: 2
##
## Call:
## lm(formula = ACT ~ SATV, data = x)
##
## Coefficients:
## (Intercept) SATV
## 14.9214 0.0221
text(220,17,"males")
text(220,22,"females")
Sometimes we want to take a categorical variable in which the categories are arbitrary (e.g., college major, ways of coping) and create a dummy code that has the value of 1 for people in that category and 0 for everyone else. Clearly we need at least 3 categories for this to make sense.
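Before turning to the psych function, here is a minimal base R sketch of the same idea (the major variable and its levels are made up purely for illustration):
major <- factor(c("psych","bio","econ","psych","bio")) #a hypothetical categorical variable
model.matrix(~ major - 1) #one 0/1 column per category, with no intercept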
For the following, we will take the education variable from the spi data set, convert the education levels into 0/1 dummy codes, and then find those items that most correlate with each level of education. We do this with the dummy.code function. First, examine the code of the function itself.
dummy.code #show the code for the function
## function (x, group = NULL, na.rm = TRUE, top = NULL, min = NULL)
## {
## t <- table(x)
## t <- sort(t, decreasing = TRUE)
## if (!is.null(min)) {
## top <- sum(t >= min)
## }
## if (is.null(top))
## top <- length(t)
## t <- t[1:top]
## lt <- length(t)
## n.obs <- length(x)
## if (is.null(group)) {
## new <- matrix(0, nrow = n.obs, ncol = lt)
## if (na.rm) {
## new[is.na(x), ] <- NA
## }
## xlev <- factor(x, levels = names(t))
## for (i in 1:n.obs) {
## new[i, xlev[i]] <- 1
## }
## colnames(new) <- names(t)
## }
## else {
## new <- rep(0, n.obs)
## xlev <- as.factor(x)
## if (na.rm) {
## new[is.na(x)] <- NA
## }
## for (i in 1:n.obs) {
## new[i] <- xlev[i] %in% group
## }
## }
## return(new)
## }
## <bytecode: 0x7fa37df6ef48>
## <environment: namespace:psych>
We can examine each line of this code by trying it.
x <- spi$education
na.rm <- TRUE
t <- table(x) # Call the table function on x
t # we do not do this in the function, but we do it here to examine it
## x
## 1 2 3 4 5 6 7 8
## 394 369 1060 260 123 567 165 392
lt <- length(t) # how many different categories
Then we create the matrix where we want to put the results.
n.obs <- length(x)
new <- matrix(0,nrow=n.obs,ncol=lt)
if(na.rm) {new[is.na(x),] <- NA } #added 10/20/17
headTail(new) #show the first and last 4 rows
## X1 X2 X3 X4 X5 X6 X7 X8
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## ... ... ... ... ... ... ... ... ...
## 3997 0 0 0 0 0 0 0 0
## 3998 0 0 0 0 0 0 0 0
## 3999 0 0 0 0 0 0 0 0
## 4000 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
xlev <- as.factor(x) #show the str of this
str(xlev) #because we will be using these to index the matrix
## Factor w/ 8 levels "1","2","3","4",..: NA 6 4 NA 1 4 7 2 8 8 ...
for (i in 1:n.obs) {new[i,xlev[i]] <- 1} #Do the work
headTail(new) #show the first and last 4 rows
## X1 X2 X3 X4 X5 X6 X7 X8
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 0 0 0 0 0 1 0 0
## 3 0 0 0 1 0 0 0 0
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## ... ... ... ... ... ... ... ... ...
## 3997 0 0 1 0 0 0 0 0
## 3998 0 1 0 0 0 0 0 0
## 3999 0 0 0 0 0 1 0 0
## 4000 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
colnames(new) <- names(t) #if we have names for the categories, they appear here
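As a sanity check on the matrix we just filled in, the column sums (ignoring the NA rows) should reproduce the counts in the table we started from:
colSums(new, na.rm=TRUE) #counts of 1s in each dummy coded column
t #the original table of education levels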
Now try this function on spi$education.
What are the units of education? We can see them by looking at the help file for the spi data set. Respondents' education: less than 12, HS grad, current univ, some univ, associate degree, college degree, in grad/prof, grad/prof degree.
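To keep track of which numeric code means what, we can attach labels to the codes (the abbreviations here are mine, following the ordering in the help file):
edu.labels <- c("<12 yrs","HS grad","current univ","some univ","associate","college degree","in grad/prof","grad/prof degree")
table(factor(spi$education, levels=1:8, labels=edu.labels)) #counts at each level of education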
edu.codes <- dummy.code(spi$education)
dim(edu.codes) #we have created the new variables
## [1] 4000 8
describe(edu.codes) #the means should add up to 1 across all the variables
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 3 1 3330 0.32 0.47 0 0.27 0 0 1 1 0.78 -1.39 0.01
## 6 2 3330 0.17 0.38 0 0.09 0 0 1 1 1.75 1.08 0.01
## 1 3 3330 0.12 0.32 0 0.02 0 0 1 1 2.36 3.58 0.01
## 8 4 3330 0.12 0.32 0 0.02 0 0 1 1 2.37 3.62 0.01
## 2 5 3330 0.11 0.31 0 0.01 0 0 1 1 2.48 4.14 0.01
## 4 6 3330 0.08 0.27 0 0.00 0 0 1 1 3.14 7.89 0.00
## 7 7 3330 0.05 0.22 0 0.00 0 0 1 1 4.15 15.22 0.00
## 5 8 3330 0.04 0.19 0 0.00 0 0 1 1 4.91 22.10 0.00
Another function, bestScales, can use those codes to find which items most correlate with each category, treating each dummy code as a criterion.
bs.edu <- bestScales(spi[-6],edu.codes,dictionary=spi.dictionary) #don't use education as a predictor
bs.edu
##
## Call = bestScales(x = spi[-6], criteria = edu.codes, dictionary = spi.dictionary)
## The items most correlated with the criteria yield r's of
## correlation n.items
## 3 0.28 1
## 6 0.22 1
## 1 0.35 10
## 8 0.42 6
## 2 0.13 3
## 4 0.16 4
## 7 NA 0
## 5 0.11 1
##
## The best items, their correlations and content are
## $`3`
## 3 item_id item item_scale resp_type
## age -0.28 age age in years <NA> <NA>
##
## $`6`
## 6 item_id item item_scale resp_type
## age 0.22 age age in years <NA> <NA>
##
## $`1`
## 1 item_id item item_scale resp_type
## age -0.33 age age in years <NA> <NA>
## smoke -0.14 smoke smoking (never ... > 20/day) <NA> <NA>
## q_578 0.13 q_578 Dislike myself. IPIP reg
## q_1505 0.13 q_1505 Panic easily. IPIP reg
## q_4296 0.13 q_4296 Tell a lot of lies. EPQ:P reg
## q_952 0.13 q_952 Get angry easily. IPIP reg
## q_4249 0.13 q_4249 Would call myself a nervous person. EPQ:N reg
## q_501 0.11 q_501 Cheat to get ahead. IPIP reg
## q_35 0.11 q_35 Act without thinking. IPIP reg
## q_811 0.11 q_811 Feel a sense of worthlessness or hopelessness. IPIP reg
##
## $`8`
## 8 item_id item item_scale resp_type
## age 0.41 age age in years <NA> <NA>
## q_1132 0.15 q_1132 Have read the great literary classics. IPIP reg
## q_1024 -0.13 q_1024 Hang around doing nothing. IPIP reg
## q_422 0.12 q_422 Can handle a lot of information. IPIP reg
## q_4249 -0.11 q_4249 Would call myself a nervous person. EPQ:N reg
## q_1452 -0.10 q_1452 Neglect my duties. IPIP reg
##
## $`2`
## 2 item_id item item_scale resp_type
## age -0.12 age age in years <NA> <NA>
## q_1694 -0.11 q_1694 Set high standards for myself and others. IPIP reg
## p1edu -0.10 p1edu Parent 1 education <NA> <NA>
##
## $`4`
## 4 item_id item item_scale resp_type
## smoke 0.16 smoke smoking (never ... > 20/day) <NA> <NA>
## age 0.12 age age in years <NA> <NA>
## q_2765 -0.11 q_2765 Am happy with my life. IPIP reg
## health -0.11 health self rated health (poor ... excellent) <NA> <NA>
##
## $`7`
## [1] 7 y
## <0 rows> (or 0-length row.names)
##
## $`5`
## 5 item_id item item_scale resp_type
## age 0.11 age age in years <NA> <NA>
The solution just listed is sensitive to sample variations. We can test how sensitive it is by using the bootstrap. Bootstrapping is basically taking the observed sample as the population and then sampling from that population multiple times (with replacement). This leads to a number of samples from the 'population' that differ somewhat from each other. By examining how much they vary we can get a degree of confidence in our results. We will do the bootstrap 100 times and then examine the pooled results.
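To make the idea concrete, here is what a single bootstrap resample looks like in base R (a sketch only; bestScales does this n.iter times internally):
set.seed(42) #an arbitrary seed, just for reproducibility
n <- nrow(spi)
boot.rows <- sample(n, n, replace=TRUE) #some rows are repeated, others omitted
boot.sample <- spi[boot.rows,]
dim(boot.sample) #the same size as the original data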
bs.edu <- bestScales(spi[-6],edu.codes,dictionary=spi.dictionary[1:2], n.iter=100)
bs.edu
##
## Call = bestScales(x = spi[-6], criteria = edu.codes, n.iter = 100, dictionary = spi.dictionary[1:2])
## derivation.mean derivation.sd validation.m validation.sd final.valid final.wtd N.wtd
## 3 0.28 0.014 0.280 0.018 0.28 0.30 10
## 6 0.24 0.024 0.229 0.024 0.22 0.21 10
## 1 0.36 0.012 0.341 0.018 0.33 0.32 10
## 8 0.42 0.018 0.396 0.030 0.42 0.36 10
## 2 0.15 0.026 0.130 0.025 0.12 0.21 10
## 4 0.17 0.032 0.144 0.028 0.16 0.22 10
## 7 0.11 0.012 0.030 0.024 NA 0.14 10
## 5 0.12 0.015 0.098 0.023 NA 0.15 10
##
## Best items on each scale with counts of replications
##
## Criterion = 3
## Freq mean.r sd.r item_id item
## age 100 -0.28 0.01 age age in years
##
## Criterion = 6
## Freq mean.r sd.r item_id item
## age 100 0.22 0.02 age age in years
##
## Criterion = 1
## Freq mean.r sd.r item_id item
## age 100 -0.33 0.01 age age in years
##
## Criterion = 8
## Freq mean.r sd.r item_id item
## age 100 0.41 0.02 age age in years
## q_1132 100 0.16 0.02 q_1132 Have read the great literary classics.
## q_1024 95 -0.13 0.02 q_1024 Hang around doing nothing.
##
## Criterion = 2
## Freq mean.r sd.r item_id item
## age 95 -0.12 0.01 age age in years
##
## Criterion = 4
## Freq mean.r sd.r item_id item
## smoke 100 0.16 0.02 smoke smoking (never ... > 20/day)
##
## Criterion = 7
## Freq mean.r sd.r item_id item
## q_871 9 -0.08 0.02 q_871 Feel that most people cant be trusted.
##
## Criterion = 5
## Freq mean.r sd.r item_id item
## age 73 0.11 0.02 age age in years