psych has been updated to 2.4.4. Get the latest release from CRAN.
The next line (to install the latest release) has been commented out because I have already updated my version. However, you should do this once.
#install.packages("psych",repos="http://personality-project.org/r",type="source")
library(psych) #make it active
library(psychTools) #make this active as well
sessionInfo() #make sure it is there
## R version 4.4.0 beta (2024-04-12 r86412)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] psychTools_2.4.4 psych_2.4.4
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-164 cli_3.6.1 knitr_1.43 rlang_1.1.1 xfun_0.39
## [6] jsonlite_1.8.5 rtf_0.4-14.1 htmltools_0.5.8.1 sass_0.4.9 rmarkdown_2.26
## [11] grid_4.4.0 evaluate_0.21 jquerylib_0.1.4 fastmap_1.1.1 yaml_2.3.7
## [16] lifecycle_1.0.3 compiler_4.4.0 rstudioapi_0.16.0 R.oo_1.26.0 lattice_0.22-6
## [21] digest_0.6.31 R6_2.5.1 foreign_0.8-86 mnormt_2.1.1 parallel_4.4.0
## [26] bslib_0.7.0 R.methodsS3_1.8.2 tools_4.4.0 cachem_1.0.8
#news(Version > "2.2.4",package="psych") #read the news for this release
packageDate(pkg="psych") #should be April 23
## [1] "2024-04-23"
There are many forms of the linear model.
\[y = b_1x + e\] is the classic regression model. If x is a dichotomous variable, this is just a t-test. If x is categorical, we can do this as an analysis of variance. In this case, we think of x causing y. The slope (\(b_1\)) is the covariance of x and y divided by the variance of x:
\[b_1 = \frac{cov_{xy}}{var_x} = \frac{\sigma_{xy}}{\sigma_x^2} \]
If we instead think of y causing x, this becomes:
\[x = b_2y + e\] where the slope is now \(b_2 = \frac{\sigma_{xy}}{\sigma_y^2}\).
If we are unsure of the direction of causality, we can take the geometric mean of the two regression slopes, which gives the correlation:
\[r_{xy} = \frac{cov_{xy}}{\sqrt{\sigma^2_x \sigma^2_y}} = cov_{z_x z_y}\] which is the covariance of x and y divided by the square root of the product of the two variances, or equivalently, the covariance of the standard scores.
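To make these formulas concrete, here is a minimal numerical sketch (using simulated data, not anything from this tutorial) showing that the slope is the covariance divided by the variance of the predictor, and that the correlation is the covariance of the standard scores.
set.seed(17) #any seed will do; the data are purely illustrative
x <- rnorm(100)
y <- .5 * x + rnorm(100)
cov(x, y) / var(x) #b1 = cov(x,y)/var(x)
coef(lm(y ~ x))["x"] #matches the slope from lm
cov(scale(x), scale(y)) #covariance of the standard scores ...
cor(x, y) #... equals the correlation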
If we have more than one predictor, we have multiple regression:
\[y = b_1 x_1 + b_2 x_2 + e\] If \(x_1\) and \(x_2\) are categorical, this is also known as an analysis of variance.
First, just show the npk dataset from core R. Can you see any relationships here?
npk #The data come from Venables
## block N P K yield
## 1 1 0 1 1 49.5
## 2 1 1 1 0 62.8
## 3 1 0 0 0 46.8
## 4 1 1 0 1 57.0
## 5 2 1 0 0 59.8
## 6 2 1 1 1 58.5
## 7 2 0 0 1 55.5
## 8 2 0 1 0 56.0
## 9 3 0 1 0 62.8
## 10 3 1 1 1 55.8
## 11 3 1 0 0 69.5
## 12 3 0 0 1 55.0
## 13 4 1 0 0 62.0
## 14 4 1 1 1 48.8
## 15 4 0 0 1 45.5
## 16 4 0 1 0 44.2
## 17 5 1 1 0 52.0
## 18 5 0 0 0 51.5
## 19 5 1 0 1 49.8
## 20 5 0 1 1 48.8
## 21 6 1 0 1 57.2
## 22 6 1 1 0 59.0
## 23 6 0 1 1 53.2
## 24 6 0 0 0 56.0
As we do for any data set, we want to describe it and perhaps examine the pairwise relationships.
describe(npk)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## block* 1 24 3.50 1.74 3.50 3.50 2.22 1.0 6.0 5.0 0.00 -1.41 0.36
## N* 2 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## P* 3 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## K* 4 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## yield 5 24 54.88 6.17 55.65 54.75 6.15 44.2 69.5 25.3 0.24 -0.51 1.26
pairs.panels(npk)
When we describe the data, some variables are marked with an *, which indicates that they are not numeric (they are actually Factors with levels) but are treated by describe as numeric. We can examine the str (structure) of the data to see this.
The pairs.panels function automatically converts Factor or Character values to numeric. The older setCor function did not, but its replacement lmCor does. We used to need to do this conversion by hand, but no longer do.
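If you did want to do the conversion by hand, one minimal approach (just a sketch; the char2numeric function used later is easier) is to replace each factor by its integer codes:
npk.num <- data.frame(lapply(npk, function(f) if (is.factor(f)) as.numeric(f) else f)) #factors become 1, 2, ...
str(npk.num) #now everything is numeric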
str(npk)
## 'data.frame': 24 obs. of 5 variables:
## $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ N : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
## $ P : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
## $ K : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
## $ yield: num 49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
Examining the correlations of the npk data, we see that there are four experimental variables (IVs) and one dependent variable (DV). Because this is a balanced experiment, the correlations among the IVs are all exactly zero, and the IVs have different relationships with the DV.
Standardized regressions are just correlations. We can show the relationships as a path model using the lmCor function. By default, we show the standardized solution, which expresses everything as correlations (left panel). However, we may also request the unstandardized solution (right panel) to see the raw variance components.
par(mfrow=c(1,2)) #set up a two panel graph
mod0 <- lmCor(yield ~ N + P + K, data=npk,main="Standardized Regression")#show this as a figure
mod0 <- lmCor(yield ~ N + P + K, data=npk,std=FALSE,main="Unstandardized Regression") #do not standardize
par(mfrow=c(1,1)) #set it back to a one panel graph
If the IV is categorical and has just two levels, we can examine the effect of the IV on the DV by doing a t-test.
t.test(yield ~ K, data = npk)
##
## Welch Two Sample t-test
##
## data: yield by K
## t = 1.6374, df = 17.621, p-value = 0.1193
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.135414 9.102081
## sample estimates:
## mean in group 0 mean in group 1
## 56.86667 52.88333
The t-test is a special case of the linear model (in this case the X variable has just two values). We can do the same analysis as the t-test but use the linear model function lm.
model <- lm(yield ~ K, data = npk)
summary(model)
##
## Call:
## lm(formula = yield ~ K, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.667 -4.083 1.217 4.167 12.633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.867 1.720 33.059 <2e-16 ***
## K1 -3.983 2.433 -1.637 0.116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.959 on 22 degrees of freedom
## Multiple R-squared: 0.1086, Adjusted R-squared: 0.06812
## F-statistic: 2.681 on 1 and 22 DF, p-value: 0.1158
Or, we can use the lmCor function to do this.
model1 <- lmCor(yield ~ K, data = npk)
There are several ways of showing this result. We can draw error bars by group using error.bars.by.
error.bars.by(yield ~ K,data=npk)
Or we can use the lmCor function to do the linear model and draw the graph. Note that the t statistic given for the regression slope is the same as the t statistic when doing the normal t-test.
lmCor(yield ~ K, data = npk) #same syntax as lm
## Call: lmCor(y = yield ~ K, data = npk)
##
## Multiple Regression from raw data
##
## DV = yield
## slope se t p lower.ci upper.ci VIF Vy.x
## (Intercept) 0.00 0.2 0.00 1.00 -0.42 0.42 1 0.00
## K -0.33 0.2 -1.64 0.12 -0.75 0.09 1 0.11
##
## Residual Standard Error = 0.97 with 22 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## yield 0.33 0.11 0.33 0.11 0.07 0.1 2.68 1 22 0.116
Analysis of variance is a special case of the linear model. We can see this when we compare two anovas with the corresponding multiple regressions. The first uses just a single Independent Variable (IV) while the second uses all three predictors. The last analysis (using lmCor) does the same as lm but draws a path model as well.
mod1 <- aov(yield ~ N,data=npk)
lm1 <- lm(yield ~ N, data=npk)
summary(mod1)
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.061 0.0221 *
## Residuals 22 687.1 31.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm1) #identical to the anova (in terms of p values, ts are sqrts of Fs)
##
## Call:
## lm(formula = yield ~ N, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8833 -3.7667 0.1667 3.5583 11.8167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.067 1.613 32.274 <2e-16 ***
## N1 5.617 2.281 2.462 0.0221 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.588 on 22 degrees of freedom
## Multiple R-squared: 0.216, Adjusted R-squared: 0.1803
## F-statistic: 6.061 on 1 and 22 DF, p-value: 0.02213
mod2 <- aov(yield ~ N + P + K,data=npk)
lm2 <- lm(yield ~ N + P + K,data=npk)
summary(mod2)
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.488 0.0192 *
## P 1 8.4 8.40 0.288 0.5974
## K 1 95.2 95.20 3.263 0.0859 .
## Residuals 20 583.5 29.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm2) #also identical in terms of p values
##
## Call:
## lm(formula = yield ~ N + P + K, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2667 -3.6542 0.7083 3.4792 9.3333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.650 2.205 24.784 <2e-16 ***
## N1 5.617 2.205 2.547 0.0192 *
## P1 -1.183 2.205 -0.537 0.5974
## K1 -3.983 2.205 -1.806 0.0859 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.401 on 20 degrees of freedom
## Multiple R-squared: 0.3342, Adjusted R-squared: 0.2343
## F-statistic: 3.346 on 3 and 20 DF, p-value: 0.0397
#now do the analysis but draw a path model as well
lmCor(yield ~ N + P + K,data=npk,main = "Regression Model using lmCor")
## Call: lmCor(y = yield ~ N + P + K, data = npk, main = "Regression Model using lmCor")
##
## Multiple Regression from raw data
##
## DV = yield
## slope se t p lower.ci upper.ci VIF Vy.x
## (Intercept) 0.00 0.18 0.00 1.000 -0.38 0.38 1 0.00
## N 0.46 0.18 2.55 0.019 0.08 0.85 1 0.22
## P -0.10 0.18 -0.54 0.600 -0.48 0.28 1 0.01
## K -0.33 0.18 -1.81 0.086 -0.71 0.05 1 0.11
##
## Residual Standard Error = 0.88 with 20 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## yield 0.58 0.33 0.52 0.27 0.23 0.12 3.35 3 20 0.0397
Note that although the output looks very different, the t values of the path coefficients are the square roots of the F values in the aov model.
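As a quick check (this is not part of the original output, and assumes lm2 and mod2 from the chunk above are still in the workspace), we can square the t values from the lm fit and compare them with the F values from the aov table:
coef(summary(lm2))[-1, "t value"]^2 #squared t values, dropping the intercept row
summary(mod2)[[1]][["F value"]] #F values from the anova table (the last entry, for Residuals, is NA)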
Regression is most appropriate with continuous predictors. We do this for one predictor (SATV) and one DV (ACT). We plot the data points, add the best fitting regression line, and then summarize the model.
plot(ACT ~ SATV, data= sat.act) #show the points
abline(lm(ACT ~ SATV, data=sat.act)) #show the best fitting line
summary(lm(ACT ~ SATV, data=sat.act)) #summarize the results
##
## Call:
## lm(formula = ACT ~ SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.666 -2.454 -0.031 2.548 16.334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.871813 0.833334 16.65 <2e-16 ***
## SATV 0.023970 0.001339 17.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.996 on 698 degrees of freedom
## Multiple R-squared: 0.3148, Adjusted R-squared: 0.3138
## F-statistic: 320.7 on 1 and 698 DF, p-value: < 2.2e-16
But typically we have more predictors. Although showing the individual regressions is possible (and we do so as a two panel graph), we can also summarize them graphically with a path model, as we did earlier. When doing regressions, the slopes are in the units of the predictor (e.g., SATV) and are sometimes hard to interpret. Thus, we can express the slopes as standardized values to make them more understandable.
par(mfrow=c(1,2)) #create a two panel graph
plot(ACT ~ SATV, data= sat.act) #show the points
abline(lm(ACT ~ SATV, data=sat.act))
plot(ACT ~ SATQ, data= sat.act) #show the points
abline(lm(ACT ~ SATQ, data=sat.act))
#do another 2 panel graph; we suppress the actual output
mod3 <- lmCor(ACT ~ SATV + SATQ, data=sat.act,main="Standardized (Correlations)")
mod4 <- lmCor(ACT ~ SATV + SATQ, data=sat.act, std=FALSE,main="unstandardized")
par(mfrow=c(1,1)) #convert back to the normal 1 panel graph
summary(lm(ACT ~ SATV + SATQ, data=sat.act))
##
## Call:
## lm(formula = ACT ~ SATV + SATQ, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6795 -2.3076 0.0831 2.1924 18.5489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.565946 0.854058 12.371 < 2e-16 ***
## SATV 0.013284 0.001649 8.054 3.56e-15 ***
## SATQ 0.016142 0.001616 9.990 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.743 on 684 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.4015, Adjusted R-squared: 0.3997
## F-statistic: 229.4 on 2 and 684 DF, p-value: < 2.2e-16
An important question is whether one variable affects the relationship between another IV and DV. In analysis of variance terms, this is asking if there is an interaction. The standard anova approach just does this by including the interaction term (shown as a product in the formula). But when we try this with regression, things get more complicated.
mod2 <- aov(yield ~ N * P * K,data = npk)
summary(mod2)
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.161 0.0245 *
## P 1 8.4 8.40 0.273 0.6082
## K 1 95.2 95.20 3.099 0.0975 .
## N:P 1 21.3 21.28 0.693 0.4175
## N:K 1 33.1 33.14 1.078 0.3145
## P:K 1 0.5 0.48 0.016 0.9019
## N:P:K 1 37.0 37.00 1.204 0.2887
## Residuals 16 491.6 30.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But when we do this as a linear model, it doesn't match! Compare the p values. Only the 3-way interaction is correct.
summary(lm(yield ~ N * P * K,data = npk))
##
## Call:
## lm(formula = yield ~ N * P * K, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.133 -4.133 1.250 3.125 8.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.4333 3.2002 16.072 2.7e-11 ***
## N1 12.3333 4.5258 2.725 0.015 *
## P1 2.9000 4.5258 0.641 0.531
## K1 0.5667 4.5258 0.125 0.902
## N1:P1 -8.7333 6.4004 -1.365 0.191
## N1:K1 -9.6667 6.4004 -1.510 0.150
## P1:K1 -4.4000 6.4004 -0.687 0.502
## N1:P1:K1 9.9333 9.0515 1.097 0.289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.543 on 16 degrees of freedom
## Multiple R-squared: 0.4391, Adjusted R-squared: 0.1937
## F-statistic: 1.789 on 7 and 16 DF, p-value: 0.1586
The problem is that the product terms are correlated with the main effects. We can show this by forming the product terms. Although tedious, it is possible to find the 4 product terms. We do this, and then put the results into a data.frame, which we then plot. Note how the product terms are correlated with their elements. This makes sense, but is a problem if we want to do the modeling.
First we create a new data.frame (NPK) which is made up of the numeric equivalents of what was in npk.
NPK <- char2numeric(npk,flag=FALSE)
describe(npk) #the original data are not technically numeric, they are factors
## vars n mean sd median trimmed mad min max range skew kurtosis se
## block* 1 24 3.50 1.74 3.50 3.50 2.22 1.0 6.0 5.0 0.00 -1.41 0.36
## N* 2 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## P* 3 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## K* 4 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## yield 5 24 54.88 6.17 55.65 54.75 6.15 44.2 69.5 25.3 0.24 -0.51 1.26
describe(NPK) #we have converted them to numeric
## vars n mean sd median trimmed mad min max range skew kurtosis se
## block 1 24 3.50 1.74 3.50 3.50 2.22 1.0 6.0 5.0 0.00 -1.41 0.36
## N 2 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## P 3 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## K 4 24 1.50 0.51 1.50 1.50 0.74 1.0 2.0 1.0 0.00 -2.08 0.10
## yield 5 24 54.88 6.17 55.65 54.75 6.15 44.2 69.5 25.3 0.24 -0.51 1.26
Now show the correlations of the product terms and the main effects. We use the gap parameter in pairs.panels to show a slightly more compact graphic.
NP <- NPK$N * NPK$P
NK <- NPK$N * NPK$K
PK <- NPK$P * NPK$K
PKN <- PK * NPK$N
NPK.prods <- data.frame(NPK,NP,NK,PK,PKN)
pairs.panels(NPK.prods,gap=0) #show the correlations, tighten up the figure
The simple products are correlated with each other and with the basic factors. To get around this problem, we center the data using the scale function. scale will center the data, and if scale=TRUE it will also standardize (divide by the standard deviation). Because the default is to standardize, we specify scale=FALSE. We also need to convert the output of the scale function from a matrix to a data.frame.
centered.NPK <- scale(NPK,scale=FALSE)
centered.NPK <- data.frame(scale(NPK,scale=FALSE))
c.NP <- centered.NPK$N * centered.NPK$P
c.NK <- centered.NPK$N * centered.NPK$K
c.PK <- centered.NPK$P * centered.NPK$K
c.PKN <- c.PK * centered.NPK$N
center.prod <- data.frame(centered.NPK,c.NP,c.NK,c.PK,c.PKN)
pairs.panels(center.prod) #show the results grapically
lowerCor(center.prod) #show it a correlation matrix as well
## block N P K yield c.NP c.NK c.PK c.PKN
## block 1.00
## N 0.00 1.00
## P 0.00 0.00 1.00
## K 0.00 0.00 0.00 1.00
## yield -0.16 0.46 -0.10 -0.33 1.00
## c.NP 0.00 0.00 0.00 0.00 -0.16 1.00
## c.NK 0.00 0.00 0.00 0.00 -0.19 0.00 1.00
## c.PK 0.00 0.00 0.00 0.00 0.02 0.00 0.00 1.00
## c.PKN -0.29 0.00 0.00 0.00 0.21 0.00 0.00 0.00 1.00
Note the products are now uncorrelated. If we now use the linear model function, it agrees with our anova output.
lm.mod <- lm(yield ~ N * P * K, data = centered.NPK) #do the analysis on the centered data
summary(lm.mod) #show the moderated multiple regression
##
## Call:
## lm(formula = yield ~ N * P * K, data = centered.NPK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.133 -4.133 1.250 3.125 8.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.137e-15 1.131e+00 0.000 1.0000
## N 5.617e+00 2.263e+00 2.482 0.0245 *
## P -1.183e+00 2.263e+00 -0.523 0.6082
## K -3.983e+00 2.263e+00 -1.760 0.0975 .
## N:P -3.767e+00 4.526e+00 -0.832 0.4175
## N:K -4.700e+00 4.526e+00 -1.038 0.3145
## P:K 5.667e-01 4.526e+00 0.125 0.9019
## N:P:K 9.933e+00 9.052e+00 1.097 0.2887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.543 on 16 degrees of freedom
## Multiple R-squared: 0.4391, Adjusted R-squared: 0.1937
## F-statistic: 1.789 on 7 and 16 DF, p-value: 0.1586
summary(mod2) #compare with the original anova output
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.161 0.0245 *
## P 1 8.4 8.40 0.273 0.6082
## K 1 95.2 95.20 3.099 0.0975 .
## N:P 1 21.3 21.28 0.693 0.4175
## N:K 1 33.1 33.14 1.078 0.3145
## P:K 1 0.5 0.48 0.016 0.9019
## N:P:K 1 37.0 37.00 1.204 0.2887
## Residuals 16 491.6 30.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmCor does the centering automatically when finding product terms.
mod1 <- lmCor(yield ~ N * P * K, data = npk)
mod1 #show the results
## Call: lmCor(y = yield ~ N * P * K, data = npk)
##
## Multiple Regression from raw data
##
## DV = yield
## slope se t p lower.ci upper.ci VIF Vy.x
## (Intercept) 0.00 0.19 0.00 1.000 -0.40 0.40 1 0.00
## N 0.46 0.19 2.48 0.025 0.07 0.86 1 0.22
## P -0.10 0.19 -0.52 0.610 -0.49 0.30 1 0.01
## K -0.33 0.19 -1.76 0.097 -0.73 0.07 1 0.11
## N*P -0.16 0.19 -0.83 0.420 -0.55 0.24 1 0.02
## N*K -0.19 0.19 -1.04 0.310 -0.59 0.20 1 0.04
## P*K 0.02 0.19 0.13 0.900 -0.37 0.42 1 0.00
## N*P*K 0.21 0.19 1.10 0.290 -0.19 0.60 1 0.04
##
## Residual Standard Error = 0.9 with 16 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## yield 0.66 0.44 0.56 0.31 0.19 0.1 1.79 7 16 0.159
Once again, to do the analysis correctly, we need to center the data. Here we use the sat.act data and ask whether gender moderates the relationship between SATV and ACT.
cen.sat.act <- data.frame(scale(sat.act,scale=FALSE))
lm.mod2 <- lm(ACT ~ gender * SATV , data= cen.sat.act)
summary(lm.mod2)
##
## Call:
## lm(formula = ACT ~ gender * SATV, data = cen.sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4788 -2.3916 -0.0332 2.4952 15.6586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.005219 0.150824 -0.035 0.9724
## gender -0.255094 0.315645 -0.808 0.4193
## SATV 0.023913 0.001337 17.886 <2e-16 ***
## gender:SATV -0.005137 0.002785 -1.844 0.0655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.99 on 696 degrees of freedom
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.3159
## F-statistic: 108.6 on 3 and 696 DF, p-value: < 2.2e-16
There is a suggestion of an interaction between gender and SATV. To see this graphically, we draw the data (using a different plot character for males and females) and then draw two lines, one for males, and one for females. The interaction is a difference in the slopes of these regressions.
plot(ACT ~ SATV, pch= 21-gender, col=c("blue","red")[gender],data=sat.act,main="ACT scores differ by SATV and gender")
abline(lm(ACT ~ SATV,data = sat.act[sat.act$gender==1,]))
abline(lm(ACT ~ SATV,data = sat.act[sat.act$gender==2,]),lty="dashed")
# or another way of doing this is with the by statement
plot(ACT ~ SATV, pch= 21-gender, col=c("blue","red")[gender],data=sat.act,main="ACT scores differ by SATV and gender")
by(sat.act,sat.act$gender, function(x) abline(lm(ACT ~ SATV, data=x),lty=c("solid","dashed")[x$gender]))
## sat.act$gender: 1
## NULL
## ---------------------------------------------------------------------------
## sat.act$gender: 2
## NULL
by(sat.act,sat.act$gender,function(x) lm(ACT ~ SATV, data=x)) #this tells us which is which
## sat.act$gender: 1
##
## Call:
## lm(formula = ACT ~ SATV, data = x)
##
## Coefficients:
## (Intercept) SATV
## 12.03133 0.02724
##
## ---------------------------------------------------------------------------
## sat.act$gender: 2
##
## Call:
## lm(formula = ACT ~ SATV, data = x)
##
## Coefficients:
## (Intercept) SATV
## 14.9214 0.0221
text(220,17,"males")
text(220,22,"females")
Sometimes we want to take a categorical variable in which the categories are arbitrary (e.g., college major, ways of coping) and create a dummy code that has the value of 1 for people in that category and 0 for everyone else. Clearly we need at least 3 categories for this to make sense.
In the following, we will take the education variable from the spi data set, convert the education levels into 0/1 dummy codes, and then find those items that most correlate with each level of education.
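Before looking at the internals, here is a tiny illustration of dummy.code on a made-up vector (the categories are purely hypothetical, just to show the form of the output):
dummy.code(c("math", "psych", "math", "biology", "psych")) #returns a 0/1 matrix with one column per category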
We can look at the source code of the dummy.code function.
dummy.code #show the code for the function
## function (x, group = NULL, na.rm = TRUE, top = NULL, min = NULL)
## {
## t <- table(x)
## t <- sort(t, decreasing = TRUE)
## if (!is.null(min)) {
## top <- sum(t >= min)
## }
## if (is.null(top))
## top <- length(t)
## t <- t[1:top]
## lt <- length(t)
## n.obs <- length(x)
## if (is.null(group)) {
## new <- matrix(0, nrow = n.obs, ncol = lt)
## if (na.rm) {
## new[is.na(x), ] <- NA
## }
## xlev <- factor(x, levels = names(t))
## for (i in 1:n.obs) {
## new[i, xlev[i]] <- 1
## }
## colnames(new) <- names(t)
## }
## else {
## new <- rep(0, n.obs)
## xlev <- as.factor(x)
## if (na.rm) {
## new[is.na(x)] <- NA
## }
## for (i in 1:n.obs) {
## new[i] <- xlev[i] %in% group
## }
## }
## return(new)
## }
## <bytecode: 0x10bc3d0a8>
## <environment: namespace:psych>
We can examine each line of this code by trying it.
x <- spi$education
na.rm <- TRUE
t <- table(x) # Call the table function on x
t # we do not do this in the function, but we do it here to examine it
## x
## 1 2 3 4 5 6 7 8
## 394 369 1060 260 123 567 165 392
lt <- length(t) # how many different categories
Then we create a matrix in which to put the results.
n.obs <- length(x)
new <- matrix(0,nrow=n.obs,ncol=lt)
if(na.rm) {new[is.na(x),] <- NA } #added 10/20/17
headTail(new) #show the first and last 4 rows
## X1 X2 X3 X4 X5 X6 X7 X8
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## ... ... ... ... ... ... ... ... ...
## 3997 0 0 0 0 0 0 0 0
## 3998 0 0 0 0 0 0 0 0
## 3999 0 0 0 0 0 0 0 0
## 4000 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
xlev <- as.factor(x) #show the str of this
str(xlev) #because we will be using these to index the matrix
## Factor w/ 8 levels "1","2","3","4",..: NA 6 4 NA 1 4 7 2 8 8 ...
for (i in 1:n.obs) {new[i,xlev[i]] <- 1} #Do the work
headTail(new) #show the first and last 4 rows
## X1 X2 X3 X4 X5 X6 X7 X8
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 0 0 0 0 0 1 0 0
## 3 0 0 0 1 0 0 0 0
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## ... ... ... ... ... ... ... ... ...
## 3997 0 0 1 0 0 0 0 0
## 3998 0 1 0 0 0 0 0 0
## 3999 0 0 0 0 0 1 0 0
## 4000 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
colnames(new) <- names(t) #if we have names for the categories, they appear here
Now try this function on spi$education.
What are the units of education? We can see them by looking at the help file for the spi data set: respondents' education is coded as less than 12 years, HS grad, currently in university, some university, associate degree, college degree, in graduate/professional school, and graduate/professional degree.
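One way to make the numeric codes more readable (assuming the order of the labels in the help file matches the codes 1 through 8) is to attach them as factor labels:
edu.labels <- c("<12 yrs","HS grad","current univ","some univ","assoc degree","college degree","in grad/prof","grad/prof degree") #labels paraphrased from the help file
table(factor(spi$education, levels=1:8, labels=edu.labels)) #same counts as table(spi$education), now labeled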
edu.codes <- dummy.code(spi$education)
dim(edu.codes) #we have created the new variables
## [1] 4000 8
describe(edu.codes) #the means should add up to 1 across all the variables
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 3 1 3330 0.32 0.47 0 0.27 0 0 1 1 0.78 -1.39 0.01
## 6 2 3330 0.17 0.38 0 0.09 0 0 1 1 1.75 1.08 0.01
## 1 3 3330 0.12 0.32 0 0.02 0 0 1 1 2.36 3.58 0.01
## 8 4 3330 0.12 0.32 0 0.02 0 0 1 1 2.37 3.62 0.01
## 2 5 3330 0.11 0.31 0 0.01 0 0 1 1 2.48 4.14 0.01
## 4 6 3330 0.08 0.27 0 0.00 0 0 1 1 3.14 7.89 0.00
## 7 7 3330 0.05 0.22 0 0.00 0 0 1 1 4.15 15.22 0.00
## 5 8 3330 0.04 0.19 0 0.00 0 0 1 1 4.91 22.10 0.00
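As a quick sanity check, the column means of the dummy codes are the proportions of respondents in each category and therefore should sum to 1:
sum(colMeans(edu.codes, na.rm = TRUE)) #should be (very close to) 1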
Another function, bestScales, can use those dummy codes as criteria to find the items that most correlate with each category.
bs.edu <- bestScales(spi[-6],edu.codes,dictionary=spi.dictionary) #don't use education as a predictor
bs.edu
##
## Call = bestScales(x = spi[-6], criteria = edu.codes, dictionary = spi.dictionary)
## The items most correlated with the criteria yield r's of
## correlation n.items
## 3 0.28 1
## 6 0.22 1
## 1 0.35 10
## 8 0.42 6
## 2 0.13 3
## 4 0.16 4
## 7 NA 0
## 5 0.11 1
##
## The best items, their correlations and content are
## $`3`
## 3 item_id item item_scale resp_type B5 L27
## age -0.28 age age in years NA NA NA NA
##
## $`6`
## 6 item_id item item_scale resp_type B5 L27
## age 0.22 age age in years NA NA NA NA
##
## $`1`
## 1 item_id item item_scale resp_type B5
## age -0.33 age age in years NA NA NA
## smoke -0.14 smoke smoking (never ... > 20/day) NA NA NA
## q_578 0.13 q_578 Dislike myself. IPIP reg Neuro
## q_1505 0.13 q_1505 Panic easily. IPIP reg Neuro
## q_4296 0.13 q_4296 Tell a lot of lies. EPQ:P reg Agree
## q_952 0.13 q_952 Get angry easily. IPIP reg NA
## q_4249 0.13 q_4249 Would call myself a nervous person. EPQ:N reg Neuro
## q_501 0.11 q_501 Cheat to get ahead. IPIP reg Agree
## q_35 0.11 q_35 Act without thinking. IPIP reg NA
## q_811 0.11 q_811 Feel a sense of worthlessness or hopelessness. IPIP reg Neuro
## L27
## age NA
## smoke NA
## q_578 WellBeing
## q_1505 Anxiety
## q_4296 Honesty
## q_952 Irritability
## q_4249 Anxiety
## q_501 Honesty
## q_35 Impulsivity
## q_811 WellBeing
##
## $`8`
## 8 item_id item item_scale resp_type B5
## age 0.41 age age in years NA NA NA
## q_1132 0.15 q_1132 Have read the great literary classics. IPIP reg NA
## q_1024 -0.13 q_1024 Hang around doing nothing. IPIP reg NA
## q_422 0.12 q_422 Can handle a lot of information. IPIP reg Open
## q_4249 -0.11 q_4249 Would call myself a nervous person. EPQ:N reg Neuro
## q_1452 -0.10 q_1452 Neglect my duties. IPIP reg Consc
## L27
## age NA
## q_1132 ArtAppreciation
## q_1024 EasyGoingness
## q_422 Intellect
## q_4249 Anxiety
## q_1452 Industry
##
## $`2`
## 2 item_id item item_scale resp_type B5
## age -0.12 age age in years NA NA NA
## q_1694 -0.11 q_1694 Set high standards for myself and others. IPIP reg Consc
## p1edu -0.10 p1edu Parent 1 education NA NA NA
## L27
## age NA
## q_1694 Perfectionism
## p1edu NA
##
## $`4`
## 4 item_id item item_scale resp_type B5 L27
## smoke 0.16 smoke smoking (never ... > 20/day) NA NA NA NA
## age 0.12 age age in years NA NA NA NA
## q_2765 -0.11 q_2765 Am happy with my life. IPIP reg NA WellBeing
## health -0.11 health self rated health (poor ... excellent) NA NA NA NA
##
## $`7`
## [1] 7 y
## <0 rows> (or 0-length row.names)
##
## $`5`
## 5 item_id item item_scale resp_type B5 L27
## age 0.11 age age in years NA NA NA NA
The solution just listed is sensitive to sample variations. We can test how sensitive it is by using the bootstrap. Bootstrapping basically treats the observed sample as the population and then samples from that population multiple times (with replacement). This leads to a number of samples from the 'population' that differ somewhat from each other. By examining how much the results vary, we can get a degree of confidence in our results. We will do the bootstrap 100 times and then examine the pooled results.
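To see the logic of the bootstrap itself (this sketch is separate from bestScales, which handles its own resampling internally), we can resample the rows of a data set with replacement and watch how a simple statistic, such as a correlation, varies across resamples:
set.seed(42) #any seed will do; this is just an illustration
n.iter <- 100
boot.r <- rep(NA, n.iter)
for (i in 1:n.iter) {
  rows <- sample(nrow(sat.act), replace = TRUE) #a bootstrap sample of row indices
  boot.r[i] <- cor(sat.act$SATV[rows], sat.act$ACT[rows], use = "pairwise")
}
quantile(boot.r, c(.025, .975)) #an empirical 95 percent confidence interval for the correlation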
bs.edu <- bestScales(spi[-6],edu.codes,dictionary=spi.dictionary[1:2], n.iter=100)
bs.edu
##
## Call = bestScales(x = spi[-6], criteria = edu.codes, n.iter = 100, dictionary = spi.dictionary[1:2])
## derivation.mean derivation.sd validation.m validation.sd final.valid final.wtd N.wtd
## 3 0.29 0.0162 0.282 0.015 0.28 0.30 10
## 6 0.24 0.0187 0.224 0.021 0.22 0.21 10
## 1 0.36 0.0116 0.340 0.018 0.33 0.31 10
## 8 0.41 0.0168 0.402 0.026 0.42 0.36 10
## 2 0.15 0.0289 0.131 0.022 0.12 0.20 10
## 4 0.17 0.0324 0.149 0.027 0.16 0.21 10
## 7 0.11 0.0085 0.029 0.023 NA 0.14 10
## 5 0.12 0.0120 0.103 0.020 NA 0.14 10
##
## Best items on each scale with counts of replications
##
## Criterion = 3
## Freq mean.r sd.r item_id item
## age 100 -0.27 0.01 age age in years
##
## Criterion = 6
## Freq mean.r sd.r item_id item
## age 100 0.22 0.02 age age in years
##
## Criterion = 1
## Freq mean.r sd.r item_id item
## age 100 -0.33 0.01 age age in years
##
## Criterion = 8
## Freq mean.r sd.r item_id item
## age 100 0.41 0.02 age age in years
## q_1132 100 0.15 0.02 q_1132 Have read the great literary classics.
## q_1024 90 -0.12 0.02 q_1024 Hang around doing nothing.
##
## Criterion = 2
## Freq mean.r sd.r item_id item
## age 95 -0.12 0.01 age age in years
##
## Criterion = 4
## Freq mean.r sd.r item_id item
## smoke 100 0.16 0.02 smoke smoking (never ... > 20/day)
##
## Criterion = 7
## Freq mean.r sd.r item_id item
## q_871 8 -0.08 0.02 q_871 Feel that most people cant be trusted.
##
## Criterion = 5
## Freq mean.r sd.r item_id item
## age 67 0.11 0.02 age age in years