psych and psychTools have a few subtle modifications since Monday. The
version numbers have not changed, but the dates have.
#install.packages("psych",repos="http://personality-project.org/r", type="source")
#install.packages("psychTools",repos="http://personality-project.org/r", type="source")
library(psych) #make it active
library(psychTools)
sessionInfo() #psych should be 2.3.5 and psychTools should be 2.3.5
## R Under development (unstable) (2023-03-17 r83997)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
##
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] psychTools_2.3.5 psych_2.3.5
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 fastmap_1.1.1 xfun_0.38 lattice_0.20-45 cachem_1.0.7
## [7] parallel_4.3.0 knitr_1.42 foreign_0.8-84 htmltools_0.5.5 rmarkdown_2.21 cli_3.6.1
## [13] grid_4.3.0 sass_0.4.5 jquerylib_0.1.4 mnormt_2.1.1 compiler_4.3.0 rstudioapi_0.14
## [19] tools_4.3.0 nlme_3.1-162 evaluate_0.20 bslib_0.4.2 yaml_2.3.7 rlang_1.1.0
## [25] jsonlite_1.8.4
packageDate("psych"); packageDate("psychTools") #should be >= "2023-05-09"
## [1] "2023-05-14"
## [1] "2023-05-09"
The typical measure in psychology, whether of attitudes, temperament,
or ability, is composed of multiple items. This is partly because our
constructs are broad and we want to capture their various nuances. It
is also because the reliability of a sum (or average) of a set of items
is much higher than that of any single item (Revelle and Condon, 2019,
Reliability: from alpha to omega). See the class notes on classical
test theory.
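To see how much a composite helps, recall the Spearman-Brown prophecy
formula: a composite of k items whose average inter-item correlation is
r has reliability k*r/(1 + (k-1)*r). A minimal sketch (the value r = .3
is simply assumed for illustration):
r <- .3 #an assumed average inter-item correlation
k <- 1:12 #composites of 1 to 12 items
round(k*r/(1 + (k-1)*r),2) #reliability grows from .30 for 1 item to roughly .84 for 12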
Thus, we form item composites known as scales. The various steps to do
this are discussed in the How To on scoring scales linked here and in
the syllabus. That How To links to a number of other pages that are
meant to facilitate your work.
This Rmd and the HTML file it produces allow you to do each step yourself.
The first law of data analysis is that you must have some data. Choose an appropriate subject population, write items that you (and others) think measure the constructs, and then give these items to your participants. Make sure that they are engaged in doing the task.
In R, make psych and psychTools active:
library(psych)
library(psychTools)
Core R functions:
file.name <- file.choose()
my.data <- read.table(file.name)
psych functions:
my.data <- read.file() #locate the file to read using your normal system
or read from a remote file: e.g.,
file.name <-
"https://personality-project.org/r/psych/HowTo/scoring.tutorial/small.msq.txt"
my.data <- read.file(file.name)
## Data from the .txt file https://personality-project.org/r/psych/HowTo/scoring.tutorial/small.msq.txt has been loaded.
### checking basic qualities of the data
dim(my.data) #how many rows (subjects) and columns (variables)
## [1] 200 14
headTail(my.data) #show the first and last few rows
## active alert aroused sleepy tired drowsy anxious jittery nervous calm relaxed at.ease gender drug
## 1 2 2 0 0 1 1 1 1 0 2 2 2 2 2
## 2 1 1 0 2 2 2 1 0 1 2 2 2 2 1
## 3 1 1 1 3 3 2 2 0 1 2 2 2 1 1
## 4 1 2 0 1 1 0 0 0 0 2 2 2 2 2
## ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
## 197 0 0 0 3 2 2 1 0 0 1 2 1 <NA> <NA>
## 198 2 0 1 3 3 3 0 0 1 2 3 3 <NA> <NA>
## 199 1 1 1 1 1 0 0 0 0 3 3 3 <NA> <NA>
## 200 0 1 0 3 3 3 0 0 0 3 2 1 <NA> <NA>
describe(my.data) #basic descriptive statistics
## vars n mean sd median trimmed mad min max range skew kurtosis se
## active 1 200 0.86 0.84 1.0 0.78 1.48 0 3 3 0.62 -0.45 0.06
## alert 2 199 0.97 0.85 1.0 0.90 1.48 0 3 3 0.54 -0.42 0.06
## aroused 3 200 0.52 0.76 0.0 0.38 0.00 0 3 3 1.30 0.92 0.05
## sleepy 4 198 1.60 1.08 2.0 1.62 1.48 0 3 3 -0.05 -1.31 0.08
## tired 5 200 1.78 1.01 2.0 1.86 1.48 0 3 3 -0.30 -1.04 0.07
## drowsy 6 200 1.50 1.08 1.0 1.49 1.48 0 3 3 0.02 -1.28 0.08
## anxious 7 200 0.60 0.83 0.0 0.45 0.00 0 3 3 1.31 1.05 0.06
## jittery 8 200 0.66 0.87 0.0 0.50 0.00 0 3 3 1.22 0.63 0.06
## nervous 9 200 0.44 0.77 0.0 0.25 0.00 0 3 3 1.81 2.63 0.05
## calm 10 200 1.55 0.91 2.0 1.57 1.48 0 3 3 -0.06 -0.80 0.06
## relaxed 11 199 1.67 0.86 2.0 1.71 1.48 0 3 3 -0.20 -0.63 0.06
## at.ease 12 198 1.40 0.96 1.0 1.38 1.48 0 3 3 0.08 -0.96 0.07
## gender 13 136 1.54 0.50 2.0 1.55 0.00 1 2 1 -0.18 -1.98 0.04
## drug 14 136 1.50 0.50 1.5 1.50 0.74 1 2 1 0.00 -2.01 0.04
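Note that n varies from variable to variable (199 for alert, 136 for gender and drug): some responses are missing. A quick check using base R:
colSums(is.na(my.data)) #count the missing values in each column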
The first 12 variables are clearly affective/mood variables. Should
we form one, two, or many scales from these data? One solution is to
examine the data in terms of their dimensionality. fa
will
do a factor analysis.
We can compare a 1 factor solution to a 2 factor solution.
f1 <- fa(my.data[,1:12]) #just look at the mood variables
f1
## Factor Analysis using method = minres
## Call: fa(r = my.data[, 1:12])
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 h2 u2 com
## active 0.53 0.282 0.72 1
## alert 0.64 0.414 0.59 1
## aroused 0.55 0.304 0.70 1
## sleepy -0.71 0.499 0.50 1
## tired -0.65 0.421 0.58 1
## drowsy -0.65 0.416 0.58 1
## anxious 0.53 0.281 0.72 1
## jittery 0.61 0.375 0.62 1
## nervous 0.46 0.208 0.79 1
## calm -0.37 0.138 0.86 1
## relaxed -0.29 0.087 0.91 1
## at.ease -0.29 0.084 0.92 1
##
## MR1
## SS loadings 3.51
## Proportion Var 0.29
##
## Mean item complexity = 1
## Test of the hypothesis that 1 factor is sufficient.
##
## df null model = 66 with the objective function = 6.17 with Chi Square = 1198.51
## df of the model are 54 and the objective function was 3.65
##
## The root mean square of the residuals (RMSR) is 0.21
## The df corrected root mean square of the residuals is 0.23
##
## The harmonic n.obs is 199 with the empirical chi square 1119.48 with prob < 5.9e-199
## The total n.obs was 200 with Likelihood Chi Square = 707.08 with prob < 1.4e-114
##
## Tucker Lewis Index of factoring reliability = 0.293
## RMSEA index = 0.246 and the 90 % confidence intervals are 0.231 0.263
## BIC = 420.97
## Fit based upon off diagonal values = 0.66
## Measures of factor score adequacy
## MR1
## Correlation of (regression) scores with factors 0.92
## Multiple R square of scores with factors 0.85
## Minimum correlation of possible factor scores 0.69
The one factor solution does not fit very well (RMSEA = 0.246, TLI = 0.29). Compare this to a two factor solution:
f2 <- fa(my.data[,1:12],2) #specify that we want two factors
## Loading required namespace: GPArotation
f2
## Factor Analysis using method = minres
## Call: fa(r = my.data[, 1:12], nfactors = 2)
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 h2 u2 com
## active 0.65 0.07 0.41 0.59 1.0
## alert 0.73 0.00 0.53 0.47 1.0
## aroused 0.52 -0.13 0.30 0.70 1.1
## sleepy -0.77 0.06 0.60 0.40 1.0
## tired -0.79 -0.05 0.61 0.39 1.0
## drowsy -0.76 -0.03 0.58 0.42 1.0
## anxious 0.25 -0.54 0.39 0.61 1.4
## jittery 0.34 -0.54 0.46 0.54 1.7
## nervous 0.13 -0.62 0.43 0.57 1.1
## calm 0.01 0.72 0.52 0.48 1.0
## relaxed 0.12 0.75 0.55 0.45 1.0
## at.ease 0.13 0.76 0.56 0.44 1.1
##
## MR1 MR2
## SS loadings 3.25 2.70
## Proportion Var 0.27 0.22
## Cumulative Var 0.27 0.50
## Proportion Explained 0.55 0.45
## Cumulative Proportion 0.55 1.00
##
## With factor correlations of
## MR1 MR2
## MR1 1.00 -0.15
## MR2 -0.15 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## df null model = 66 with the objective function = 6.17 with Chi Square = 1198.51
## df of the model are 43 and the objective function was 1.65
##
## The root mean square of the residuals (RMSR) is 0.09
## The df corrected root mean square of the residuals is 0.11
##
## The harmonic n.obs is 199 with the empirical chi square 223.16 with prob < 3.6e-26
## The total n.obs was 200 with Likelihood Chi Square = 319.12 with prob < 7.6e-44
##
## Tucker Lewis Index of factoring reliability = 0.623
## RMSEA index = 0.179 and the 90 % confidence intervals are 0.161 0.198
## BIC = 91.29
## Fit based upon off diagonal values = 0.93
## Measures of factor score adequacy
## MR1 MR2
## Correlation of (regression) scores with factors 0.94 0.92
## Multiple R square of scores with factors 0.87 0.84
## Minimum correlation of possible factor scores 0.75 0.69
fa.plot(f2, labels = colnames(my.data)[1:12])
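Rather than just comparing the fits of the two solutions, we can also ask the data how many factors to extract. Parallel analysis compares the observed eigen values with those from random data; a minimal sketch using fa.parallel:
fa.parallel(my.data[,1:12]) #scree of the mood items with simulated and resampled data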
Create scoring keys for a 1 and 2 factor solution
In order to score the data, we need to specify which items we want to use and in which direction to score them.
We can do this by hand, or use a simple function (see the factor2cluster sketch after the keys are listed below). Let's do it by hand, calling the two factors energy and tension.
keys <- list(onefactor = cs(active,alert, aroused, -sleepy,-tired, -drowsy,anxious, jittery,nervous, -calm,-relaxed, -at.ease),
energy=cs(active,alert, aroused, -sleepy,-tired, -drowsy),
tension =cs(anxious, jittery,nervous, -calm,-relaxed, -at.ease))
keys
## $onefactor
## [1] "active" "alert" "aroused" "-sleepy" "-tired" "-drowsy" "anxious" "jittery" "nervous"
## [10] "-calm" "-relaxed" "-at.ease"
##
## $energy
## [1] "active" "alert" "aroused" "-sleepy" "-tired" "-drowsy"
##
## $tension
## [1] "anxious" "jittery" "nervous" "-calm" "-relaxed" "-at.ease"
We can use these keys to score the items. There are several functions
that will do this. scoreFast
just finds the mean for each
set of items. scoreItems
does this as well, but also reports various
statistics. scoreOverlap
will not find scores, but will
find the correlations of the scales, correcting for item overlap.
For each of these functions, we specify the scoring keys and the
data.
simple.scores <- scoreFast(keys,my.data)
dim(simple.scores)
## [1] 200 3
scales <- scoreItems(keys,my.data) #more complicated output
scales #does not correct for overlapping items
## Call: scoreItems(keys = keys, items = my.data)
##
## (Unstandardized) Alpha:
## onefactor energy tension
## alpha 0.82 0.85 0.83
##
## Standard errors of unstandardized Alpha:
## onefactor energy tension
## ASE 0.027 0.033 0.037
##
## Average item correlation:
## onefactor energy tension
## average.r 0.28 0.49 0.44
##
## Median item correlation:
## onefactor energy tension
## 0.27 0.48 0.40
##
## Guttman 6* reliability:
## onefactor energy tension
## Lambda.6 0.89 0.89 0.85
##
## Signal/Noise based upon av.r :
## onefactor energy tension
## Signal/Noise 4.6 5.9 4.7
##
## Scale intercorrelations corrected for attenuation
## raw correlations below the diagonal, alpha on the diagonal
## corrected correlations above the diagonal:
## onefactor energy tension
## onefactor 0.82 0.97 0.91
## energy 0.81 0.85 0.25
## tension 0.75 0.21 0.83
##
## Average adjusted correlations within and between scales (MIMS)
## onfct enrgy tensn
## onefactor 0.28
## energy 0.28 0.49
## tension 0.22 0.10 0.44
##
## Average adjusted item x scale correlations within and between scales (MIMT)
## onfct enrgy tensn
## onefactor 0.58
## energy 0.61 0.76
## tension 0.55 0.16 0.73
##
## In order to see the item by scale loadings and frequency counts of the data
## print with the short option = FALSE
names(scales) #many objects are returned
## [1] "scores" "missing" "alpha" "av.r" "sn" "n.items"
## [7] "item.cor" "cor" "corrected" "G6" "item.corrected" "response.freq"
## [13] "raw" "ase" "med.r" "keys" "MIMS" "MIMT"
## [19] "Call"
scales.scores <- scales$scores # get the scores
cor2(simple.scores,scales.scores)
## onefactor energy tension
## onefactor-A 1.00 0.81 0.74
## energy-A 0.81 1.00 0.21
## tension-A 0.75 0.22 1.00
overlap <- scoreOverlap(keys, my.data) #corrects for item overlap
overlap
## Call: scoreOverlap(keys = keys, r = my.data)
##
## (Standardized) Alpha:
## onefactor energy tension
## 0.82 0.85 0.83
##
## (Standardized) G6*:
## onefactor energy tension
## 0.74 0.68 0.68
##
## Average item correlation:
## onefactor energy tension
## 0.28 0.49 0.44
##
## Median item correlation:
## onefactor energy tension
## 0.28 0.49 0.40
##
## Number of items:
## onefactor energy tension
## 12 6 6
##
## Signal to Noise ratio based upon average r and n
## onefactor energy tension
## 4.7 5.8 4.8
##
## Scale intercorrelations corrected for item overlap and attenuation
## adjusted for overlap correlations below the diagonal, alpha on the diagonal
## corrected correlations above the diagonal:
## onefactor energy tension
## onefactor 0.82 0.80 0.78
## energy 0.67 0.85 0.26
## tension 0.65 0.22 0.83
##
## Percentage of keyed items with highest absolute correlation with scale (scale quality)
## onefactor energy tension
## 0.083 1.000 0.833
##
## Average adjusted correlations within and between scales (MIMS)
## onfct enrgy tensn
## onefactor 0.28
## energy 0.30 0.49
## tension 0.28 0.12 0.44
##
## Average adjusted item x scale correlations within and between scales (MIMT)
## onfct enrgy tensn
## onefactor 0.60
## energy 0.63 0.81
## tension 0.58 0.20 0.75
##
## In order to see the item by scale loadings and frequency counts of the data
## print with the short option = FALSE
names(overlap)
## [1] "cor" "sd" "corrected" "alpha" "av.r" "size" "sn" "G6"
## [9] "item.cor" "med.r" "quality" "MIMS" "MIMT" "Call"
scales$cor #the cor object of scales
## onefactor energy tension
## onefactor 1.0000000 0.8090324 0.7457055
## energy 0.8090324 1.0000000 0.2116871
## tension 0.7457055 0.2116871 1.0000000
overlap$cor #the cor object of overlap
## onefactor energy tension
## onefactor 1.0000000 0.6719402 0.6464668
## energy 0.6719402 1.0000000 0.2216840
## tension 0.6464668 0.2216840 1.0000000
lowerCor(scales$scores) #the observed correlations
## onfct enrgy tensn
## onefactor 1.00
## energy 0.81 1.00
## tension 0.75 0.21 1.00
We will join the scales.scores output with the original data (I am
using cbind to get around a problem with vJoin).
scales.drugs <- cbind(scales.scores,my.data[,13:14])
describe(scales.drugs)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## onefactor 1 200 1.04 0.53 1.00 1.00 0.49 0 2.83 2.83 0.85 1.12 0.04
## energy 2 200 1.08 0.72 1.08 1.05 0.86 0 3.00 3.00 0.36 -0.43 0.05
## tension 3 200 1.01 0.63 1.00 0.96 0.49 0 3.00 3.00 0.88 0.81 0.04
## gender 4 136 1.54 0.50 2.00 1.55 0.00 1 2.00 1.00 -0.18 -1.98 0.04
## drug 5 136 1.50 0.50 1.50 1.50 0.74 1 2.00 1.00 0.00 -2.01 0.04
cd <-cohen.d(scales.drugs ~ drug)
error.dots(cd)
Although the energy and tension scales are independent, they are both affected by the drug manipulation.
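To see the effect sizes numerically as well as graphically, we can simply print the cohen.d object created above:
cd #the standardized mean differences of each scale by drug condition, with confidence intervals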
We can use a technique known as factor extension
to see
how the drug effect fits into the two dimensional space of the items.
Factor extension was originally developed to handle the case of variables not measured in the original factor analysis.
The fa.extend
function allows us to do this.
f2e <- fa.extend(my.data, 2, ov =1:12, ev =14)
f2e #show it
## Factor Analysis using method = minres
## Call: fa.extend(r = my.data, nfactors = 2, ov = 1:12, ev = 14)
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 h2 u2 com
## active 0.65 0.07 0.41 0.59 1.0
## alert 0.73 0.00 0.53 0.47 1.0
## aroused 0.52 -0.13 0.30 0.70 1.1
## sleepy -0.77 0.06 0.60 0.40 1.0
## tired -0.79 -0.05 0.61 0.39 1.0
## drowsy -0.76 -0.03 0.58 0.42 1.0
## anxious 0.25 -0.54 0.39 0.61 1.4
## jittery 0.34 -0.54 0.46 0.54 1.7
## nervous 0.13 -0.62 0.43 0.57 1.1
## calm 0.01 0.72 0.52 0.48 1.0
## relaxed 0.12 0.75 0.55 0.45 1.0
## at.ease 0.13 0.76 0.56 0.44 1.1
## drug 0.40 -0.19 0.22 0.78 1.4
##
## MR1 MR2
## SS loadings 3.43 2.74
## Proportion Var 0.26 0.21
## Cumulative Var 0.26 0.47
## Proportion Explained 0.56 0.44
## Cumulative Proportion 0.56 1.00
##
## With factor correlations of
## MR1 MR2
## MR1 1.00 -0.15
## MR2 -0.15 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## df null model = 78 with the objective function = 6.46 with Chi Square = 1251.6
## df of the model are 53 and the objective function was 1.71
##
## The root mean square of the residuals (RMSR) is 0.09
## The df corrected root mean square of the residuals is 0.11
##
## The harmonic n.obs is 186 with the empirical chi square 228.55 with prob < 1.2e-23
## The total n.obs was 200 with Likelihood Chi Square = 330.08 with prob < 1.1e-41
##
## Tucker Lewis Index of factoring reliability = 0.65
## RMSEA index = 0.162 and the 90 % confidence intervals are 0.146 0.179
## BIC = 49.27
## Fit based upon off diagonal values = 0.94
## Measures of factor score adequacy
## MR1 MR2
## Correlation of (regression) scores with factors 0.94 0.92
## Multiple R square of scores with factors 0.88 0.85
## Minimum correlation of possible factor scores 0.76 0.69
fa.plot(f2e, labels = colnames(my.data)) #plot it
diagram(f2e, ic=TRUE ) #another way to show it
First, a little function to draw reliability and validity as target shooting:
"bullseye" <- function(x,y,n) {
for(i in 1:n) {dia.ellipse(x,y,e.size=i)}}
"rel.val" <- function(n,sdx=.2,bars=TRUE,arrow.len=.05) {
if(n>20) {pc <- "."} else {pc <- 16}
plot(NA,xlim=c(0,10),ylim=c(0,10),axes=FALSE,xlab="",ylab="",main="Reliability and Validity as target shooting")
#Reliable and valid
x=3
y=2
bullseye(x,y,4)
x1 <- x + rnorm(n,0,sdx)
y1 <- y + rnorm(n,0,sdx)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Reliable and valid")
#unReliable and invalid
x=7
y=7
bullseye(x,y,4)
x1 <- x + rnorm(n,1,1)
y1 <- y + rnorm(n,1,1)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Unreliable and Invalid")
#reliable and invalid
x=7
y=2
bullseye(x,y,4)
x1 <- x + rnorm(n,1,sdx)
y1 <- y + rnorm(n,1,sdx)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars)error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Reliable and Invalid")
#unreliable, but valid
x=3
y=7
bullseye(x,y,4)
x1 <- x + rnorm(n,0,1)
y1 <- y + rnorm(n,0,1)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Unreliable but Valid")
}
rel.val(6)
The spi
data set has 135 personality items to measure
the Big 5
as well as a set of 27 lower level
factors. There are also 10 criteria.
We can score the Big 5 scales and then use these to predict the criteria using linear regression.
keys <- spi.keys[1:5] #just use the big 5 keys
b5 <- scoreFast(keys,spi)
# use them to predict criteria
b5.crit <- vJoin(spi[1:10], b5)
colnames(b5.crit)
## [1] "age" "sex" "health" "p1edu" "p2edu" "education" "wellness" "exer"
## [9] "smoke" "ER" "Agree-A" "Consc-A" "Neuro-A" "Extra-A" "Open-A"
Now apply a linear regression, predicting the first 10 variables from the big 5 scores. Do not show the plot of the results.
model <- lmCor(y=1:10,x=11:15, data=b5.crit, plot=FALSE)
summary(model)
##
## Multiple Regression from raw data
## lmCor(y = 1:10, x = 11:15, data = b5.crit, plot = FALSE)
##
## Multiple Regression from matrix input
##
## Beta weights
## age sex health p1edu p2edu education wellness exer smoke ER
## (Intercept) 0.00 0.000 0.0000 0.000 0.000 0.000 0.0000 0.0000 0.000 0.000
## Agree-A 0.16 0.162 0.0063 0.015 0.014 0.116 0.0631 -0.0053 -0.083 -0.025
## Consc-A 0.13 0.103 0.1715 -0.034 -0.049 0.065 0.1053 0.1613 -0.082 0.016
## Neuro-A -0.14 0.286 -0.2721 -0.036 -0.033 -0.147 0.0302 -0.1247 0.058 0.131
## Extra-A -0.11 0.086 0.1436 0.047 0.061 -0.086 0.0918 0.0876 0.084 0.050
## Open-A 0.12 -0.122 0.0126 0.058 0.057 0.142 0.0031 0.0675 0.090 -0.012
##
## Multiple R
## age sex health p1edu p2edu education wellness exer smoke ER
## 0.306 0.360 0.405 0.098 0.109 0.264 0.170 0.267 0.181 0.133
##
## Multiple R2
## age sex health p1edu p2edu education wellness exer smoke ER
## 0.0939 0.1296 0.1642 0.0096 0.0118 0.0699 0.0288 0.0711 0.0329 0.0176
##
## Cohen's set correlation R2
## [1] 0.4
##
## Squared Canonical Correlations
## [1] 0.2394 0.1332 0.0620 0.0298 0.0079 0.0000
Organize the results in a better way: sort by R2.
ord <- order(model$R2)
model.s <- lmCor(y=ord, x=11:15,data=b5.crit,plot=FALSE)
summary(model.s)
##
## Multiple Regression from raw data
## lmCor(y = ord, x = 11:15, data = b5.crit, plot = FALSE)
##
## Multiple Regression from matrix input
##
## Beta weights
## p1edu p2edu ER wellness smoke education exer age sex health
## (Intercept) 0.000 0.000 0.000 0.0000 0.000 0.000 0.0000 0.00 0.000 0.0000
## Agree-A 0.015 0.014 -0.025 0.0631 -0.083 0.116 -0.0053 0.16 0.162 0.0063
## Consc-A -0.034 -0.049 0.016 0.1053 -0.082 0.065 0.1613 0.13 0.103 0.1715
## Neuro-A -0.036 -0.033 0.131 0.0302 0.058 -0.147 -0.1247 -0.14 0.286 -0.2721
## Extra-A 0.047 0.061 0.050 0.0918 0.084 -0.086 0.0876 -0.11 0.086 0.1436
## Open-A 0.058 0.057 -0.012 0.0031 0.090 0.142 0.0675 0.12 -0.122 0.0126
##
## Multiple R
## p1edu p2edu ER wellness smoke education exer age sex health
## 0.098 0.109 0.133 0.170 0.181 0.264 0.267 0.306 0.360 0.405
##
## Multiple R2
## p1edu p2edu ER wellness smoke education exer age sex health
## 0.0096 0.0118 0.0176 0.0288 0.0329 0.0699 0.0711 0.0939 0.1296 0.1642
##
## Cohen's set correlation R2
## [1] 0.4
##
## Squared Canonical Correlations
## [1] 0.2394 0.1332 0.0620 0.0298 0.0079 0.0000
Any data analytic technique will best fit the observed data. But what happens when we apply it to a new data set? This is the process of cross validation.
One typical technique is to split the sample into two parts, derive the model on one part, and then apply that model to the second part.
The crossValidation
function will help us do this.
First, we divide the sample into two random halves, derive the model on the first half, and then apply it to the second half.
n.obs <- NROW(spi)
set.seed(42) #for reproducible results
ss <- sample(n.obs, n.obs/2) #sampling without replacement
derivation.model <- model <- lmCor(y=ord,x=11:15, data=b5.crit[ss,], plot=FALSE)
summary(derivation.model)
##
## Multiple Regression from raw data
## lmCor(y = ord, x = 11:15, data = b5.crit[ss, ], plot = FALSE)
##
## Multiple Regression from matrix input
##
## Beta weights
## p1edu p2edu ER wellness smoke education exer age sex health
## (Intercept) 0.000 0.000 0.0000 0.000 0.000 0.000 0.000 0.00 0.000 0.000
## Agree-A 0.021 0.018 -0.0176 0.054 -0.071 0.103 -0.020 0.14 0.161 -0.037
## Consc-A -0.061 -0.067 0.0091 0.081 -0.095 0.062 0.194 0.12 0.088 0.166
## Neuro-A -0.016 -0.029 0.1377 0.036 0.066 -0.155 -0.145 -0.13 0.299 -0.263
## Extra-A 0.053 0.072 0.0417 0.084 0.077 -0.085 0.095 -0.10 0.088 0.163
## Open-A 0.072 0.078 -0.0259 0.028 0.082 0.142 0.079 0.14 -0.118 0.015
##
## Multiple R
## p1edu p2edu ER wellness smoke education exer age sex health
## 0.11 0.13 0.14 0.15 0.18 0.26 0.31 0.30 0.36 0.39
##
## Multiple R2
## p1edu p2edu ER wellness smoke education exer age sex health
## 0.013 0.018 0.020 0.021 0.033 0.069 0.093 0.088 0.132 0.155
##
## Cohen's set correlation R2
## [1] 0.41
##
## Squared Canonical Correlations
## [1] 0.2455 0.1230 0.0638 0.0388 0.0095 0.0000
#now cross validate
cross.v <- crossValidation(derivation.model, data=b5.crit[-ss,])
summary(cross.v)
## Cross Validation
## Call:crossValidation(model = derivation.model, data = b5.crit[-ss,
## ])
##
## Validities from raw items and from the correlation matrix
## Number of unique predictors used = 5
## items mat
## p1edu 0.072 0.072
## p2edu 0.082 0.082
## ER 0.122 0.122
## wellness 0.189 0.187
## smoke 0.179 0.180
## education 0.262 0.267
## exer 0.227 0.226
## age 0.313 0.313
## sex 0.355 0.355
## health 0.408 0.411
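To see the shrinkage directly, we can line up the derivation multiple Rs with their cross validated values (a quick sketch using the objects created above, which keep the criteria in the same ord order):
round(data.frame(derivation = derivation.model$R, cross.validated = cross.v$crossV$items),2)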
These cross validated values are pretty good. But what would happen if we had smaller samples for the derivation set?
Widaman and Revelle (2023) have discussed this problem in a comparison with factor scores. Here we take their code and apply it just to the regression problem.
selectB5 <- selectFromKeys(spi.keys[1:5])
b5 <- scoreItems(spi.keys[1:5], spi)
#Create samples of 250 and 1000
#find factor scores and unit weighted scores
#use these to find multiple R
#then, use the empirical factor weights from each sample to cross validate to the rest
big.list <- list()
N=250
N1=1000
set.seed(42) #this makes for reproducible results
n.iter<- 100
for (i in 1: n.iter) {
ss <- sample(NROW(spi),NROW(spi))
lastss <- length(ss)
f5.samp.250 <- fa(spi[ss[1:N],selectB5],5) # based upon the first 250
f5.samp.1000 <- fa(spi[ss[1:N1],selectB5],5) # based upon the first 1000
b5.scores <- scoreItems(spi.keys[1:5],spi[ss,])$scores
b5.samp.250 <- b5.scores[1:N,]
b5.samp.1000 <- b5.scores[1:N1, ]
#now find the validities of these 4 sets of 5 scales
df250 <- cbind(spi[ss[1:N],1:10], f5.samp.250$scores, b5.samp.250)
df1000 <- cbind(spi[ss[1:N1],1:10], f5.samp.1000$scores, b5.samp.1000)
model.f5.250 <- lmCor(1:10,11:15, data=df250, plot=FALSE)
model.b5.250 <- lmCor(1:10,16:20, data=df250, plot=FALSE)
model.f5.1000 <- lmCor(1:10,11:15, data=df1000, plot=FALSE)
model.b5.1000 <- lmCor(1:10,16:20, data=df1000, plot=FALSE)
derivation <- data.frame(model.f5.250$R,model.b5.250$R,model.f5.1000$R,model.b5.1000$R)
#now, cross validate
f5s.scores.250 <- factor.scores(spi[ss[(N+1):lastss], selectB5],f5.samp.250)$scores #scores for the holdout cases, using the weights from the first 250
f5s.scores.1000 <- factor.scores(spi[ss[(N1+1):lastss],selectB5],f5.samp.1000)$scores #scores for the holdout cases, using the weights from the first 1000
b5s.scores.250 <- b5.scores[(N+1):lastss,] #they are already randomized
b5s.scores.1000 <- b5.scores[(N1+1):lastss,]
dfs.250 <- cbind(spi[ss[(N+1):lastss],1:10],f5s.scores.250,b5s.scores.250)
dfs.1000 <- cbind(spi[ss[(N1+1):lastss],1:10], f5s.scores.1000,b5s.scores.1000)
#but we want to use the derivation model when we cross validate, otherwise it is not cross validation
#use the crossValidation function
model.f5.250c <- crossValidation(model.f5.250, data=dfs.250)
model.b5.250c <- crossValidation(model.b5.250, data=dfs.250)
model.f5.1000c <- crossValidation(model.f5.1000, data=dfs.1000)
model.b5.1000c <- crossValidation(model.b5.1000, data=dfs.1000)
#combine the results
validation <- data.frame(cvf.250=model.f5.250c$crossV$items,
cvu.250=model.b5.250c$crossV$items,
cvf.1000=model.f5.1000c$crossV$items,
cvu.1000=model.b5.1000c$crossV$items)
big.list[[i]] <- cbind(derivation,validation)
}
#We have done the derivation and replications for 100 samples.
#now, take the 100 replications and summarize them
sum.df <- matrix(unlist(big.list),byrow=TRUE,ncol=80)
sum.df <- as.data.frame(sum.df)
dv <- cs(f250,u250,f1000,u1000,cf250,cu250,cf1000,cu1000)
dv8 <- rep(dv,each=10)
colnames(sum.df) <- paste0(rep(colnames(spi)[1:10],8),dv8)
dd <- describe(sum.df)
ord <- order(dd$mean[71:80],decreasing=FALSE)
#now, repeat this reordering for the 8 cases
#ord is now the order of the best to worst unit weighted cross validated values
#error.bars(sum.df[ord])
# error.bars(sum.df[ord+10],add=TRUE)
#organize for a table
sort.lab <- colnames(spi)[ord]
new.ord <- rep(ord,8) + rep(c(0,10,20,30,40,50, 60,70),each=10)
new.dd <- dd[new.ord,]
sort.lab20 <- c(sort.lab,sort.lab)
wide.dd <- data.frame(sort.lab20,new.dd[c(1:10,41:50),3:4], new.dd[c(11:20,51:60),3:4], new.dd[c(21:30,61:70),3:4],new.dd[c(31:40,71:80),3:4])
rn <- paste0(rep(cs(Derivation,Validation),each=10),colnames(spi)[ord])
rownames(wide.dd)<- rn
colnames(wide.dd )<- c("Variable","Mean F 250","SD F 250","Mean U 250", "SD U 250","Mean F 1000","SD F 1000",
"Mean U 1000", "SD U 1000")
#first draw the derivation R for the four conditions, then draw the cross validated R (separate graphs)
error.bars(sum.df[ord+30],add=FALSE,lty="solid",type="l",labels=sort.lab,col=rep("black",10 ),main="Multiple R for 10 criteria on derivation sample", xlab="", ylab="Cross validated R", ylim=c(0.05,.45) , sd=FALSE) #unit weighted 1000
## Warning in axis(1, 1:z, names, ...): graphical parameter "type" is obsolete
#factor scores 1000
error.bars(sum.df[ord+20],add=TRUE,lty="dashed",type="l",labels=sort.lab,col=rep("blue",10 ), ylim=c(0.05,.45), sd= FALSE)
#factor scores 250
error.bars(sum.df[ord],add=TRUE,lty="dotted",type="l",labels=sort.lab,col=rep("green",10 ),sd= FALSE)
#unit weighted 250
error.bars(sum.df[ord+10],add=TRUE,lty="dashed",type="l",col=rep("red",10), sd= FALSE)
#cross validated R
error.bars(sum.df[ord+70],add=FALSE,lty="solid",type="l",labels=sort.lab,col=rep("black",10 ),main="Cross validated Multiple R for 10 criteria", xlab="", ylab="Cross validated R", ylim=c(0.0,.45) , sd=FALSE) #unit weighted 1000
## Warning in axis(1, 1:z, names, ...): graphical parameter "type" is obsolete
#factor scores 1000
error.bars(sum.df[ord+60],add=TRUE,lty="dashed",type="l",labels=sort.lab,col=rep("blue",10 ), ylim=c(0.00,.45), sd= FALSE)
#factor scores 250
error.bars(sum.df[ord+40],add=TRUE,lty="dotted",type="l",labels=sort.lab,col=rep("green",10 ),sd= FALSE)
#unit weighted 250
error.bars(sum.df[ord+50],add=TRUE,lty="dashed",type="l",col=rep("red",10), sd= FALSE)
legend("topleft",c( "Unit weighted 1000", "Unit weighed 250", "Factor scores 1000", "Factor scores 250"),
lty=cs(solid,dashed,dashed,dotted), col=cs(black,red,blue,green))