psych and psychTools have a few subtle modifications since Monday. The
version numbers have not changed, but the dates have.
#install.packages("psych",repos="http://personality-project.org/r", type="source")
#install.packages("psychTools",repos="http://personality-project.org/r", type="source")
library(psych) #make it active
library(psychTools)
sessionInfo() #psych should be 2.3.5 and psychTools should be 2.3.5
## R Under development (unstable) (2023-03-17 r83997)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
##
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] psychTools_2.3.5 psych_2.3.5
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 fastmap_1.1.1 xfun_0.38 lattice_0.20-45 cachem_1.0.7
## [7] parallel_4.3.0 knitr_1.42 foreign_0.8-84 htmltools_0.5.5 rmarkdown_2.21 cli_3.6.1
## [13] grid_4.3.0 sass_0.4.5 jquerylib_0.1.4 mnormt_2.1.1 compiler_4.3.0 rstudioapi_0.14
## [19] tools_4.3.0 nlme_3.1-162 evaluate_0.20 bslib_0.4.2 yaml_2.3.7 rlang_1.1.0
## [25] jsonlite_1.8.4
packageDate("psych"); packageDate("psychTools") #should be >= "2023-05-09"
## [1] "2023-05-14"
## [1] "2023-05-09"
The typical measure in psychology, whether of attitudes, temperament,
or ability, is composed of multiple items. This is partly because our
constructs are broad and we want to capture their various nuances. It
is also because the reliability of a sum (or average) of a set of items
is much higher than that of any single item (Revelle and Condon, 2019,
Reliability: from alpha to omega). See the class notes on classical
test theory.
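To see how much a composite helps, recall the Spearman-Brown prophecy
formula: a composite of k items whose average inter-item correlation is
r has reliability k*r/(1 + (k-1)*r). A minimal sketch (the value r = .3
is simply assumed for illustration):
r <- .3 #an assumed average inter-item correlation
k <- 1:12 #composites of 1 to 12 items
round(k*r/(1 + (k-1)*r),2) #reliability grows from .30 for 1 item to roughly .84 for 12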
Thus, we form item composites known as scales. The various steps to do
this are discussed in the How To on scoring scales linked here and in
the syllabus. That How To links to a number of other pages that are
meant to facilitate your work.
This Rmd and the HTML file it produces allow you to do each step yourself.
The first law of data analysis is that you must have some data. Choose an appropriate subject population, write items that you (and others) think measure the constructs, and then give these items to your participants. Make sure that they are engaged in doing the task.
In R, make psych and psychTools active:
library(psych)
library(psychTools)
Core R functions:
file.name <- file.choose()
my.data <- read.table(file.name)
psych functions:
my.data <- read.file() #locate the file to read using your normal system
or read from a remote file: e.g.,
file.name <-
"https://personality-project.org/r/psych/HowTo/scoring.tutorial/small.msq.txt"
my.data <- read.file(file.name)
## Data from the .txt file https://personality-project.org/r/psych/HowTo/scoring.tutorial/small.msq.txt has been loaded.
### checking basic qualities of the data
dim(my.data) #how many rows (subjects) and columns (variables)
## [1] 200 14
headTail(my.data) #show the first and last few rows
## active alert aroused sleepy tired drowsy anxious jittery nervous calm relaxed at.ease gender drug
## 1 2 2 0 0 1 1 1 1 0 2 2 2 2 2
## 2 1 1 0 2 2 2 1 0 1 2 2 2 2 1
## 3 1 1 1 3 3 2 2 0 1 2 2 2 1 1
## 4 1 2 0 1 1 0 0 0 0 2 2 2 2 2
## ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
## 197 0 0 0 3 2 2 1 0 0 1 2 1 <NA> <NA>
## 198 2 0 1 3 3 3 0 0 1 2 3 3 <NA> <NA>
## 199 1 1 1 1 1 0 0 0 0 3 3 3 <NA> <NA>
## 200 0 1 0 3 3 3 0 0 0 3 2 1 <NA> <NA>
describe(my.data) #basic descriptive statistics
## vars n mean sd median trimmed mad min max range skew kurtosis se
## active 1 200 0.86 0.84 1.0 0.78 1.48 0 3 3 0.62 -0.45 0.06
## alert 2 199 0.97 0.85 1.0 0.90 1.48 0 3 3 0.54 -0.42 0.06
## aroused 3 200 0.52 0.76 0.0 0.38 0.00 0 3 3 1.30 0.92 0.05
## sleepy 4 198 1.60 1.08 2.0 1.62 1.48 0 3 3 -0.05 -1.31 0.08
## tired 5 200 1.78 1.01 2.0 1.86 1.48 0 3 3 -0.30 -1.04 0.07
## drowsy 6 200 1.50 1.08 1.0 1.49 1.48 0 3 3 0.02 -1.28 0.08
## anxious 7 200 0.60 0.83 0.0 0.45 0.00 0 3 3 1.31 1.05 0.06
## jittery 8 200 0.66 0.87 0.0 0.50 0.00 0 3 3 1.22 0.63 0.06
## nervous 9 200 0.44 0.77 0.0 0.25 0.00 0 3 3 1.81 2.63 0.05
## calm 10 200 1.55 0.91 2.0 1.57 1.48 0 3 3 -0.06 -0.80 0.06
## relaxed 11 199 1.67 0.86 2.0 1.71 1.48 0 3 3 -0.20 -0.63 0.06
## at.ease 12 198 1.40 0.96 1.0 1.38 1.48 0 3 3 0.08 -0.96 0.07
## gender 13 136 1.54 0.50 2.0 1.55 0.00 1 2 1 -0.18 -1.98 0.04
## drug 14 136 1.50 0.50 1.5 1.50 0.74 1 2 1 0.00 -2.01 0.04
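Note that n varies from variable to variable (199 for alert, 136 for gender and drug): some responses are missing. A quick check using base R:
colSums(is.na(my.data)) #count the missing values in each column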
The first 12 variables are clearly affective/mood variables. Should
we form one, two, or many scales from these data? One solution is to
examine the data in terms of their dimensionality. fa
will
do a factor analysis.
We can compare a 1 factor solution to a 2 factor solution.
f1 <- fa(my.data[,1:12]) #just look at the mood variables
f1
## Factor Analysis using method = minres
## Call: fa(r = my.data[, 1:12])
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 h2 u2 com
## active 0.53 0.282 0.72 1
## alert 0.64 0.414 0.59 1
## aroused 0.55 0.304 0.70 1
## sleepy -0.71 0.499 0.50 1
## tired -0.65 0.421 0.58 1
## drowsy -0.65 0.416 0.58 1
## anxious 0.53 0.281 0.72 1
## jittery 0.61 0.375 0.62 1
## nervous 0.46 0.208 0.79 1
## calm -0.37 0.138 0.86 1
## relaxed -0.29 0.087 0.91 1
## at.ease -0.29 0.084 0.92 1
##
## MR1
## SS loadings 3.51
## Proportion Var 0.29
##
## Mean item complexity = 1
## Test of the hypothesis that 1 factor is sufficient.
##
## df null model = 66 with the objective function = 6.17 with Chi Square = 1198.51
## df of the model are 54 and the objective function was 3.65
##
## The root mean square of the residuals (RMSR) is 0.21
## The df corrected root mean square of the residuals is 0.23
##
## The harmonic n.obs is 199 with the empirical chi square 1119.48 with prob < 5.9e-199
## The total n.obs was 200 with Likelihood Chi Square = 707.08 with prob < 1.4e-114
##
## Tucker Lewis Index of factoring reliability = 0.293
## RMSEA index = 0.246 and the 90 % confidence intervals are 0.231 0.263
## BIC = 420.97
## Fit based upon off diagonal values = 0.66
## Measures of factor score adequacy
## MR1
## Correlation of (regression) scores with factors 0.92
## Multiple R square of scores with factors 0.85
## Minimum correlation of possible factor scores 0.69
The one factor solution does not fit very well (RMSEA = 0.246, TLI = 0.29). Compare this to a two factor solution:
f2 <- fa(my.data[,1:12],2) #specify that we want two factors
## Loading required namespace: GPArotation
f2
## Factor Analysis using method = minres
## Call: fa(r = my.data[, 1:12], nfactors = 2)
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 h2 u2 com
## active 0.65 0.07 0.41 0.59 1.0
## alert 0.73 0.00 0.53 0.47 1.0
## aroused 0.52 -0.13 0.30 0.70 1.1
## sleepy -0.77 0.06 0.60 0.40 1.0
## tired -0.79 -0.05 0.61 0.39 1.0
## drowsy -0.76 -0.03 0.58 0.42 1.0
## anxious 0.25 -0.54 0.39 0.61 1.4
## jittery 0.34 -0.54 0.46 0.54 1.7
## nervous 0.13 -0.62 0.43 0.57 1.1
## calm 0.01 0.72 0.52 0.48 1.0
## relaxed 0.12 0.75 0.55 0.45 1.0
## at.ease 0.13 0.76 0.56 0.44 1.1
##
## MR1 MR2
## SS loadings 3.25 2.70
## Proportion Var 0.27 0.22
## Cumulative Var 0.27 0.50
## Proportion Explained 0.55 0.45
## Cumulative Proportion 0.55 1.00
##
## With factor correlations of
## MR1 MR2
## MR1 1.00 -0.15
## MR2 -0.15 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## df null model = 66 with the objective function = 6.17 with Chi Square = 1198.51
## df of the model are 43 and the objective function was 1.65
##
## The root mean square of the residuals (RMSR) is 0.09
## The df corrected root mean square of the residuals is 0.11
##
## The harmonic n.obs is 199 with the empirical chi square 223.16 with prob < 3.6e-26
## The total n.obs was 200 with Likelihood Chi Square = 319.12 with prob < 7.6e-44
##
## Tucker Lewis Index of factoring reliability = 0.623
## RMSEA index = 0.179 and the 90 % confidence intervals are 0.161 0.198
## BIC = 91.29
## Fit based upon off diagonal values = 0.93
## Measures of factor score adequacy
## MR1 MR2
## Correlation of (regression) scores with factors 0.94 0.92
## Multiple R square of scores with factors 0.87 0.84
## Minimum correlation of possible factor scores 0.75 0.69
fa.plot(f2, labels = colnames(my.data)[1:12])
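Rather than just comparing the fits of the two solutions, we can also ask the data how many factors to extract. Parallel analysis compares the observed eigen values with those from random data; a minimal sketch using fa.parallel:
fa.parallel(my.data[,1:12]) #scree of the mood items with simulated and resampled data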
Create scoring keys for a 1 and 2 factor solution
In order to score the data, we need to specify which items we want to use and in which direction to score them.
We can do this by hand, or use a simple function (see the factor2cluster sketch after the keys are listed below). Let's do it by hand, calling the two factors energy and tension.
keys <- list(onefactor = cs(active,alert, aroused, -sleepy,-tired, -drowsy,anxious, jittery,nervous, -calm,-relaxed, -at.ease),
energy=cs(active,alert, aroused, -sleepy,-tired, -drowsy),
tension =cs(anxious, jittery,nervous, -calm,-relaxed, -at.ease))
keys
## $onefactor
## [1] "active" "alert" "aroused" "-sleepy" "-tired" "-drowsy" "anxious" "jittery" "nervous"
## [10] "-calm" "-relaxed" "-at.ease"
##
## $energy
## [1] "active" "alert" "aroused" "-sleepy" "-tired" "-drowsy"
##
## $tension
## [1] "anxious" "jittery" "nervous" "-calm" "-relaxed" "-at.ease"
We can use these keys to score the items. There are several functions
that will do this. scoreFast
just finds the mean for each
set of items. scoreItems
does this as well, but also reports various
statistics. scoreOverlap
will not find scores, but will
find the correlations of the scales, correcting for item overlap.
For each of these functions, we specify the scoring keys and the
data.
simple.scores <- scoreFast(keys,my.data)
dim(simple.scores)
## [1] 200 3
scales <- scoreItems(keys,my.data) #more complicated output
scales #does not correct for overlapping items
## Call: scoreItems(keys = keys, items = my.data)
##
## (Unstandardized) Alpha:
## onefactor energy tension
## alpha 0.82 0.85 0.83
##
## Standard errors of unstandardized Alpha:
## onefactor energy tension
## ASE 0.027 0.033 0.037
##
## Average item correlation:
## onefactor energy tension
## average.r 0.28 0.49 0.44
##
## Median item correlation:
## onefactor energy tension
## 0.27 0.48 0.40
##
## Guttman 6* reliability:
## onefactor energy tension
## Lambda.6 0.89 0.89 0.85
##
## Signal/Noise based upon av.r :
## onefactor energy tension
## Signal/Noise 4.6 5.9 4.7
##
## Scale intercorrelations corrected for attenuation
## raw correlations below the diagonal, alpha on the diagonal
## corrected correlations above the diagonal:
## onefactor energy tension
## onefactor 0.82 0.97 0.91
## energy 0.81 0.85 0.25
## tension 0.75 0.21 0.83
##
## Average adjusted correlations within and between scales (MIMS)
## onfct enrgy tensn
## onefactor 0.28
## energy 0.28 0.49
## tension 0.22 0.10 0.44
##
## Average adjusted item x scale correlations within and between scales (MIMT)
## onfct enrgy tensn
## onefactor 0.58
## energy 0.61 0.76
## tension 0.55 0.16 0.73
##
## In order to see the item by scale loadings and frequency counts of the data
## print with the short option = FALSE
names(scales) #many objects are returned
## [1] "scores" "missing" "alpha" "av.r" "sn" "n.items"
## [7] "item.cor" "cor" "corrected" "G6" "item.corrected" "response.freq"
## [13] "raw" "ase" "med.r" "keys" "MIMS" "MIMT"
## [19] "Call"
scales.scores <- scales$scores # get the scores
cor2(simple.scores,scales.scores)
## onefactor energy tension
## onefactor-A 1.00 0.81 0.74
## energy-A 0.81 1.00 0.21
## tension-A 0.75 0.22 1.00
overlap <- scoreOverlap(keys, my.data) #corrects for item overlap
overlap
## Call: scoreOverlap(keys = keys, r = my.data)
##
## (Standardized) Alpha:
## onefactor energy tension
## 0.82 0.85 0.83
##
## (Standardized) G6*:
## onefactor energy tension
## 0.74 0.68 0.68
##
## Average item correlation:
## onefactor energy tension
## 0.28 0.49 0.44
##
## Median item correlation:
## onefactor energy tension
## 0.28 0.49 0.40
##
## Number of items:
## onefactor energy tension
## 12 6 6
##
## Signal to Noise ratio based upon average r and n
## onefactor energy tension
## 4.7 5.8 4.8
##
## Scale intercorrelations corrected for item overlap and attenuation
## adjusted for overlap correlations below the diagonal, alpha on the diagonal
## corrected correlations above the diagonal:
## onefactor energy tension
## onefactor 0.82 0.80 0.78
## energy 0.67 0.85 0.26
## tension 0.65 0.22 0.83
##
## Percentage of keyed items with highest absolute correlation with scale (scale quality)
## onefactor energy tension
## 0.083 1.000 0.833
##
## Average adjusted correlations within and between scales (MIMS)
## onfct enrgy tensn
## onefactor 0.28
## energy 0.30 0.49
## tension 0.28 0.12 0.44
##
## Average adjusted item x scale correlations within and between scales (MIMT)
## onfct enrgy tensn
## onefactor 0.60
## energy 0.63 0.81
## tension 0.58 0.20 0.75
##
## In order to see the item by scale loadings and frequency counts of the data
## print with the short option = FALSE
names(overlap)
## [1] "cor" "sd" "corrected" "alpha" "av.r" "size" "sn" "G6"
## [9] "item.cor" "med.r" "quality" "MIMS" "MIMT" "Call"
scales$cor #the cor object of scales
## onefactor energy tension
## onefactor 1.0000000 0.8090324 0.7457055
## energy 0.8090324 1.0000000 0.2116871
## tension 0.7457055 0.2116871 1.0000000
overlap$cor #the cor object of overlap
## onefactor energy tension
## onefactor 1.0000000 0.6719402 0.6464668
## energy 0.6719402 1.0000000 0.2216840
## tension 0.6464668 0.2216840 1.0000000
lowerCor(scales$scores) #the observed correlations
## onfct enrgy tensn
## onefactor 1.00
## energy 0.81 1.00
## tension 0.75 0.21 1.00
We will join the scales.scores output with the original data (I am
using cbind to get around a problem with vJoin).
scales.drugs <- cbind(scales.scores,my.data[,13:14])
describe(scales.drugs)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## onefactor 1 200 1.04 0.53 1.00 1.00 0.49 0 2.83 2.83 0.85 1.12 0.04
## energy 2 200 1.08 0.72 1.08 1.05 0.86 0 3.00 3.00 0.36 -0.43 0.05
## tension 3 200 1.01 0.63 1.00 0.96 0.49 0 3.00 3.00 0.88 0.81 0.04
## gender 4 136 1.54 0.50 2.00 1.55 0.00 1 2.00 1.00 -0.18 -1.98 0.04
## drug 5 136 1.50 0.50 1.50 1.50 0.74 1 2.00 1.00 0.00 -2.01 0.04
cd <-cohen.d(scales.drugs ~ drug)
error.dots(cd)
Although the energy and tension scales are independent, they are both affected by the drug manipulation.
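To see the effect sizes numerically as well as graphically, we can simply print the cohen.d object created above:
cd #the standardized mean differences of each scale by drug condition, with confidence intervals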
We can use a technique known as factor extension
to see
how the drug effect fits into the two dimensional space of the items.
Factor extension was originally developed to handle the case of variables not measured in the original factor analysis.
The fa.extend
function allows us to do this.
f2e <- fa.extend(my.data, 2, ov =1:12, ev =14)
f2e #show it
## Factor Analysis using method = minres
## Call: fa.extend(r = my.data, nfactors = 2, ov = 1:12, ev = 14)
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 h2 u2 com
## active 0.65 0.07 0.41 0.59 1.0
## alert 0.73 0.00 0.53 0.47 1.0
## aroused 0.52 -0.13 0.30 0.70 1.1
## sleepy -0.77 0.06 0.60 0.40 1.0
## tired -0.79 -0.05 0.61 0.39 1.0
## drowsy -0.76 -0.03 0.58 0.42 1.0
## anxious 0.25 -0.54 0.39 0.61 1.4
## jittery 0.34 -0.54 0.46 0.54 1.7
## nervous 0.13 -0.62 0.43 0.57 1.1
## calm 0.01 0.72 0.52 0.48 1.0
## relaxed 0.12 0.75 0.55 0.45 1.0
## at.ease 0.13 0.76 0.56 0.44 1.1
## drug 0.40 -0.19 0.22 0.78 1.4
##
## MR1 MR2
## SS loadings 3.43 2.74
## Proportion Var 0.26 0.21
## Cumulative Var 0.26 0.47
## Proportion Explained 0.56 0.44
## Cumulative Proportion 0.56 1.00
##
## With factor correlations of
## MR1 MR2
## MR1 1.00 -0.15
## MR2 -0.15 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## df null model = 78 with the objective function = 6.46 with Chi Square = 1251.6
## df of the model are 53 and the objective function was 1.71
##
## The root mean square of the residuals (RMSR) is 0.09
## The df corrected root mean square of the residuals is 0.11
##
## The harmonic n.obs is 186 with the empirical chi square 228.55 with prob < 1.2e-23
## The total n.obs was 200 with Likelihood Chi Square = 330.08 with prob < 1.1e-41
##
## Tucker Lewis Index of factoring reliability = 0.65
## RMSEA index = 0.162 and the 90 % confidence intervals are 0.146 0.179
## BIC = 49.27
## Fit based upon off diagonal values = 0.94
## Measures of factor score adequacy
## MR1 MR2
## Correlation of (regression) scores with factors 0.94 0.92
## Multiple R square of scores with factors 0.88 0.85
## Minimum correlation of possible factor scores 0.76 0.69
fa.plot(f2e, labels = colnames(my.data)) #plot it
diagram(f2e, ic=TRUE ) #another way to show it
First, a little function to draw reliability and validity as target shooting:
"bullseye" <- function(x,y,n) {
for(i in 1:n) {dia.ellipse(x,y,e.size=i)}}
"rel.val" <- function(n,sdx=.2,bars=TRUE,arrow.len=.05) {
if(n>20) {pc <- "."} else {pc <- 16}
plot(NA,xlim=c(0,10),ylim=c(0,10),axes=FALSE,xlab="",ylab="",main="Reliability and Validity as target shooting")
#Reliable and valid
x=3
y=2
bullseye(x,y,4)
x1 <- x + rnorm(n,0,sdx)
y1 <- y + rnorm(n,0,sdx)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Reliable and valid")
#unReliable and invalid
x=7
y=7
bullseye(x,y,4)
x1 <- x + rnorm(n,1,1)
y1 <- y + rnorm(n,1,1)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Unreliable and Invalid")
#reliable and invalid
x=7
y=2
bullseye(x,y,4)
x1 <- x + rnorm(n,1,sdx)
y1 <- y + rnorm(n,1,sdx)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars)error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Reliable and Invalid")
#unreliable, but valid
x=3
y=7
bullseye(x,y,4)
x1 <- x + rnorm(n,0,1)
y1 <- y + rnorm(n,0,1)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Unreliable but Valid")
}
rel.val(6)
The spi
data set has 135 personality items to measure
the Big 5
as well as a set of 27 lower level
factors. There are also 10 criteria.
We can score the Big 5 scales and then use these to predict the criteria using linear regression.
keys <- spi.keys[1:5] #just use the big 5 keys
b5 <- scoreFast(keys,spi)
# use them to predict criteria
b5.crit <- vJoin(spi[1:10], b5)
colnames(b5.crit)
## [1] "age" "sex" "health" "p1edu" "p2edu" "education" "wellness" "exer"
## [9] "smoke" "ER" "Agree-A" "Consc-A" "Neuro-A" "Extra-A" "Open-A"
Now apply a linear regression, predicting the first 10 variables from the big 5 scores. Do not show the plot of the results.
model <- lmCor(y=1:10,x=11:15, data=b5.crit, plot=FALSE)
summary(model)
##
## Multiple Regression from raw data
## lmCor(y = 1:10, x = 11:15, data = b5.crit, plot = FALSE)
##
## Multiple Regression from matrix input
##
## Beta weights
## age sex health p1edu p2edu education wellness exer smoke ER
## (Intercept) 0.00 0.000 0.0000 0.000 0.000 0.000 0.0000 0.0000 0.000 0.000
## Agree-A 0.16 0.162 0.0063 0.015 0.014 0.116 0.0631 -0.0053 -0.083 -0.025
## Consc-A 0.13 0.103 0.1715 -0.034 -0.049 0.065 0.1053 0.1613 -0.082 0.016
## Neuro-A -0.14 0.286 -0.2721 -0.036 -0.033 -0.147 0.0302 -0.1247 0.058 0.131
## Extra-A -0.11 0.086 0.1436 0.047 0.061 -0.086 0.0918 0.0876 0.084 0.050
## Open-A 0.12 -0.122 0.0126 0.058 0.057 0.142 0.0031 0.0675 0.090 -0.012
##
## Multiple R
## age sex health p1edu p2edu education wellness exer smoke ER
## 0.306 0.360 0.405 0.098 0.109 0.264 0.170 0.267 0.181 0.133
##
## Multiple R2
## age sex health p1edu p2edu education wellness exer smoke ER
## 0.0939 0.1296 0.1642 0.0096 0.0118 0.0699 0.0288 0.0711 0.0329 0.0176
##
## Cohen's set correlation R2
## [1] 0.4
##
## Squared Canonical Correlations
## [1] 0.2394 0.1332 0.0620 0.0298 0.0079 0.0000
Organize the results in a better way: sort by R2.
ord <- order(model$R2)
model.s <- lmCor(y=ord, x=11:15,data=b5.crit,plot=FALSE)
summary(model.s)
##
## Multiple Regression from raw data
## lmCor(y = ord, x = 11:15, data = b5.crit, plot = FALSE)
##
## Multiple Regression from matrix input
##
## Beta weights
## p1edu p2edu ER wellness smoke education exer age sex health
## (Intercept) 0.000 0.000 0.000 0.0000 0.000 0.000 0.0000 0.00 0.000 0.0000
## Agree-A 0.015 0.014 -0.025 0.0631 -0.083 0.116 -0.0053 0.16 0.162 0.0063
## Consc-A -0.034 -0.049 0.016 0.1053 -0.082 0.065 0.1613 0.13 0.103 0.1715
## Neuro-A -0.036 -0.033 0.131 0.0302 0.058 -0.147 -0.1247 -0.14 0.286 -0.2721
## Extra-A 0.047 0.061 0.050 0.0918 0.084 -0.086 0.0876 -0.11 0.086 0.1436
## Open-A 0.058 0.057 -0.012 0.0031 0.090 0.142 0.0675 0.12 -0.122 0.0126
##
## Multiple R
## p1edu p2edu ER wellness smoke education exer age sex health
## 0.098 0.109 0.133 0.170 0.181 0.264 0.267 0.306 0.360 0.405
##
## Multiple R2
## p1edu p2edu ER wellness smoke education exer age sex health
## 0.0096 0.0118 0.0176 0.0288 0.0329 0.0699 0.0711 0.0939 0.1296 0.1642
##
## Cohen's set correlation R2
## [1] 0.4
##
## Squared Canonical Correlations
## [1] 0.2394 0.1332 0.0620 0.0298 0.0079 0.0000
Any data analytic technique will best fit the observed data. But what happens when we apply it to a new data set? This is the process of cross validation.
One typical technique is to split the sample into two parts, derive the model on one part, and then apply that model to the second part.
The crossValidation
function will help us do this.
First, we divide the sample into two random halves, derive the model on the first half, and then apply it to the second half.
n.obs <- NROW(spi)
set.seed(42) #for reproducible results
ss <- sample(n.obs, n.obs/2) #sampling without replacement
derivation.model <- model <- lmCor(y=ord,x=11:15, data=b5.crit[ss,], plot=FALSE)
summary(derivation.model)
##
## Multiple Regression from raw data
## lmCor(y = ord, x = 11:15, data = b5.crit[ss, ], plot = FALSE)
##
## Multiple Regression from matrix input
##
## Beta weights
## p1edu p2edu ER wellness smoke education exer age sex health
## (Intercept) 0.000 0.000 0.0000 0.000 0.000 0.000 0.000 0.00 0.000 0.000
## Agree-A 0.021 0.018 -0.0176 0.054 -0.071 0.103 -0.020 0.14 0.161 -0.037
## Consc-A -0.061 -0.067 0.0091 0.081 -0.095 0.062 0.194 0.12 0.088 0.166
## Neuro-A -0.016 -0.029 0.1377 0.036 0.066 -0.155 -0.145 -0.13 0.299 -0.263
## Extra-A 0.053 0.072 0.0417 0.084 0.077 -0.085 0.095 -0.10 0.088 0.163
## Open-A 0.072 0.078 -0.0259 0.028 0.082 0.142 0.079 0.14 -0.118 0.015
##
## Multiple R
## p1edu p2edu ER wellness smoke education exer age sex health
## 0.11 0.13 0.14 0.15 0.18 0.26 0.31 0.30 0.36 0.39
##
## Multiple R2
## p1edu p2edu ER wellness smoke education exer age sex health
## 0.013 0.018 0.020 0.021 0.033 0.069 0.093 0.088 0.132 0.155
##
## Cohen's set correlation R2
## [1] 0.41
##
## Squared Canonical Correlations
## [1] 0.2455 0.1230 0.0638 0.0388 0.0095 0.0000
#now cross validate
cross.v <- crossValidation(derivation.model, data=b5.crit[-ss,])
summary(cross.v)
## Cross Validation
## Call:crossValidation(model = derivation.model, data = b5.crit[-ss,
## ])
##
## Validities from raw items and from the correlation matrix
## Number of unique predictors used = 5
## items mat
## p1edu 0.072 0.072
## p2edu 0.082 0.082
## ER 0.122 0.122
## wellness 0.189 0.187
## smoke 0.179 0.180
## education 0.262 0.267
## exer 0.227 0.226
## age 0.313 0.313
## sex 0.355 0.355
## health 0.408 0.411
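To see the shrinkage directly, we can line up the derivation multiple Rs with their cross validated values (a quick sketch using the objects created above, which keep the criteria in the same ord order):
round(data.frame(derivation = derivation.model$R, cross.validated = cross.v$crossV$items),2)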
These cross validated values are pretty good. But what would happen if we had smaller samples for the derivation set?
Widaman and Revelle (2023) have discussed this problem in a comparison with factor scores. Here we take their code and apply it just to the regression problem.
selectB5 <- selectFromKeys(spi.keys[1:5])
b5 <- scoreItems(spi.keys[1:5], spi)
#Create samples of 250 and 1000
#find factor scores and unit weighted scores
#use these to find multiple R
#then, use the empirical factor weights from each sample to cross validate to the rest
big.list <- list()
N=250
N1=1000
set.seed(42) #this makes for reproducible results
n.iter<- 100
for (i in 1: n.iter) {
ss <- sample(NROW(spi),NROW(spi))
lastss <- length(ss)
f5.samp.250 <- fa(spi[ss[1:N],selectB5],5) # based upon the first 250
f5.samp.1000 <- fa(spi[ss[1:N1],selectB5],5) # based upon the first 1000
b5.scores <- scoreItems(spi.keys[1:5],spi[ss,])$scores
b5.samp.250 <- b5.scores[1:N,]
b5.samp.1000 <- b5.scores[1:N1, ]
#now find the validities of these 4 sets of 5 scales
df250 <- cbind(spi[ss[1:N],1:10], f5.samp.250$scores, b5.samp.250)
df1000 <- cbind(spi[ss[1:N1],1:10], f5.samp.1000$scores, b5.samp.1000)
model.f5.250 <- lmCor(1:10,11:15, data=df250, plot=FALSE)
model.b5.250 <- lmCor(1:10,16:20, data=df250, plot=FALSE)
model.f5.1000 <- lmCor(1:10,11:15, data=df1000, plot=FALSE)
model.b5.1000 <- lmCor(1:10,16:20, data=df1000, plot=FALSE)
derivation <- data.frame(model.f5.250$R,model.b5.250$R,model.f5.1000$R,model.b5.1000$R)
#now, cross validate
f5s.scores.250 <- factor.scores(spi[ss[(N+1):lastss], selectB5],f5.samp.250)$scores #scores for the holdout cases, using the weights from the first 250
f5s.scores.1000 <- factor.scores(spi[ss[(N1+1):lastss],selectB5],f5.samp.1000)$scores #scores for the holdout cases, using the weights from the first 1000
b5s.scores.250 <- b5.scores[(N+1):lastss,] #they are already randomized
b5s.scores.1000 <- b5.scores[(N1+1):lastss,]
dfs.250 <- cbind(spi[ss[(N+1):lastss],1:10],f5s.scores.250,b5s.scores.250)
dfs.1000 <- cbind(spi[ss[(N1+1):lastss],1:10], f5s.scores.1000,b5s.scores.1000)
#but we want to use the derivation model when we cross validate, otherwise it is not cross validation
#use the crossValidation function
model.f5.250c <- crossValidation(model.f5.250, data=dfs.250)
model.b5.250c <- crossValidation(model.b5.250, data=dfs.250)
model.f5.1000c <- crossValidation(model.f5.1000, data=dfs.1000)
model.b5.1000c <- crossValidation(model.b5.1000, data=dfs.1000)
#combine the results
validation <- data.frame(cvf.250=model.f5.250c$crossV$items,
cvu.250=model.b5.250c$crossV$items,
cvf.1000=model.f5.1000c$crossV$items,
cvu.1000=model.b5.1000c$crossV$items)
big.list[[i]] <- cbind(derivation,validation)
}
#We have done the derivation and replications for 100 samples.
#now, take the 100 replications and summarize them
sum.df <- matrix(unlist(big.list),byrow=TRUE,ncol=80)
sum.df <- as.data.frame(sum.df)
dv <- cs(f250,u250,f1000,u1000,cf250,cu250,cf1000,cu1000)
dv8 <- rep(dv,each=10)
colnames(sum.df) <- paste0(rep(colnames(spi)[1:10],8),dv8)
dd <- describe(sum.df)
ord <- order(dd$mean[71:80],decreasing=FALSE)
#now, repeat this reordering for the 8 cases
#ord is now the order of the best to worst unit weighted cross validated values
#error.bars(sum.df[ord])
# error.bars(sum.df[ord+10],add=TRUE)
#organize for a table
sort.lab <- colnames(spi)[ord]
new.ord <- rep(ord,8) + rep(c(0,10,20,30,40,50, 60,70),each=10)
new.dd <- dd[new.ord,]
sort.lab20 <- c(sort.lab,sort.lab)
wide.dd <- data.frame(sort.lab20,new.dd[c(1:10,41:50),3:4], new.dd[c(11:20,51:60),3:4], new.dd[c(21:30,61:70),3:4],new.dd[c(31:40,71:80),3:4])
rn <- paste0(rep(cs(Derivation,Validation),each=10),colnames(spi)[ord])
rownames(wide.dd)<- rn
colnames(wide.dd )<- c("Variable","Mean F 250","SD F 250","Mean U 250", "SD U 250","Mean F 1000","SD F 1000",
"Mean U 1000", "SD U 1000")
#first draw the derivation R for the four conditions, then draw the cross validated R (separate graphs)
error.bars(sum.df[ord+30],add=FALSE,lty="solid",type="l",labels=sort.lab,col=rep("black",10 ),main="Multiple R for 10 criteria on derivation sample", xlab="", ylab="Cross validated R", ylim=c(0.05,.45) , sd=FALSE) #unit weighted 1000
## Warning in axis(1, 1:z, names, ...): graphical parameter "type" is obsolete
#factor scores 1000
error.bars(sum.df[ord+20],add=TRUE,lty="dashed",type="l",labels=sort.lab,col=rep("blue",10 ), ylim=c(0.05,.45), sd= FALSE)
#factor scores 250
error.bars(sum.df[ord],add=TRUE,lty="dotted",type="l",labels=sort.lab,col=rep("green",10 ),sd= FALSE)
#unit weighted 250
error.bars(sum.df[ord+10],add=TRUE,lty="dashed",type="l",col=rep("red",10), sd= FALSE)
#cross validated R
error.bars(sum.df[ord+70],add=FALSE,lty="solid",type="l",labels=sort.lab,col=rep("black",10 ),main="Cross validated Multiple R for 10 criteria", xlab="", ylab="Cross validated R", ylim=c(0.0,.45) , sd=FALSE) #unit weighted 1000
## Warning in axis(1, 1:z, names, ...): graphical parameter "type" is obsolete
#factor scores 1000
error.bars(sum.df[ord+60],add=TRUE,lty="dashed",type="l",labels=sort.lab,col=rep("blue",10 ), ylim=c(0.00,.45), sd= FALSE)
#factor scores 250
error.bars(sum.df[ord+40],add=TRUE,lty="dotted",type="l",labels=sort.lab,col=rep("green",10 ),sd= FALSE)
#unit weighted 250
error.bars(sum.df[ord+50],add=TRUE,lty="dashed",type="l",col=rep("red",10), sd= FALSE)
legend("topleft",c( "Unit weighted 1000", "Unit weighed 250", "Factor scores 1000", "Factor scores 250"),
lty=cs(solid,dashed,dashed,dotted), col=cs(black,red,blue,green))