---
title: "350.scale construction"
author: "William Revelle"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=110)
```
`psych` and `psychTools` have had a few subtle modifications since Monday. The version numbers have not changed, but the dates have.
```{r}
#install.packages("psych",repos="http://personality-project.org/r", type="source")
#install.packages("psychTools",repos="http://personality-project.org/r", type="source")
library(psych) #make it active
library(psychTools)
sessionInfo() #psych should be 2.3.5 and psychTools should be 2.3.5
packageDate("psych"); packageDate("psychTools") #should be >= "2023-05-09"
```
# Constructing scales (Preface)
The typical measure in psychology, whether of attitudes, temperament, or ability, is composed of multiple items. This is partly because our constructs are broad and we want to capture their various nuances. It is also because the reliability of a sum (or average) of a set of items is much higher than that of any single item (Revelle and Condon, 2019, Reliability: from alpha to omega). See the class notes on classical test theory.
Thus, we form item composites known as scales. The steps for doing this are discussed in the `How to` on scoring scales linked here and in the syllabus. That `How to` links to a number of other pages that are meant to facilitate your work.
These Rmd and HTML files allow you to do each step yourself.
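As a quick illustration of the reliability point (an optional sketch, not part of the assignment), the `alpha` function estimates the reliability of a composite. Using the `bfi` data from `psychTools`, the five Agreeableness items (A1 ... A5) form a composite that is considerably more reliable than any single item.
```{r alpha sketch, eval=FALSE}
#An optional sketch: reliability of a 5 item composite
#The bfi data set (psychTools) includes five Agreeableness items, A1 ... A5
ag <- alpha(bfi[,c("A1","A2","A3","A4","A5")], check.keys=TRUE) #check.keys reverses negatively keyed items
ag$total       #alpha for the 5 item composite
ag$item.stats  #item level statistics, for comparison with the composite
```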
# Get the data and do basic data cleaning.
## Step 1: Collecting the data
The first law of data analysis is that you must have some data. Choose an appropriate subject population, write items that you (and others) think measure the constructs, and then give these items to your participants. Make sure that they are engaged in doing the task.
## Step 2: Getting the data into `R`
Make `psych` active.
```{r}
library(psych)
library(psychTools)
```
Core `R` functions:
`file.name <- file.choose()`
`my.data <- read.table(file.name)`

`psych` functions:
`my.data <- read.file()` #locate the file to read using your normal system.

Or read from a remote file, e.g.:
```{r}
file.name <-
"https://personality-project.org/r/psych/HowTo/scoring.tutorial/small.msq.txt"
my.data <- read.file(file.name)
```
### Checking basic qualities of the data
```{r}
dim(my.data) #how many rows (subjects) and columns (variables)
headTail(my.data) #show the first and last few rows
describe(my.data) #basic descriptive statistics
```
The first 12 variables are clearly affective/mood variables. Should we form one, two, or many scales from these data? One solution is to examine the data in terms of their dimensionality. `fa` will do a factor analysis.
We can compare a 1 factor solution to a 2 factor solution.
```{r one factor}
f1 <- fa(my.data[,1:12]) #just look at the mood variables
f1
```
Compare this to a two factor solution
```{r two factor}
f2 <- fa(my.data[,1:12],2) #specify that we want two factors
f2
fa.plot(f2, labels = colnames(my.data)[1:12])
```
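If we are unsure how many factors to retain, parallel analysis provides an empirical check. This is an optional sketch (not part of the original walkthrough) using `fa.parallel`, which compares the scree of the observed data with that of random data and suggests how many factors to keep.
```{r parallel analysis, eval=FALSE}
#An optional check on dimensionality: parallel analysis of the 12 mood items
fa.parallel(my.data[,1:12])
```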
## Create scoring keys for a 1 and 2 factor solution
In order to score the data, we need to specify which items we want to use and in which direction we want to score them.
We can do this by hand, or use a simple function (a sketch of the latter follows the next chunk).
Let's do it by hand. We will call the two factors energy and tension.
```{r}
keys <- list(onefactor = cs(active,alert, aroused, -sleepy,-tired, -drowsy,anxious, jittery,nervous, -calm,-relaxed, -at.ease),
energy=cs(active,alert, aroused, -sleepy,-tired, -drowsy),
tension =cs(anxious, jittery,nervous, -calm,-relaxed, -at.ease))
keys
```
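As the `simple function` alternative mentioned above, here is an optional sketch (not in the original) using `factor2cluster`, which keys each item on the factor on which it loads most highly (keeping the sign of the loading), and `keys2list`, which converts that keys matrix into the list form used here.
```{r keys from the factor solution, eval=FALSE}
#An optional alternative to building the keys by hand
auto.keys <- keys2list(factor2cluster(f2)) #key each item by its largest (signed) loading
auto.keys  #compare with the hand built keys above
```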
We can use these keys to score the items. There are several functions that will do this. `scoreFast` just finds the mean for each set of items. `scoreItems` does this as well, but also reports various statistics. `scoreOverlap` does not find scores, but finds the correlations of the scales, correcting for item overlap.
For each of these functions, we specify the scoring `keys` information.
```{r}
simple.scores <- scoreFast(keys,my.data)
dim(simple.scores)
scales <- scoreItems(keys,my.data) #more complicated output
scales #does not correct for overlapping items
names(scales) #many objects are returned
scales.scores <- scales$scores # get the scores
cor2(simple.scores,scales.scores)
overlap <- scoreOverlap(keys, my.data) #corrects for item overlap
overlap
names(overlap)
```
### Compare the scale correlations from scales and overlap
```{r}
scales$cor #the cor object of scales
overlap$cor #the cor object of overlap
lowerCor(scales$scores) #the observed correlations
```
### Examine the effect of drug on these scales
We will join the scales.scores output with the original data.
(I am using `cbind` to get around a problem with `vJoin`.)
```{r}
scales.drugs <- cbind(scales.scores,my.data[,13:14])
describe(scales.drugs)
cd <-cohen.d(scales.drugs ~ drug)
error.dots(cd)
```
Although the energy and tension scales are independent, they are both affected by the drug manipulation.
We can use a technique known as `factor extension` to see how the drug effect fits into the two-dimensional space of the items.
Factor extension was originally developed to handle the case of variables not measured in the original factor analysis.
The `fa.extend` function allows us to do this.
```{r fa.extend}
f2e <- fa.extend(my.data, 2, ov =1:12, ev =14)
f2e #show it
fa.plot(f2e, labels = colnames(my.data)) #plot it
diagram(f2e, ic=TRUE ) #another way to show it
```
# Scale validity: do they measure what they should?
## A diversion: reliability and validity
First, a little function that draws reliability and validity as target shooting.
```{r}
"bullseye" <- function(x,y,n) {
for(i in 1:n) {dia.ellipse(x,y,e.size=i)}}
"rel.val" <- function(n,sdx=.2,bars=TRUE,arrow.len=.05) {
if(n>20) {pc <- "."} else {pc <- 16}
plot(NA,xlim=c(0,10),ylim=c(0,10),axes=FALSE,xlab="",ylab="",main="Reliability and Validity as target shooting")
#Reliable and valid
x=3
y=2
bullseye(x,y,4)
x1 <- x + rnorm(n,0,sdx)
y1 <- y + rnorm(n,0,sdx)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Reliable and valid")
#unReliable and invalid
x=7
y=7
bullseye(x,y,4)
x1 <- x + rnorm(n,1,1)
y1 <- y + rnorm(n,1,1)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Unreliable and Invalid")
#reliable and invalid
x=7
y=2
bullseye(x,y,4)
x1 <- x + rnorm(n,1,sdx)
y1 <- y + rnorm(n,1,sdx)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars)error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Reliable and Invalid")
#unreliable, but valid
x=3
y=7
bullseye(x,y,4)
x1 <- x + rnorm(n,0,1)
y1 <- y + rnorm(n,0,1)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Unreliable but Valid")
}
```
### Now use this function to show reliability and validity
```{r}
rel.val(6)
```
# Validity of Big 5 scores with 10 criteria
The `spi` data set has 135 personality items to measure the `big few` (the Big 5) as well as 27 lower level factors. There are also 10 criteria.
We can score the Big 5 scales and then use these to predict criteria using linear regression.
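Before scoring, it can help to glance at the structure of the data. This is an optional sketch (not in the original) that just reports the dimensions of `spi` and the names of the scoring keys.
```{r spi overview, eval=FALSE}
#An optional look at the spi data and its scoring keys
dim(spi)             #how many subjects and variables
names(spi.keys)      #the Big 5 keys come first, followed by the 27 lower level factors
colnames(spi)[1:10]  #the 10 criterion variables
```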
## First score the scales
```{r}
keys <- spi.keys[1:5] #just use the big 5 keys
b5 <- scoreFast(keys,spi)
# use them to predict criteria
b5.crit <- vJoin(spi[1:10], b5)
colnames(b5.crit)
```
Now apply a linear regression, predicting the first 10 variables from the big 5 scores. Do not show the plot of the results.
```{r}
model <- lmCor(y=1:10,x=11:15, data=b5.crit, plot=FALSE)
summary(model)
```
Organize the results in a better way. Sort by R2
```{r}
ord <- order(model$R2)
model.s <- lmCor(y=ord, x=11:15,data=b5.crit,plot=FALSE)
summary(model.s)
```
### Cross validate the predictions
Any data analytic technique will best fit the observed data. But what happens when we apply it to a new data set? This is the process of cross validation.
One typical technique is to split the sample into two parts, derive the model on one part, and then apply it to the second part.
The `crossValidation` function will help us do this.
First, we divide the sample into two random parts, derive the model on the first part, and then apply it to the second part.
```{r}
n.obs <- NROW(spi)
set.seed(42) #for reproducible results
ss <- sample(n.obs, n.obs/2) #sampling without replacement
derivation.model <- model <- lmCor(y=ord,x=11:15, data=b5.crit[ss,], plot=FALSE)
summary(derivation.model)
#now cross validate
cross.v <- crossValidation(derivation.model, data=b5.crit[-ss,])
summary(cross.v)
```
Those are pretty good. But what would happen if we had smaller samples for the derivation set?
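Before running the full simulation, a single split with a small derivation sample gives a feel for the problem. This is an optional sketch (not from the original) that derives the model on 250 randomly chosen cases and cross validates on the remaining cases, reusing the `ord` and `b5.crit` objects from above.
```{r small derivation sample, eval=FALSE}
#An optional single run with a small (n = 250) derivation sample
small.ss <- sample(n.obs, 250)  #250 cases for derivation
small.model <- lmCor(y=ord, x=11:15, data=b5.crit[small.ss,], plot=FALSE)
small.cv <- crossValidation(small.model, data=b5.crit[-small.ss,]) #apply to the remaining cases
summary(small.cv)
```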
### Try 100 samples of size 250
Widaman and Revelle (2023) have discussed this problem in a comparison of factor scores with unit weighted scale scores. Here we take their code and apply it just to the regression problem.
```{r}
selectB5 <- selectFromKeys(spi.keys[1:5])
b5 <- scoreItems(spi.keys[1:5], spi)
#Create samples of 250 and 1000
#find factor scores and unit weighted scores
#use these to find multiple R
#then, use the empirical factor weights from each sample to cross validate on the rest of the sample
big.list <- list()
N=250
N1=1000
set.seed(42) #this makes for reproducible results
n.iter<- 100
for (i in 1:n.iter) {
  ss <- sample(NROW(spi),NROW(spi))  #a random permutation of all subjects
  lastss <- length(ss)
  f5.samp.250 <- fa(spi[ss[1:N],selectB5],5)    #based upon the first 250
  f5.samp.1000 <- fa(spi[ss[1:N1],selectB5],5)  #based upon the first 1000
  b5.scores <- scoreItems(spi.keys[1:5],spi[ss,])$scores
  b5.samp.250 <- b5.scores[1:N,]
  b5.samp.1000 <- b5.scores[1:N1, ]
  #now find the validities of these 4 sets of 5 scales
  df250 <- cbind(spi[ss[1:N],1:10], f5.samp.250$scores, b5.samp.250)
  df1000 <- cbind(spi[ss[1:N1],1:10], f5.samp.1000$scores, b5.samp.1000)
  model.f5.250 <- lmCor(1:10,11:15, data=df250, plot=FALSE)
  model.b5.250 <- lmCor(1:10,16:20, data=df250, plot=FALSE)
  model.f5.1000 <- lmCor(1:10,11:15, data=df1000, plot=FALSE)
  model.b5.1000 <- lmCor(1:10,16:20, data=df1000, plot=FALSE)
  derivation <- data.frame(model.f5.250$R,model.b5.250$R,model.f5.1000$R,model.b5.1000$R)
  #now, cross validate
  f5s.scores.250 <- factor.scores(spi[ss[(N+1):lastss], selectB5],f5.samp.250)$scores   #scores for the remaining cases, weighted by the first 250
  f5s.scores.1000 <- factor.scores(spi[ss[(N1+1):lastss],selectB5],f5.samp.1000)$scores #scores for the remaining cases, weighted by the first 1000
  b5s.scores.250 <- b5.scores[(N+1):lastss,]  #they are already randomized
  b5s.scores.1000 <- b5.scores[(N1+1):lastss,]
  dfs.250 <- cbind(spi[ss[(N+1):lastss],1:10],f5s.scores.250,b5s.scores.250)
  dfs.1000 <- cbind(spi[ss[(N1+1):lastss],1:10], f5s.scores.1000,b5s.scores.1000)
  #but we want to use the derivation model when we cross validate, otherwise it is not cross validation
  #use the crossValidation function
  model.f5.250c <- crossValidation(model.f5.250, data=dfs.250)
  model.b5.250c <- crossValidation(model.b5.250, data=dfs.250)
  model.f5.1000c <- crossValidation(model.f5.1000, data=dfs.1000)
  model.b5.1000c <- crossValidation(model.b5.1000, data=dfs.1000)
  #combine the results
  validation <- data.frame(cvf.250=model.f5.250c$crossV$items,
                           cvu.250=model.b5.250c$crossV$items,
                           cvf.1000=model.f5.1000c$crossV$items,
                           cvu.1000=model.b5.1000c$crossV$items)
  big.list[[i]] <- cbind(derivation,validation)
}
```
## Organize these results
```{r}
#We have done the derivation and replications for 100 samples.
#now, take the 100 replications and summarize them
sum.df <- matrix(unlist(big.list),byrow=TRUE,ncol=80)
sum.df <- as.data.frame(sum.df)
dv <- cs(f250,u250,f1000,u1000,cf250,cu250,cf1000,cu1000)
dv8 <- rep(dv,each=10)
colnames(sum.df) <- paste0(rep(colnames(spi)[1:10],8),dv8)
dd <- describe(sum.df)
ord <- order(dd$mean[71:80],decreasing=FALSE)
#now, repeat this reordering for the 8 cases
#ord orders the criteria by their unit weighted cross validated values
#error.bars(sum.df[ord])
# error.bars(sum.df[ord+10],add=TRUE)
#organize the results for a table
sort.lab <- colnames(spi)[ord]
new.ord <- rep(ord,8) + rep(c(0,10,20,30,40,50, 60,70),each=10)
new.dd <- dd[new.ord,]
sort.lab20 <- c(sort.lab,sort.lab)
wide.dd <- data.frame(sort.lab20,new.dd[c(1:10,41:50),3:4], new.dd[c(11:20,51:60),3:4], new.dd[c(21:30,61:70),3:4],new.dd[c(31:40,71:80),3:4])
rn <- paste0(rep(cs(Derivation,Validation),each=10),colnames(spi)[ord])
rownames(wide.dd)<- rn
colnames(wide.dd )<- c("Variable","Mean F 250","SD F 250","Mean U 250", "SD U 250","Mean F 1000","SD F 1000",
"Mean U 1000", "SD U 1000")
#first draw the derivation R for the four conditions, then draw the cross validated R (separate graphs)
error.bars(sum.df[ord+30],add=FALSE,lty="solid",type="l",labels=sort.lab,col=rep("black",10 ),main="Multiple R for 10 criteria on derivation sample", xlab="", ylab="Cross validated R", ylim=c(0.05,.45) , sd=FALSE) #unit weighted 1000
#factor scores 1000
error.bars(sum.df[ord+20],add=TRUE,lty="dashed",type="l",labels=sort.lab,col=rep("blue",10 ), ylim=c(0.05,.45), sd= FALSE)
#factor scores 250
error.bars(sum.df[ord],add=TRUE,lty="dotted",type="l",labels=sort.lab,col=rep("green",10 ),sd= FALSE)
#unit weighted 250
error.bars(sum.df[ord+10],add=TRUE,lty="dashed",type="l",col=rep("red",10), sd= FALSE)
#cross validated R
error.bars(sum.df[ord+70],add=FALSE,lty="solid",type="l",labels=sort.lab,col=rep("black",10 ),main="Cross validated Multiple R for 10 criteria", xlab="", ylab="Cross validated R", ylim=c(0.0,.45) , sd=FALSE) #unit weighted 1000
#factor scores 1000
error.bars(sum.df[ord+60],add=TRUE,lty="dashed",type="l",labels=sort.lab,col=rep("blue",10 ), ylim=c(0.00,.45), sd= FALSE)
#factor scores 250
error.bars(sum.df[ord+40],add=TRUE,lty="dotted",type="l",labels=sort.lab,col=rep("green",10 ),sd= FALSE)
#unit weighted 250
error.bars(sum.df[ord+50],add=TRUE,lty="dashed",type="l",col=rep("red",10), sd= FALSE)
legend("topleft",c( "Unit weighted 1000", "Unit weighed 250", "Factor scores 1000", "Factor scores 250"),
lty=cs(solid,dashed,dashed,dotted), col=cs(black,red,blue,green))
```