---
title: "350.scale construction"
author: "William Revelle"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=110)
```
`psych` and `psychTools` have had a few subtle modifications since Monday. The version numbers have not changed, but the dates have.
```{r}
#install.packages("psych",repos="http://personality-project.org/r", type="source")
#install.packages("psychTools",repos="http://personality-project.org/r", type="source")
library(psych) #make it active
library(psychTools)
sessionInfo() #psych should be 2.3.5 and psychTools should be 2.3.5
packageDate("psych"); packageDate("psychTools") #should be >= "2023-05-09"
```
# Constructing scales (Preface)
The typical measure in psychology, whether of attitudes, temperament, or ability, is composed of multiple items. This is partly because our constructs are broad and we want to capture their various nuances. It is also because the reliability of a sum (or average) of a set of items is much higher than that of any single item (Revelle and Condon, 2019, Reliability: from alpha to omega). See the class notes on classical test theory.
Thus, we form item composites known as scales. The steps for doing this are discussed in the `How to` on scoring scales linked here and in the syllabus. That `How to` links to a number of other pages that are meant to facilitate your work.
These Rmd and HTML files allow you to do each step yourself.
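As a quick illustration of the reliability point (an optional sketch, not part of the assignment), the `alpha` function estimates the reliability of a composite. Using the `bfi` data from `psychTools`, the five Agreeableness items (A1 ... A5) form a composite that is considerably more reliable than any single item.
```{r alpha sketch, eval=FALSE}
#An optional sketch: reliability of a 5 item composite
#The bfi data set (psychTools) includes five Agreeableness items, A1 ... A5
ag <- alpha(bfi[,c("A1","A2","A3","A4","A5")], check.keys=TRUE) #check.keys reverses negatively keyed items
ag$total       #alpha for the 5 item composite
ag$item.stats  #item level statistics, for comparison with the composite
```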
# Get the data and do basic data cleaning.
## Step 1: Collecting the data
The first law of data analysis is that you must have some data. Choose an appropriate subject population, write items that you (and others) think measure the constructs, and then give these items to your participants. Make sure that they are engaged in doing the task.
## Step 2: Getting the data into `R`
Make `psych` active.
```{r}
library(psych)
library(psychTools)
```
Core `R` functions:
`file.name <- file.choose()`
`my.data <- read.table(file.name)`

`psych` functions:
`my.data <- read.file()` #locate the file to read using your normal system.

Or read from a remote file, e.g.:
```{r}
file.name <-
"https://personality-project.org/r/psych/HowTo/scoring.tutorial/small.msq.txt"
my.data <- read.file(file.name)
```
### Checking basic qualities of the data
```{r}
dim(my.data) #how many rows (subjects) and columns (variables)
headTail(my.data) #show the first and last few rows
describe(my.data) #basic descriptive statistics
```
The first 12 variables are clearly affective/mood variables. Should we form one, two, or many scales from these data? One solution is to examine the data in terms of their dimensionality. `fa` will do a factor analysis.
We can compare a 1 factor solution to a 2 factor solution.
```{r one factor}
f1 <- fa(my.data[,1:12]) #just look at the mood variables
f1
```
Compare this to a two factor solution
```{r two factor}
f2 <- fa(my.data[,1:12],2) #specify that we want two factors
f2
fa.plot(f2, labels = colnames(my.data)[1:12])
```
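If we are unsure how many factors to retain, parallel analysis provides an empirical check. This is an optional sketch (not part of the original walkthrough) using `fa.parallel`, which compares the scree of the observed data with that of random data and suggests how many factors to keep.
```{r parallel analysis, eval=FALSE}
#An optional check on dimensionality: parallel analysis of the 12 mood items
fa.parallel(my.data[,1:12])
```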
## Create scoring keys for a 1 and 2 factor solution
In order to score the data, we need to specify which items we want to use and in which direction we want to score them.
We can do this by hand, or use a simple function (a sketch of the latter follows the next chunk).
Let's do it by hand. We will call the two factors energy and tension.
```{r}
keys <- list(onefactor = cs(active,alert, aroused, -sleepy,-tired, -drowsy,anxious, jittery,nervous, -calm,-relaxed, -at.ease),
energy=cs(active,alert, aroused, -sleepy,-tired, -drowsy),
tension =cs(anxious, jittery,nervous, -calm,-relaxed, -at.ease))
keys
```
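As the `simple function` alternative mentioned above, here is an optional sketch (not in the original) using `factor2cluster`, which keys each item on the factor on which it loads most highly (keeping the sign of the loading), and `keys2list`, which converts that keys matrix into the list form used here.
```{r keys from the factor solution, eval=FALSE}
#An optional alternative to building the keys by hand
auto.keys <- keys2list(factor2cluster(f2)) #key each item by its largest (signed) loading
auto.keys  #compare with the hand built keys above
```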
We can use these keys to score the items. There are several functions that will do this. `scoreFast` just finds the mean for each set of items. `scoreItems` does this as well, but also reports various statistics. `scoreOverlap` does not find scores, but finds the correlations of the scales, correcting for item overlap.
For each of these functions, we specify the scoring `keys` information.
```{r}
simple.scores <- scoreFast(keys,my.data)
dim(simple.scores)
scales <- scoreItems(keys,my.data) #more complicated output
scales #does not correct for overlapping items
names(scales) #many objects are returned
scales.scores <- scales$scores # get the scores
cor2(simple.scores,scales.scores)
overlap <- scoreOverlap(keys, my.data) #corrects for item overlap
overlap
names(overlap)
```
### Compare the scale correlations from scales and overlap
```{r}
scales$cor #the cor object of scales
overlap$cor #the cor object of overlap
lowerCor(scales$scores) #the observed correlations
```
### Examine the effect of drug on these scales
We will join the scales.scores output with the original data.
(I am using `cbind` to get around a problem with `vJoin`.)
```{r}
scales.drugs <- cbind(scales.scores,my.data[,13:14])
describe(scales.drugs)
cd <-cohen.d(scales.drugs ~ drug)
error.dots(cd)
```
Although the energy and tension scales are independent, they are both affected by the drug manipulation.
We can use a technique known as `factor extension` to see how the drug effect fits into the two-dimensional space of the items.
Factor extension was originally developed to handle the case of variables not measured in the original factor analysis.
The `fa.extend` function allows us to do this.
```{r fa.extend}
f2e <- fa.extend(my.data, 2, ov =1:12, ev =14)
f2e #show it
fa.plot(f2e, labels = colnames(my.data)) #plot it
diagram(f2e, ic=TRUE ) #another way to show it
```
# Scale validity: do they measure what they should?
## A diversion: reliability and validity
First, a little function that draws reliability and validity as target shooting.
```{r}
"bullseye" <- function(x,y,n) {
for(i in 1:n) {dia.ellipse(x,y,e.size=i)}}
"rel.val" <- function(n,sdx=.2,bars=TRUE,arrow.len=.05) {
if(n>20) {pc <- "."} else {pc <- 16}
plot(NA,xlim=c(0,10),ylim=c(0,10),axes=FALSE,xlab="",ylab="",main="Reliability and Validity as target shooting")
#Reliable and valid
x=3
y=2
bullseye(x,y,4)
x1 <- x + rnorm(n,0,sdx)
y1 <- y + rnorm(n,0,sdx)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Reliable and valid")
#unReliable and invalid
x=7
y=7
bullseye(x,y,4)
x1 <- x + rnorm(n,1,1)
y1 <- y + rnorm(n,1,1)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Unreliable and Invalid")
#reliable and invalid
x=7
y=2
bullseye(x,y,4)
x1 <- x + rnorm(n,1,sdx)
y1 <- y + rnorm(n,1,sdx)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars)error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Reliable and Invalid")
#unreliable, but valid
x=3
y=7
bullseye(x,y,4)
x1 <- x + rnorm(n,0,1)
y1 <- y + rnorm(n,0,1)
xm <- mean(x1)
ym <- mean(y1)
points(x1,y1,pch=pc)
points(xm,ym,pch=20,col="red")
if(bars) error.crosses(x1,y1,add=TRUE,arrow.len=arrow.len,labels="")
text(x,y-2,"Unreliable but Valid")
}
```
### Now use this function to show reliability and validity
```{r}
rel.val(6)
```
# Validity of Big 5 scores with 10 criteria
The `spi` data set has 135 personality items to measure the `big few` (the Big 5) as well as 27 lower level factors. There are also 10 criteria.
We can score the Big 5 scales and then use these to predict criteria using linear regression.
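Before scoring, it can help to glance at the structure of the data. This is an optional sketch (not in the original) that just reports the dimensions of `spi` and the names of the scoring keys.
```{r spi overview, eval=FALSE}
#An optional look at the spi data and its scoring keys
dim(spi)             #how many subjects and variables
names(spi.keys)      #the Big 5 keys come first, followed by the 27 lower level factors
colnames(spi)[1:10]  #the 10 criterion variables
```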
## First score the scales
```{r}
keys <- spi.keys[1:5] #just use the big 5 keys
b5 <- scoreFast(keys,spi)
# use them to predict criteria
b5.crit <- vJoin(spi[1:10], b5)
colnames(b5.crit)
```
Now apply a linear regression, predicting the first 10 variables from the big 5 scores. Do not show the plot of the results.
```{r}
model <- lmCor(y=1:10,x=11:15, data=b5.crit, plot=FALSE)
summary(model)
```
Organize the results in a better way. Sort by R2
```{r}
ord <- order(model$R2)
model.s <- lmCor(y=ord, x=11:15,data=b5.crit,plot=FALSE)
summary(model.s)
```
### Cross validate the predictions
Any data analytic technique will best fit the observed data. But what happens when we apply it to a new data set? This is the process of cross validation.
One typical technique is to split the sample into two parts, derive the model on one part, and then apply it to the second part.
The `crossValidation` function will help us do this.
First, we divide the sample into two random parts, derive the model on the first part, and then apply it to the second part.
```{r}
n.obs <- NROW(spi)
set.seed(42) #for reproducible results
ss <- sample(n.obs, n.obs/2) #sampling without replacement
derivation.model <- model <- lmCor(y=ord,x=11:15, data=b5.crit[ss,], plot=FALSE)
summary(derivation.model)
#now cross validate
cross.v <- crossValidation(derivation.model, data=b5.crit[-ss,])
summary(cross.v)
```
Those are pretty good. But what would happen if we had smaller samples for the derivation set?
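Before running the full simulation, a single split with a small derivation sample gives a feel for the problem. This is an optional sketch (not from the original) that derives the model on 250 randomly chosen cases and cross validates on the remaining cases, reusing the `ord` and `b5.crit` objects from above.
```{r small derivation sample, eval=FALSE}
#An optional single run with a small (n = 250) derivation sample
small.ss <- sample(n.obs, 250)  #250 cases for derivation
small.model <- lmCor(y=ord, x=11:15, data=b5.crit[small.ss,], plot=FALSE)
small.cv <- crossValidation(small.model, data=b5.crit[-small.ss,]) #apply to the remaining cases
summary(small.cv)
```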
### Try 100 samples of size 250
Widaman and Revelle (2023) have discussed this problem in a comparison of factor scores with unit weighted scale scores. Here we take their code and apply it just to the regression problem.
```{r}
selectB5 <- selectFromKeys(spi.keys[1:5])
b5 <- scoreItems(spi.keys[1:5], spi)
#Create samples of 250 and 1000
#find factor scores and unit weighted scores
#use these to find multiple R
#then, use the empirical factor weights from each sample to cross validate on the rest of the sample
big.list <- list()
N=250
N1=1000
set.seed(42) #this makes for reproducible results
n.iter<- 100
for (i in 1:n.iter) {
  ss <- sample(NROW(spi),NROW(spi))  #a random permutation of all subjects
  lastss <- length(ss)
  f5.samp.250 <- fa(spi[ss[1:N],selectB5],5)    #based upon the first 250
  f5.samp.1000 <- fa(spi[ss[1:N1],selectB5],5)  #based upon the first 1000
  b5.scores <- scoreItems(spi.keys[1:5],spi[ss,])$scores
  b5.samp.250 <- b5.scores[1:N,]
  b5.samp.1000 <- b5.scores[1:N1, ]
  #now find the validities of these 4 sets of 5 scales
  df250 <- cbind(spi[ss[1:N],1:10], f5.samp.250$scores, b5.samp.250)
  df1000 <- cbind(spi[ss[1:N1],1:10], f5.samp.1000$scores, b5.samp.1000)
  model.f5.250 <- lmCor(1:10,11:15, data=df250, plot=FALSE)
  model.b5.250 <- lmCor(1:10,16:20, data=df250, plot=FALSE)
  model.f5.1000 <- lmCor(1:10,11:15, data=df1000, plot=FALSE)
  model.b5.1000 <- lmCor(1:10,16:20, data=df1000, plot=FALSE)
  derivation <- data.frame(model.f5.250$R,model.b5.250$R,model.f5.1000$R,model.b5.1000$R)
  #now, cross validate
  f5s.scores.250 <- factor.scores(spi[ss[(N+1):lastss], selectB5],f5.samp.250)$scores   #scores for the remaining cases, weighted by the first 250
  f5s.scores.1000 <- factor.scores(spi[ss[(N1+1):lastss],selectB5],f5.samp.1000)$scores #scores for the remaining cases, weighted by the first 1000
  b5s.scores.250 <- b5.scores[(N+1):lastss,]  #they are already randomized
  b5s.scores.1000 <- b5.scores[(N1+1):lastss,]
  dfs.250 <- cbind(spi[ss[(N+1):lastss],1:10],f5s.scores.250,b5s.scores.250)
  dfs.1000 <- cbind(spi[ss[(N1+1):lastss],1:10], f5s.scores.1000,b5s.scores.1000)
  #but we want to use the derivation model when we cross validate, otherwise it is not cross validation
  #use the crossValidation function
  model.f5.250c <- crossValidation(model.f5.250, data=dfs.250)
  model.b5.250c <- crossValidation(model.b5.250, data=dfs.250)
  model.f5.1000c <- crossValidation(model.f5.1000, data=dfs.1000)
  model.b5.1000c <- crossValidation(model.b5.1000, data=dfs.1000)
  #combine the results
  validation <- data.frame(cvf.250=model.f5.250c$crossV$items,
                           cvu.250=model.b5.250c$crossV$items,
                           cvf.1000=model.f5.1000c$crossV$items,
                           cvu.1000=model.b5.1000c$crossV$items)
  big.list[[i]] <- cbind(derivation,validation)
}
```
## Organize these results
```{r}
#We have done the derivation and replications for 100 samples.
#now, take the 100 replications and summarize them
sum.df <- matrix(unlist(big.list),byrow=TRUE,ncol=80)
sum.df <- as.data.frame(sum.df)
dv <- cs(f250,u250,f1000,u1000,cf250,cu250,cf1000,cu1000)
dv8 <- rep(dv,each=10)
colnames(sum.df) <- paste0(rep(colnames(spi)[1:10],8),dv8)
dd <- describe(sum.df)
ord <- order(dd$mean[71:80],decreasing=FALSE)
#now, repeat this reordering for the 8 cases
#ord orders the criteria by their unit weighted cross validated values
#error.bars(sum.df[ord])
# error.bars(sum.df[ord+10],add=TRUE)
#organize the results for a table
sort.lab <- colnames(spi)[ord]
new.ord <- rep(ord,8) + rep(c(0,10,20,30,40,50, 60,70),each=10)
new.dd <- dd[new.ord,]
sort.lab20 <- c(sort.lab,sort.lab)
wide.dd <- data.frame(sort.lab20,new.dd[c(1:10,41:50),3:4], new.dd[c(11:20,51:60),3:4], new.dd[c(21:30,61:70),3:4],new.dd[c(31:40,71:80),3:4])
rn <- paste0(rep(cs(Derivation,Validation),each=10),colnames(spi)[ord])
rownames(wide.dd)<- rn
colnames(wide.dd )<- c("Variable","Mean F 250","SD F 250","Mean U 250", "SD U 250","Mean F 1000","SD F 1000",
"Mean U 1000", "SD U 1000")
#first draw the derivation R for the four conditions, then draw the cross validated R (separate graphs)
error.bars(sum.df[ord+30],add=FALSE,lty="solid",type="l",labels=sort.lab,col=rep("black",10 ),main="Multiple R for 10 criteria on derivation sample", xlab="", ylab="Cross validated R", ylim=c(0.05,.45) , sd=FALSE) #unit weighted 1000
#factor scores 1000
error.bars(sum.df[ord+20],add=TRUE,lty="dashed",type="l",labels=sort.lab,col=rep("blue",10 ), ylim=c(0.05,.45), sd= FALSE)
#factor scores 250
error.bars(sum.df[ord],add=TRUE,lty="dotted",type="l",labels=sort.lab,col=rep("green",10 ),sd= FALSE)
#unit weighted 250
error.bars(sum.df[ord+10],add=TRUE,lty="dashed",type="l",col=rep("red",10), sd= FALSE)
#cross validated R
error.bars(sum.df[ord+70],add=FALSE,lty="solid",type="l",labels=sort.lab,col=rep("black",10 ),main="Cross validated Multiple R for 10 criteria", xlab="", ylab="Cross validated R", ylim=c(0.0,.45) , sd=FALSE) #unit weighted 1000
#factor scores 1000
error.bars(sum.df[ord+60],add=TRUE,lty="dashed",type="l",labels=sort.lab,col=rep("blue",10 ), ylim=c(0.00,.45), sd= FALSE)
#factor scores 250
error.bars(sum.df[ord+40],add=TRUE,lty="dotted",type="l",labels=sort.lab,col=rep("green",10 ),sd= FALSE)
#unit weighted 250
error.bars(sum.df[ord+50],add=TRUE,lty="dashed",type="l",col=rep("red",10), sd= FALSE)
legend("topleft",c( "Unit weighted 1000", "Unit weighed 250", "Factor scores 1000", "Factor scores 250"),
lty=cs(solid,dashed,dashed,dotted), col=cs(black,red,blue,green))
```