---
title: "305.week6 Working with real data"
author: "William Revelle"
date: '2022-05-02'
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=100)
```

# First some preliminaries

The 'psych' and 'psychTools' packages have been updated to versions 2.2.5. You need to install them and then restart RStudio. After doing so,

```{r}
library(psych)
library(psychTools)
sessionInfo()  #to make sure you have the most recent release
```

Alice Eagly and I have been working on a manuscript examining the effect of data aggregation on the validity of scales. The paper, "Understanding the Magnitude of Psychological Differences Between Women and Men Requires Seeing the Forest and the Trees" (in press, https://personality-project.org/revelle/publications/eagly.revelle.21.pdf), uses data from an article by Ursula Athenstaedt (2003). Here we show the steps necessary to convert a raw data set into the final, publication-ready graphics that appear in our paper. In the process of final proofreading, we discovered some errors in the way we had processed the data. This is a nice example of the need to document your scripts and to double check (and then check again) your analyses.

## Getting the original data

The data are in an SPSS sav file that I have put into the course files. We need to get the name of that file, and then read it.

```{r}
fn <- "https://personality-project.org/courses/350/3studien.sav"
ursala <- read.file(fn)
dim(ursala)  #how many subjects (rows) and variables (columns)
colnames(ursala)
describe(ursala)  #always describe the data!
```

# Cleaning up the data

We see several things from the description of the file. Although the items are supposed to range from 1 to 7, many of them include a 9, and one even includes a 43. The items are in columns 3 to 76, with some scale information after that. We will 'scrub' the data to convert all out-of-range values to NA and then describe the data again. We do this by specifying that the maximum value we want is 7. We then want to keep only the 'complete.cases' (those with no missing values). We do this by finding which subjects are complete, and then making up the new data set with just those cases.

```{r}
cleaned <- scrub(ursala, where=3:107, max=7)
complete <- complete.cases(cleaned)
cleaned <- cleaned[complete,]  #just those with complete cases
dim(cleaned)
table(cleaned[,2])  #how many cases?
```

We seem to have fewer subjects than Athenstaedt reported (310 females, 266 males). Perhaps we should scrub somewhat fewer variables? (After several attempts, the right range seems to be columns 3 to 92.)

```{r}
cleaned <- scrub(ursala, where=3:92, max=7)
complete <- complete.cases(cleaned)
cleaned <- cleaned[complete,]  #just those with complete cases
dim(cleaned)
table(cleaned[,2])  #how many cases?
ursala.clean <- cleaned
```

## Making a dictionary

Rather than refer to the data as merely items V1 ... V76, it would be nice to know the content of the items. We create a csv file in our friendly text editor, and then read this in as a dictionary.

```{r}
fn <- "https://personality-project.org/courses/350/Athenstaedt.dictionary.csv"
ursala.dictionary <- read.file(fn)  #automatically reads csv
rownames(ursala.dictionary) <- ursala.dictionary[,1]
headTail(ursala.dictionary, top=8, bottom=8)
```

Although Athenstaedt gave 76 items, her subsequent analysis just used 53 of these. We select those items and then do some basic analyses. One of those items ("put on make up") shows a much bigger effect size than any of the others, and we drop it to make the data more reasonable.
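Because we set the item labels as the rownames of the dictionary, we can also look up the text of any item by name. A minimal sketch (not part of the original analysis), using a few arbitrary item names:

```{r}
#a quick sketch: look up item content by name, using the rownames assigned above
ursala.dictionary[c("V1", "V10", "V21"), ]
```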
```{r}
select <- c("V1", "V2", "V10", "V12", "V15", "V17", "V18", "V19", "V20", "V24",
     "V25", "V27", "V37", "V50", "V53", "V58", "V60", "V62", "V66", "V67", "V68")
sel <- which(colnames(ursala.clean) %in% select)
ursala.small <- ursala.clean[-sel]
cd.ursala <- cohen.d(ursala.small[2:55], group="GESCHL", dictionary=ursala.dictionary)
cd.ursala  #note that for the next analysis we just want column 2 of the dictionary
cd.ursala1 <- cohen.d(ursala.small[2:54], group="GESCHL",
     dictionary=ursala.dictionary[,2,drop=FALSE], sort="descending")
cd.ursala1
error.dots(cd.ursala1, head=26, tail=26,
     main="Athenstaedt's Behavioral Data - Effect sizes and standard errors")
abline(v=0)
```

# Identify two dimensions of the scales

Do a factor analysis, show the items, and then choose those items that have their highest loadings on each factor. We need to fix the resulting keys by hand because one item is misclassified.

```{r}
f2 <- fa(ursala.small[3:54], 2)  #this drops the make up item
fa.lookup(f2, dictionary=ursala.dictionary)
summary(f2)
ursala.keys <- factor2cluster(f2)
ursala.keys <- cbind(ursala.keys, ursala.keys[,1] - ursala.keys[,2])  #add a total
ursala.keys["V21",] <- c(1,0,1)  #a hard coded patch for the misclassified item
urs.keys <- keys2list(ursala.keys)
names(urs.keys)[3] <- "F+M behavior"
urs.keys  #show them
urs.scales <- scoreItems(urs.keys, ursala.small)
urs.scales  #show the basic statistics
expand.ursala <- data.frame(ursala.small[2:54], urs.scales$scores)
```

# What are the Cohen.d values for our scales?

Use the 'cohen.d' function for just the scale scores.

```{r}
cd.scales <- cohen.d(expand.ursala[c(1,54,55,56)], "GESCHL")
cd.scales
```

# Some basic psychometrics

We want to find the reliabilities of our new scales. We will call the 'reliability' function, which in turn calls multiple reliability functions (omega, splitHalf). Note that the 'alpha' values match what we found when we used 'scoreItems'.

```{r}
rel <- reliability(urs.keys, expand.ursala)
ursala.average.cd <- cd.validity(cd.ursala, urs.keys)
ursala.item.validity <- item.validity(expand.ursala, expand.ursala[,"GESCHL"], urs.keys)
ursala.validity <- cor(expand.ursala[c(1,54:56)])[1,2:4]
rel
ursala.average.cd  #in d units
ursala.item.validity  #in r units
```

### Predicted validity

Eagly and Revelle (2022) suggest that the predicted validity of a scale is the average item validity divided by the square root of the average within-scale correlation. The 'predicted.validity' function does this.

```{r}
predicted <- predicted.validity(expand.ursala, expand.ursala[,"GESCHL"], urs.keys)
predicted  #asymptotic values are the most we can expect
predicted$predicted  #compare these with what we actually observed
ursala.validity  #the correlations with the criterion
```

## Organize a data frame of results

This is the table to put into the manuscript. It is interesting to compare the average absolute item effect size (.67) with what we get for the pooled scale values (2.89).

```{r}
table.df <- data.frame(k=rel$result.df[,11], rel$result.df[,c(1:3,9)],
     item.val=ursala.item.validity, scale=ursala.validity,
     predicted=predicted$predicted, average.cd=ursala.average.cd,
     cd=cd.scales$cohen.d[,"effect"])
round(table.df, 2)
mean(abs(cd.ursala1$cohen.d[,"effect"]))  #how big is the average absolute effect size?
```
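The predicted validity formula can also be checked by hand. Here is a minimal sketch for the first ('MR1') scale, assuming the objects created above; it flips any negatively keyed items on the 1 to 7 response scale before averaging. This is only an illustrative check on the logic, not the original script, and the 'predicted.validity' output remains the authoritative computation.

```{r}
#a by-hand sketch of the predicted validity formula for one scale
#(an illustrative check, not part of the original analysis)
key <- urs.keys[["MR1"]]
reversed <- grepl("^-", key)     #which items are negatively keyed
items <- sub("-", "", key)       #item names with the sign stripped
X <- expand.ursala[items]
X[reversed] <- 8 - X[reversed]   #reverse score (items range from 1 to 7)
R <- cor(X)
av.r <- mean(R[lower.tri(R)])    #average within-scale correlation
item.val <- cor(X, expand.ursala[,"GESCHL"])   #item validities against the criterion
abs(mean(item.val))/sqrt(av.r)   #compare with the MR1 entry of predicted$predicted
```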
## Now do the final graphics

```{r}
cd.items <- cohen.d(expand.ursala[-56], group="GESCHL",
     dictionary=ursala.dictionary[,2,drop=FALSE])
error.dots(cd.items, head=30, tail=30,
     main="Athenstaedt's Behavioral Data \n Items and Scales", select=2:53)
error.dots(cd.items, head=30, tail=30,
     main="Athenstaedt's Behavioral Data \n Items and Scales",
     select=c(1,54), color='red', fg="red", add=TRUE)
abline(v=0)
```

As well as the scatter histogram:

```{r}
scatterHist(MR1 ~ MR2 + GESCHL, data=expand.ursala[1:55], cex.point=.4, smooth=FALSE,
     xlab="Masculine Scale", ylab="Feminine Scale", correl=FALSE, d.arrow=TRUE,
     col=c("red","blue"), lwd=4, cex.main=1.5, main="Scatter Plot and Density", cex.axis=2)
```
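To save publication-ready copies of these figures, one approach (not shown in the original script) is to wrap the plotting call in a graphics device such as 'pdf'. A minimal sketch, with an arbitrary file name:

```{r eval=FALSE}
#a sketch: write the scatterHist figure to a pdf file for the manuscript
#(the file name is arbitrary; this chunk is not evaluated when knitting)
pdf("athenstaedt.scatterHist.pdf", width=8, height=8)
scatterHist(MR1 ~ MR2 + GESCHL, data=expand.ursala[1:55], cex.point=.4, smooth=FALSE,
     xlab="Masculine Scale", ylab="Feminine Scale", correl=FALSE, d.arrow=TRUE,
     col=c("red","blue"), lwd=4, cex.main=1.5, main="Scatter Plot and Density", cex.axis=2)
dev.off()  #close the device so the file is written
```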