---
title: "305.week6 Working with real data"
author: "William Revelle"
date: '2022-05-02'
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=100)
```

# First some preliminaries

The 'psych' and 'psychTools' packages have been updated to versions 2.2.5. You need to install them and then restart RStudio. After doing so,

```{r}
library(psych)
library(psychTools)
sessionInfo()  #to make sure you have the most recent release
```

Alice Eagly and I have been working on a manuscript examining the effect of data aggregation on the validity of scales. The paper, "Understanding the Magnitude of Psychological Differences Between Women and Men Requires Seeing the Forest and the Trees" (in press, https://personality-project.org/revelle/publications/eagly.revelle.21.pdf), uses data from an article by Ursula Athenstaedt (2003). Here we show the steps necessary to convert a raw data set into the final, publication-ready graphics that appear in our paper. In the process of final proofreading, we discovered some errors in the way we had processed the data. This is a nice example of the need to document your scripts and to double check (and then check again) your analyses.

## Getting the original data

The data are in an SPSS sav file that I have put into the course files. We need to get the name of that file, and then read it.

```{r}
fn <- "https://personality-project.org/courses/350/3studien.sav"
ursala <- read.file(fn)
dim(ursala)  #how many subjects (rows) and variables (columns)
colnames(ursala)
describe(ursala)  #always describe the data!
```

# Cleaning up the data

We see several things from the description of the file. Although the items are supposed to range from 1 to 7, many of them include a 9, and one even includes a 43. The items are in columns 3 to 76, with some scale information after that. We will 'scrub' the data to convert all out-of-range values to NA and then describe the data again. We do this by specifying that the maximum value we want is 7. We then want to keep only the 'complete.cases' (those with no missing values). We do this by finding which subjects are complete, and then making up the new data set with just those cases.

```{r}
cleaned <- scrub(ursala, where=3:107, max=7)
complete <- complete.cases(cleaned)
cleaned <- cleaned[complete,]  #just those with complete cases
dim(cleaned)
table(cleaned[,2])  #how many cases?
```

We seem to have fewer subjects than Athenstaedt reported (310 females, 266 males). Perhaps we should scrub somewhat fewer variables? (After several attempts, the right range seems to be columns 3 to 92.)

```{r}
cleaned <- scrub(ursala, where=3:92, max=7)
complete <- complete.cases(cleaned)
cleaned <- cleaned[complete,]  #just those with complete cases
dim(cleaned)
table(cleaned[,2])  #how many cases?
ursala.clean <- cleaned
```

## Making a dictionary

Rather than refer to the data as merely items V1 ... V76, it would be nice to know the content of the items. We create a csv file in our friendly text editor, and then read this in as a dictionary.

```{r}
fn <- "https://personality-project.org/courses/350/Athenstaedt.dictionary.csv"
ursala.dictionary <- read.file(fn)  #automatically reads csv
rownames(ursala.dictionary) <- ursala.dictionary[,1]
headTail(ursala.dictionary, top=8, bottom=8)
```

Although Athenstaedt gave 76 items, her subsequent analysis just used 53 of these. We select those items and then do some basic analyses. One of those items ("put on make up") shows a much bigger effect size than any of the others, and we drop it to make the data more reasonable.
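Because we set the item labels as the rownames of the dictionary, we can also look up the text of any item by name. A minimal sketch (not part of the original analysis), using a few arbitrary item names:

```{r}
#a quick sketch: look up item content by name, using the rownames assigned above
ursala.dictionary[c("V1", "V10", "V21"), ]
```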
```{r}
select <- c("V1", "V2", "V10", "V12", "V15", "V17", "V18", "V19", "V20", "V24",
     "V25", "V27", "V37", "V50", "V53", "V58", "V60", "V62", "V66", "V67", "V68")
sel <- which(colnames(ursala.clean) %in% select)
ursala.small <- ursala.clean[-sel]
cd.ursala <- cohen.d(ursala.small[2:55], group="GESCHL", dictionary=ursala.dictionary)
cd.ursala  #note that for the next analysis we just want column 2 of the dictionary
cd.ursala1 <- cohen.d(ursala.small[2:54], group="GESCHL",
     dictionary=ursala.dictionary[,2,drop=FALSE], sort="descending")
cd.ursala1
error.dots(cd.ursala1, head=26, tail=26,
     main="Athenstaedt's Behavioral Data - Effect sizes and standard errors")
abline(v=0)
```

# Identify two dimensions of the scales

Do a factor analysis, show the items, and then choose those items that have their highest loadings on each factor. We need to fix the resulting keys by hand because one item is misclassified.

```{r}
f2 <- fa(ursala.small[3:54], 2)  #this drops the make up item
fa.lookup(f2, dictionary=ursala.dictionary)
summary(f2)
ursala.keys <- factor2cluster(f2)
ursala.keys <- cbind(ursala.keys, ursala.keys[,1] - ursala.keys[,2])  #add a total
ursala.keys["V21",] <- c(1,0,1)  #a hard coded patch for the misclassified item
urs.keys <- keys2list(ursala.keys)
names(urs.keys)[3] <- "F+M behavior"
urs.keys  #show them
urs.scales <- scoreItems(urs.keys, ursala.small)
urs.scales  #show the basic statistics
expand.ursala <- data.frame(ursala.small[2:54], urs.scales$scores)
```

# What are the Cohen.d values for our scales?

Use the 'cohen.d' function for just the scale scores.

```{r}
cd.scales <- cohen.d(expand.ursala[c(1,54,55,56)], "GESCHL")
cd.scales
```

# Some basic psychometrics

We want to find the reliabilities of our new scales. We will call the 'reliability' function, which in turn calls multiple reliability functions (omega, splitHalf). Note that the 'alpha' values match what we found when we used 'scoreItems'.

```{r}
rel <- reliability(urs.keys, expand.ursala)
ursala.average.cd <- cd.validity(cd.ursala, urs.keys)
ursala.item.validity <- item.validity(expand.ursala, expand.ursala[,"GESCHL"], urs.keys)
ursala.validity <- cor(expand.ursala[c(1,54:56)])[1,2:4]
rel
ursala.average.cd  #in d units
ursala.item.validity  #in r units
```

### Predicted validity

Eagly and Revelle (2022) suggest that the predicted validity of a scale is the average item validity divided by the square root of the average within-scale correlation. The 'predicted.validity' function does this.

```{r}
predicted <- predicted.validity(expand.ursala, expand.ursala[,"GESCHL"], urs.keys)
predicted  #asymptotic values are the most we can expect
predicted$predicted  #compare these with what we actually observed
ursala.validity  #the correlations with the criterion
```

## Organize a data frame of results

This is the table to put into the manuscript. It is interesting to compare the average absolute item effect size (.67) with what we get for the pooled scale values (2.89).

```{r}
table.df <- data.frame(k=rel$result.df[,11], rel$result.df[,c(1:3,9)],
     item.val=ursala.item.validity, scale=ursala.validity,
     predicted=predicted$predicted, average.cd=ursala.average.cd,
     cd=cd.scales$cohen.d[,"effect"])
round(table.df, 2)
mean(abs(cd.ursala1$cohen.d[,"effect"]))  #how big is the average absolute effect size?
```
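The predicted validity formula can also be checked by hand. Here is a minimal sketch for the first ('MR1') scale, assuming the objects created above; it flips any negatively keyed items on the 1 to 7 response scale before averaging. This is only an illustrative check on the logic, not the original script, and the 'predicted.validity' output remains the authoritative computation.

```{r}
#a by-hand sketch of the predicted validity formula for one scale
#(an illustrative check, not part of the original analysis)
key <- urs.keys[["MR1"]]
reversed <- grepl("^-", key)     #which items are negatively keyed
items <- sub("-", "", key)       #item names with the sign stripped
X <- expand.ursala[items]
X[reversed] <- 8 - X[reversed]   #reverse score (items range from 1 to 7)
R <- cor(X)
av.r <- mean(R[lower.tri(R)])    #average within-scale correlation
item.val <- cor(X, expand.ursala[,"GESCHL"])   #item validities against the criterion
abs(mean(item.val))/sqrt(av.r)   #compare with the MR1 entry of predicted$predicted
```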
## Now do the final graphics

```{r}
cd.items <- cohen.d(expand.ursala[-56], group="GESCHL",
     dictionary=ursala.dictionary[,2,drop=FALSE])
error.dots(cd.items, head=30, tail=30,
     main="Athenstaedt's Behavioral Data \n Items and Scales", select=2:53)
error.dots(cd.items, head=30, tail=30,
     main="Athenstaedt's Behavioral Data \n Items and Scales",
     select=c(1,54), color='red', fg="red", add=TRUE)
abline(v=0)
```

As well as the scatter histogram:

```{r}
scatterHist(MR1 ~ MR2 + GESCHL, data=expand.ursala[1:55], cex.point=.4, smooth=FALSE,
     xlab="Masculine Scale", ylab="Feminine Scale", correl=FALSE, d.arrow=TRUE,
     col=c("red","blue"), lwd=4, cex.main=1.5, main="Scatter Plot and Density", cex.axis=2)
```
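To save publication-ready copies of these figures, one approach (not shown in the original script) is to wrap the plotting call in a graphics device such as 'pdf'. A minimal sketch, with an arbitrary file name:

```{r eval=FALSE}
#a sketch: write the scatterHist figure to a pdf file for the manuscript
#(the file name is arbitrary; this chunk is not evaluated when knitting)
pdf("athenstaedt.scatterHist.pdf", width=8, height=8)
scatterHist(MR1 ~ MR2 + GESCHL, data=expand.ursala[1:55], cex.point=.4, smooth=FALSE,
     xlab="Masculine Scale", ylab="Feminine Scale", correl=FALSE, d.arrow=TRUE,
     col=c("red","blue"), lwd=4, cex.main=1.5, main="Scatter Plot and Density", cex.axis=2)
dev.off()  #close the device so the file is written
```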