--- title: "350 week 8 other topics" author: "William Revelle" date: "`r Sys.Date()`" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) options(width=100) ``` ## Some miscellaneous topics #### Getting the attibute information from an SPSS.sav file. #### A function to do Hare voting #### Bootstrapping a regression to find confidence intervals ## SPSS .sav files, the attributes feature Although it is easy to read an SPSS.sav file to get the basic data, this does not recover the complex coding system embedded in SPSS. Consider the `glbwarm.sav' data file in the class folder. We will read the data in two different ways, one just getting the values, the other getting the various encodings saved by SPSS. It turns out the the normal (in psych) read.file function does actually get the attribute information after all. (Who knew?) ```{r} library(psych) library(psychTools) fn <- 'https://personality-project.org/courses/350/datasets/glbwarm.sav' glb <- read.file(fn) #this gets the data and puts into a data.frame dim(glb) names(glb) colnames(glb) attributes(glb) library(foreign) #allows for more complicated reading of files glb.sav <- read.spss(file=fn) #this uses the default values and saves everything as a list names(glb.sav) dim(glb.sav) #it is a list and has a length but not a dim attributes(glb.sav) ``` ### Why is this relevant? One frequently wants to report the item names that go with a data set. SPSS allows us to store the information with the data. Normally, we include this information is a separate file (e.g. a dictionary associated with a data set). e.g. 'fa.lookup' needs a dictionary. # Hare scaling of voting behavior Some voting systems and some organizations using a preference based voting system to choose officers. his is sometimes known as 'Hare' voting. I was asked to take a set of rankings and find convert them to a choice model. ```{r} #consider the following ranks data <- matrix(NA,20,6) set.seed(42) for (i in 1:20) { data[i,]<- sample(6,6) } colnames(data) <- letters[1:NCOL(data)] rownames(data) <- paste0("S",1:NROW(data)) dfOrder(data) #show it describe(data) apply(data,2,table) colMeans(data) #what is the mean ranking ``` # Thurstonian scaling of these data ```{r} thurstone(data,ranks=TRUE) ``` ## Lets write a function to do Hare voting ```{r} #do hare ranking hare <- function(x, n=1) { #loop over candidates #find the person(s) with the fewest 1 place votes while(NCOL(x) > n) { #for(iter in (1:(NCOL(x)-n))) { #continue until n +1 are left count.first <- colSums(apply(x,2,function(x) x==1), na.rm=TRUE) min.count <- min(count.first) which.min <- which(count.first == min.count) which.min <- which.min[1] #just do one candidate at a time #reallocate their votes who.min <- which(x[,which.min]==1, arr.ind=TRUE) #the row is the person we want if(NROW(who.min )> 0) { if(!is.null(dim(who.min))) who.min <- who.min[,1] for(i in (1:length(who.min))) { x[who.min[i],] <- x[who.min[i],] -1} } #reallocate their votes x <- x[,-which.min,drop=FALSE] #drop the lowest vote getter #do it again until the NCOL = the number we want } if(NCOL(x) > 1){ count.first <- colSums(apply(x,2,function(x) x==1), na.rm=TRUE) result <- sort(count.first,decreasing=TRUE)[1:n]} else {result <- colSums(x==1,na.rm=TRUE) } return(result)} #show the winners ``` # Try the function with our data ```{r} hare(data,6) hare(data,5) hare(data,4) hare(data,3) #the top 3 hare(data,2) #the top 2 hare(data) #just one winner ``` # Another problem Trying to show confidence intervals of regressions Estimating confidence intervals from regression 1. This example is in response to a question from Mike Bailey 2. Given two predictors, what is their shared contribution 3. See the Venn diagram from Mike 4. We do this with a sample data set attitude 5. And then do some analyses 6. The case we use is actually an example of a suppressor, so these are expressed in standardized regression coefficients. 7. We then bootstrap them to show their confidence intervals ## Use the attitude data set from core R ```{r} dim(attitude) describe(attitude) lowerCor(attitude) ``` # Apply the standard 'lm' function ```{r} summary(lm(rating ~ complaints + privileges,data=attitude)) ``` ## Compare this to the `setCor' function ```{r} mod <-setCor(rating ~ complaints + privileges,data=attitude,std=FALSE) mod ``` ## Standardized regression ```{r} mod <-setCor(rating ~ complaints + privileges,data=attitude) mod names(mod) ``` Mike supplied a Venn Diagram of overlapping prdictors a is the unique part of X1 and Y b is the unique part of X2 and Y c is overlapping part of X1 , X2 and Y. The question is what is c? ```{r} r <- lowerCor(attitude[1:3]) a_c = r[1,2]^2 #the variance of a + c is rˆ2 of X1 with Y b_c <- r[1,3]^2 #the variance in b + c is rˆ2 of X2 with Y R2 <- mod$R2 #from before a_b_c <- R2 #the total Rˆ2 c <- (R2 - (a_c + b_c))/2 #From the Venn diagram a_c b_c a_b_c c ``` ### What are the confidence intervals of c? We need to do a bootstrap to find this. ## First make a small function to find the coefficients we want ```{r} find_c <- function(formula, data) { model <- setCor(formula,data=data,plot=FALSE) r <- cor(model$data[,-1]) a_c = r[1,2]^2 b_c <- r[1,3]^2 a_b_c <- model$R2 c <- (R2 - (a_c + b_c))/2 betas<- model$coefficients result <- list(R2=model$R2,coeffs = betas,c=c) return(result)} ``` # try it ```{r} find_c (rating ~ complaints + privileges,data=attitude) #test it ``` # Create a bootstrap function to call 'find_c' multiple times ```{r} boot.function <- function(formula, data, n.iterations) { result <- list() nobs <- NROW(data) for(i in 1:n.iterations ){cases <- sample(nobs,nobs,replace=TRUE) boot.data <- data[cases,] result[[i]] <- find_c(formula, data=boot.data) } rn <- rownames(result[[1]][["coeffs"]]) nc <- NROW(result[[1]][["coeffs"]]) + 2 result.df <- data.frame(matrix(unlist(result),ncol=nc,byrow=TRUE)) colnames(result.df) <- c("R2", rn, "c term") return(result.df) } ``` #now run it (iterate 1000 times) ```{r} temp <- boot.function(rating ~ complaints + privileges,attitude,1000) describe(temp) ```