--- title: "350.week10 Advanced programming" author: "William Revelle" date: "05/20/2024" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) options(width=100) ``` # psych and psychTools packages have been updated to 2.4.4 Two new functions have been added to `psychTools`. These are nice examples for today's class. Check the news to see what has changed. ```{r news} packageDate("psychTools") #should be >= 05/10/24 packageDate("psych" # should be >= 05/17/24 ) news(Version =="2.4.4",package="psych") # I have already installed this so I am commenting it out #install.packages("psych",repos="https://personality-project.org/r",type="source") ``` Make `psych` and `psychTools` active. ```{r psych} library(psych) library(psychTools) ``` # Merging and file manipulation It is frequently necessary to combine information from different files into one unified object. The `lookupFromKeys` function is an example of this. We have a list of item scoring keys (a *keyslist*) which tells us which items are to be scored from a data file. We have another file, an item *dictionary* which has the content of the items given. We have yet another file which may be created dynamically using either `scoreItems` or `scoreOverlap` and is the correlation of each item with each scale. We want to examine the item content of each scale, and include the item by scale correlation. Thus, we need to combine the keys, the dictionary, and the correlations. Lets first look at the result, and then examine how the function works. ```{r lookupfromKeys} bfi.keys #show the keys for the bfi bfi.scores <- scoreItems(bfi.keys,bfi)#find the scores names(bfi.scores) #what are the objects included? bfi.overlap <- scoreOverlap(bfi.keys,bfi)#find the correlations corrected for overlap #compare the two sets of correlations combined.df <- data.frame(raw=bfi.scores$item.cor,corrected= bfi.overlap$item.cor) difference <- combined.df[1:5] - combined.df[6:10] round(difference,2) #show the differences #now, lookup the items from the dictionary and combine them with the correlations lookupFromKeys(bfi.keys,dictionary=bfi.dictionary[1:3],cors=bfi.overlap$item.cor,n=5) ``` ## How does this function work? Lets examine it Just ask for it by name. This unfortunately does not include the comments. I have copied the function from the source file to include some comments. ```{r} #adjusted 11/15/20 to add correlations if provided "lookupFromKeys" <- function(keys.list,dictionary,n=20,cors = NULL,sort=TRUE,suppress.names=FALSE,digits=2){ n.scales <- length(keys.list) results <- item.cors <- result.df <- list() for(i in 1:n.scales) { list.name <- names(keys.list[i]) list.i <- keys.list[[i]] keys <- rep(1,length(list.i))[1:(min(n,length(list.i)))] neg <- grep("-", list.i[1:(min(n,length(list.i)))]) #find the negative keyed items keys[neg] <- -1 select <- sub("-", "", list.i) #get rid of negative signs results[[i]] <- lookup(select[1:(min(n,length(list.i)))],dictionary) if(!is.null(rownames(results[[i]])[keys < 0])) rownames(results[[i]])[keys < 0] <- paste0(rownames(results[[i]])[keys<0],"-") #put them in if(!is.null(cors)) { item.cors[[i]] <- round(cors[select[1:(min(n,length(select)))],i],digits=digits) result.df[[i]] <- data.frame(results[[i]],cors=item.cors[[i]]) if(sort) { ord <- order(abs(item.cors[[i]]),decreasing=TRUE) result.df[[i]] <- result.df[[i]][ord,] } } else {result.df[[i]] <- data.frame(results[[i]])} #results[[i]] <- c(results[[i]],cors= round(cors[select[1:n],i],digits=digits)) if(suppress.names) names(results[[i]]) <- "" # names(results[i]) <- list.name } names(result.df) <- names(keys.list) return(result.df)} #this function also calls lookup #lookup which x's are found in y[c],return matches for y[] "lookup" <- function(x,y,criteria=NULL,keep.na=FALSE) { if (is.null(criteria)) {temp <- match(x,rownames(y))} else { temp <- match(x,y[,criteria])} if(any(!is.na(temp))) { y <- (y[temp[!is.na(temp)],,drop=FALSE]) } else {y <- NA} return(y)} ``` ## We notice that `lookupFromKeys` calls some other functions -- how do they work? `grep` is one of the `base` functions for *Pattern Matching and Replacement*. (grep or *global regular expression* search ) According to its help page "grep, grepl, regexpr, gregexpr and regexec search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results. `sub` and `gsub` perform replacement of the first and all matches respectively." The call to `grep` is: grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE) grep(value = FALSE) returns a vector of the indices of the elements of x that yielded a match (or not, for invert = TRUE). This will be an integer vector unless the input is a long vector, when it will be a double vector. That is, `grep` searches the object `x` for the pattern `pattern` and returns the value TRUE if it finds it, FALSE if it does not. If x is a vector, then `grep` returns a vector of where in x it found the pattern. ### demonstrate `grep` ```{r grep} set.seed(42) #so you get the same results x <- sample(10,20,replace=TRUE) x #show the values grep(10,x) #3 locations have value 10 grep(9,x) #and 4 location have value 9 ``` and the help menu for `sub` says `sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)' `sub` and `gsub` return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding). If useBytes = FALSE a non-ASCII substituted result will often be in UTF-8 with a marked encoding (e.g., if there is a UTF-8 input, and in a multibyte locale unless fixed = TRUE). Such strings can be re-encoded by enc2native. ### show the use of `sub` and `gsub` ```{r sub} x y <- sub(9,"hi",x) #converts to character string y #but why did it replace all of them? #x was a vector x.string <- "This is a very long and convoluted sentence. We want to search it for all occasions of `a` and then change them" sub('a','A', x.string) #just the first one is changed gsub('a','A', x.string) #they are all changed (*global sub*) ``` ## We also use the `lookup` function from `psych` ```{r lookup} lookup ``` This calls `match`, another `base` function. According to its help page, `match` match returns a vector of the positions of (first) matches of its first argument in its second. %in% is a more intuitive interface as a binary operator, which returns a logical vector indicating if there is a match or not for its left operand. Try it ```{r match} x match(10,x) #yes, 10 is in x (The 5th position) 10 %in% x #it is still in x x %in% 10 #and now we get a vector of where in x it is ``` ## Here is the function with comments ```{r eval=FALSE} #adjusted 11/15/20 to add correlations if provided "lookupFromKeys" <- #the function name function(keys.list,dictionary,n=1,cors = NULL,suppress.names=FALSE,digits=2){ #the parameters n.scales <- length(keys.list) #dynamically find out how many scales to work on results <- item.cors <- result.df <- list() #make up some lists to store the output for(i in 1:n.scales) { #a for loop will repeat this next section list.name <- names(keys.list[i]) #get the name of the next list list.i <- keys.list[[i]] #grab the next set of keys keys <- rep(1,length(list.i))[1:(min(n,length(list.i)))] #create a dummy key of 1s neg <- grep("-", list.i[1:n]) #find which items are negatively keyed keys[neg] <- -1 #change those that are negative to -1 select <- sub("-", "", list.i) #drop all - signs from the list.i results[[i]] <- lookup(select[1:n],dictionary) #find the n items in the dictionary if(!is.null(rownames(results[[i]])[keys < 0])) rownames(results[[i]])[keys < 0] <- paste0(rownames(results[[i]])[keys<0],"-") #put the negative sign back in #these next few lines see if we want to add the correlations if(!is.null(cors)) { item.cors[[i]] <- round(cors[select[1:n],i],digits=digits) result.df[[i]] <- data.frame(results[[i]],cors=item.cors[[i]])} else {result.df[[i]] <- data.frame(results[[i]])} #results[[i]] <- c(results[[i]],cors= round(cors[select[1:n],i],digits=digits)) if(suppress.names) names(results[[i]]) <- "" #do we want the names of the keys # names(results[i]) <- list.name } #end of our loop names(result.df) <- names(keys.list) #name the resulting data frame return(result.df)} #return the values found ``` ## Show this function with the `bfi` items and dictionary ```{r} lookupFromKeys(bfi.keys, bfi.dictionary[1:3],n=2)#only show the first 3 columns ``` ## An improvement added two years ago (suggested by See Young Myaeng) It would be nice to show the correlations of these items with their associated scale. This is done by combining the output from the `scoreItems` or `scoreOverlap` function. This was the example shown earlier. ```{r} lookupFromKeys(bfi.keys,dictionary=bfi.dictionary[1:3],cors=bfi.overlap$item.cor,n=5) ``` ## Another way to improve the function Now that we have seen the output of `lookupFromKeys` with the correlation information, it is appealling to change the function so that it will sort the items in each scale by the absolute value of the correlations. To do this, we take our function and change it a little bit. We will use the `order` function. `order` returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments. sort.list does the same, using only one argument. ```{r order} y <- order(x) #find the order y #show them x[y] #reorganize x in terms of this order z <- order(x,decreasing=TRUE) x[z] #show them in decreasing order ``` See the examples for how to use these functions to sort data frames, etc. Also see dfOrder in `psychTools` ```{r better lookup} #adjusted 11/15/20 to add correlations if provided "lookupFromKeys2" <- #the function name (we use this name while testing it) function(keys.list,dictionary,n=1,cors = NULL,sort=TRUE, suppress.names=FALSE,digits=2){ # add sort to option in the parameter list n.scales <- length(keys.list) #dynamically find out how many scales to work on results <- item.cors <- result.df <- list() #make up some lists to store the output for(i in 1:n.scales) { #a for loop will repeat this next section list.name <- names(keys.list[i]) #get the name of the next list list.i <- keys.list[[i]] #grab the next set of keys keys <- rep(1,length(list.i))[1:(min(n,length(list.i)))] #create a dummy key of 1s neg <- grep("-", list.i[1:n]) #find which items are negatively keyed keys[neg] <- -1 #change those that are negative to -1 select <- sub("-", "", list.i) #drop all - signs from the list.i results[[i]] <- lookup(select[1:n],dictionary) #find the n items in the dictionary if(!is.null(rownames(results[[i]])[keys < 0])) rownames(results[[i]])[keys < 0] <- paste0(rownames(results[[i]])[keys<0],"-") #put the negative sign back in #this next few lines sees if we want to add the correlations if(!is.null(cors)) { item.cors[[i]] <- round(cors[select[1:n],i],digits=digits) result.df[[i]] <- data.frame(results[[i]],cors=item.cors[[i]]) if(sort) { ord <- order(abs(item.cors[[i]]),decreasing=TRUE) result.df[[i]] <- result.df[[i]][ord,] } } else {result.df[[i]] <- data.frame(results[[i]])} #results[[i]] <- c(results[[i]],cors= round(cors[select[1:n],i],digits=digits)) if(suppress.names) names(results[[i]]) <- "" #do we want the names of the keys # names(results[i]) <- list.name } #end of our loop names(result.df) <- names(keys.list) #name the resulting data frame return(result.df)} #return the values found ``` ## Now use this improved function on our example ```{r} lookupFromKeys2(bfi.keys, dictionary=bfi.dictionary[1:3], cors=bfi.scores$item.cor,n=5) ``` ## Are there other ways we can improve this function? Once you get a basic function to work, it is tempting (addictively?) to keep making it better. This is partly in response to user requests. In fact, the current version of `psychTools` has include the improved version since 2020. # More useful functions for data presentation ## matSort will organize a correlation matrix by some structural variable ```{r} R <- lowerCor(sai[4:23]) corPlot(R,main="Alabama need not come first",cex=.4) #doesn't have any clear order f2 <- fa(R,2) Rs <- matSort(R,f2) corPlot(Rs,xlas=3,main='Sorted by factor loadings', cex=.4) ``` Lets look at how `matSort` works. ```{r} #Two alternative names, matSort is the prefered one. "mat.sort" <- "matSort" <- function(m,f=NULL) { if (is.null(f) ) {f <- fa(m) } #we gave it a factor matrix if(is.list(f) && (!is.null(loadings(f)))) {load <- loadings(f)} else {load <- f} load <- as.matrix(load) nitems <- NROW(load) nfactors <-NCOL(load) loads <- data.frame(item=seq(1:nitems),cluster=rep(0,nitems),unclass(load)) #first sort them into clusters #first find the maximum for each row and assign it to that cluster loads$cluster <- apply(abs(load),1,which.max) ord <- sort(loads$cluster,index.return=TRUE) loads[1:nitems,] <- loads[ord$ix,] rownames(loads)[1:nitems] <- rownames(loads)[ord$ix] #now sort column wise #now sort the loadings that have their highest loading on each cluster items <- table(loads$cluster) #how many items are in each cluster? first <- 1 item <- loads$item for (i in 1:length(items)) { if(items[i] > 0 ) { last <- first + items[i]- 1 ord <- sort(abs(loads[first:last,i+2]),decreasing=TRUE,index.return=TRUE) loads[first:last,3:(nfactors+2)] <- load[item[ord$ix+first-1],] loads[first:last,1] <- item[ord$ix+first-1] rownames(loads)[first:last] <- rownames(loads)[ord$ix+first-1] first <- first + items[i] } } item.order <- loads[,1] m <- m[item.order,item.order] return(m) } ``` ## Two new functions that continue showing programming `selectBy` and `splitBy` ```{r} dim(bfi) small <- selectBy(bfi, 'gender=2 & age < 35 & education > 3') dim(small) headTail(small, from=21) #just show the last 8 columns ``` How does this function work? Look at it. It uses `grepl` and other functions. ```{r} #written May 20, 2023 #copied the code from source "selectBy" <- function(x,by) {#use a quasi formula input by <- gsub(" ","", by) #this removes the spaces if(grepl("\\|",by)) { AND <- FALSE bb <- unlist(strsplit(by,"\\|"))} else { #note to search for a | we have to escape it! AND <- TRUE if(grepl(",",by)) { bb <- unlist(strsplit(by,","))} else { bb <- unlist(strsplit(by,"&"))} } n.by <- length(bb) by <- isvalue <- notvalue <- lessthan <- morethan <- matrix(NA,ncol= n.by) eq <- grep("=",bb) #find which operators were used lt <- grep("<",bb) #returns a vector gr <- grep(">",bb) ne <- grep("!=",bb) #prepare the relevant search parameters if(length(eq ) >0) {temp <- unlist(strsplit(bb[eq],"=")) by[eq] <- temp[1] isvalue[eq] <-as.numeric( temp[2]) } if( length(lt) >0) {temp <- unlist(strsplit(bb[lt],"<")) by[lt] <- temp[1] lessthan[lt] <- as.numeric( temp[2]) } if(length(gr) >0) {temp <- unlist(strsplit(bb[gr],">")) by[gr] <- temp[1] morethan[gr] <- as.numeric( temp[2]) } if(length(ne) >0) {temp <- unlist(strsplit(bb[eq],"!=")) by[ne] <- temp[1] notvalue[ne] <-as.numeric( temp[2]) } #make sure that the variable names are correct if(!all(by %in% colnames(x))) { cat("\n Offending variables are ",by[!by %in% colnames(x) ],"\n") stop("Variables specified do not match the variables in the data. \nFix the names and try again")} #do this on y which serves as pointers to x, rather than x #then combine the pointers for & (and) | (or) y <- matrix(TRUE,nrow=NROW(x), ncol=n.by) for(i in 1:length(by)) { if(!is.na(isvalue[,i])) y[,i] <- x[,by[i]]==isvalue[i] if(!is.na(notvalue[,i])) y[,i] <- (x[,by[i]]!= notvalue[i]) if(!is.na(lessthan[,i])) y[,i] <- (x[,by[i]]< lessthan[i]) if(!is.na(morethan[,i])) y[,i] <- (x[,by[i]] > morethan[i]) } if(AND ) {y <-apply(y,1,all)} else {y <- apply(y,1,any)} y[is.na(y) ] <- FALSE return(x[y,]) } ``` Another function, using much of the same logic as `selectBy` is to convert specified variables into dichotomous (0,1) codes. This is `splitBy`. Note that the programming took advantage of the code in `selectBy`. ```{r} #written May 20, 2023 "splitBy" <- function(x,by,new=FALSE) {#use a quasi formula input by <- gsub(" ","", by) #this removes the spaces bb <- unlist(strsplit(by,",")) n.by <- length(bb) by <- isvalue <- notvalue <- lessthan <- morethan <- matrix(NA,ncol= n.by) eq <- grep("=",bb) #find which operators were used lt <- grep("<",bb) #returns a vector gr <- grep(">",bb) ne <- grep("!=",bb) #prepare the relevant search parameters if(length(eq ) >0) {temp <- unlist(strsplit(bb[eq],"=")) by[eq] <- temp[1] isvalue[eq] <-as.numeric( temp[2]) } if( length(lt) >0) {temp <- unlist(strsplit(bb[lt],"<")) by[lt] <- temp[1] lessthan[lt] <- as.numeric( temp[2]) } if(length(gr) >0) {temp <- unlist(strsplit(bb[gr],">")) by[gr] <- temp[1] morethan[gr] <- as.numeric( temp[2]) } if(length(ne) >0) {temp <- unlist(strsplit(bb[eq],"!=")) by[ne] <- temp[1] notvalue[ne] <-as.numeric( temp[2]) } #do this on y which serves as pointers to x, rather than x #then combine the pointers for & (and) | (or) if(!all(by %in% colnames(x))) { cat("\n Offending variables are ",by[!by %in% colnames(x) ],"\n") stop("Variables specified do not match the variables in the data. \nFix the names and try again")} y <- matrix(TRUE,nrow=NROW(x), ncol=n.by) colnames(y) <- paste0(c(by),"2") for(i in 1:length(by)) { #check each value as true or false if(!is.na(isvalue[,i])) y[,i] <- x[,by[i]]==isvalue[i] if(!is.na(notvalue[,i])) y[,i] <- (x[,by[i]]!= notvalue[i]) if(!is.na(lessthan[,i])) y[,i] <- (x[,by[i]]< lessthan[i]) if(!is.na(morethan[,i])) y[,i] <- (x[,by[i]] > morethan[i]) } #convert to numeric y <- y +0 #convert TRUE FALSE to numeric if(new){return(y)} else {return(cbind(x,y))} } ``` Try this function, on the `bfi` data set again. Dichotomize the age variable into high and low, and education as high and low. ```{r} bfi2 <- splitBy(bfi,'age < 25, education > 3') lowerCor(bfi2[26:30]) #note that age2 1 means 1 describe(bfi2[26:30],skew=FALSE,range=FALSE) ``` % df2latex % dd \begin{table}[htpb]\caption{df2latex} \begin{center} \begin{scriptsize} \begin{tabular} {l r r r r r r r r r r r r r } \multicolumn{ 13 }{l}{ A table from the psych package in R } \cr \hline Variable & {vars} & {n} & {mean} & {sd} & {medin} & {trmmd} & {mad} & {min} & {max} & {range} & {skew} & {krtss} & {se}\cr \hline gender & 1 & 700 & 1.65 & 0.48 & 2 & 1.68 & 0.00 & 1 & 2 & 1 & -0.61 & -1.62 & 0.02 \cr education & 2 & 700 & 3.16 & 1.43 & 3 & 3.31 & 1.48 & 0 & 5 & 5 & -0.68 & -0.07 & 0.05 \cr age & 3 & 700 & 25.59 & 9.50 & 22 & 23.86 & 5.93 & 13 & 65 & 52 & 1.64 & 2.42 & 0.36 \cr ACT & 4 & 700 & 28.55 & 4.82 & 29 & 28.84 & 4.45 & 3 & 36 & 33 & -0.66 & 0.53 & 0.18 \cr SATV & 5 & 700 & 612.23 & 112.90 & 620 & 619.45 & 118.61 & 200 & 800 & 600 & -0.64 & 0.33 & 4.27 \cr SATQ & 6 & 687 & 610.22 & 115.64 & 620 & 617.25 & 118.61 & 200 & 800 & 600 & -0.59 & -0.02 & 4.41 \cr \hline \end{tabular} \end{scriptsize} \end{center} \label{default} \end{table}