---
title: "350.week10  Advanced programming"
author: "William Revelle"
date: "05/20/2024"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=100)
```

# psych and psychTools packages have been updated  to 2.4.4

Two new functions have been added to `psychTools`.  These are nice examples for today's class.

Check the news to see what has changed.
```{r news}
packageDate("psychTools") #should be >= 05/10/24
packageDate("psych" # should be >= 05/17/24
)
news(Version =="2.4.4",package="psych")
# I have already installed this so I am commenting it out
#install.packages("psych",repos="https://personality-project.org/r",type="source")
```

Make `psych` and `psychTools` active.
```{r psych}
library(psych)
library(psychTools)
```

# Merging and file manipulation

It is frequently necessary to combine information from different files into one unified object.  The `lookupFromKeys` function is an example of this. 

We have a list of item scoring keys (a *keyslist*) which tells us which items are to be scored from a data file. We have another file, an item *dictionary* which has the content of the items given.  We have yet another file which may be created dynamically using either `scoreItems` or `scoreOverlap` and is the correlation of each item with each scale. We want to examine the item content of each scale, and include the item by scale correlation.  Thus, we need to combine the keys, the dictionary, and the correlations.  Lets first look at the result, and then examine how the function works.

```{r lookupfromKeys}
bfi.keys  #show the keys for the bfi
bfi.scores <- scoreItems(bfi.keys,bfi)#find the scores
names(bfi.scores)  #what are the objects included?
bfi.overlap <- scoreOverlap(bfi.keys,bfi)#find the correlations corrected for overlap
#compare the two sets of correlations
combined.df <- data.frame(raw=bfi.scores$item.cor,corrected= bfi.overlap$item.cor)
difference <- combined.df[1:5] - combined.df[6:10]
round(difference,2) #show the differences

#now, lookup the items from the dictionary and combine them with the correlations
lookupFromKeys(bfi.keys,dictionary=bfi.dictionary[1:3],cors=bfi.overlap$item.cor,n=5)
```

## How does this function work?  Lets examine it

Just ask for it by name.  This unfortunately does not include the comments.  I have copied the function from the source file to include some comments.


```{r}
#adjusted 11/15/20 to add correlations if provided
 "lookupFromKeys" <- 
function(keys.list,dictionary,n=20,cors = NULL,sort=TRUE,suppress.names=FALSE,digits=2){
n.scales <- length(keys.list)
results <- item.cors <- result.df <- list()
for(i in 1:n.scales) {
  list.name <- names(keys.list[i])
  list.i <- keys.list[[i]]
   keys <- rep(1,length(list.i))[1:(min(n,length(list.i)))]
    neg <- grep("-", list.i[1:(min(n,length(list.i)))]) #find the negative keyed items
    keys[neg] <- -1
  select <- sub("-", "", list.i) #get rid of negative signs
  results[[i]] <- lookup(select[1:(min(n,length(list.i)))],dictionary)
  
 if(!is.null(rownames(results[[i]])[keys < 0]))  rownames(results[[i]])[keys < 0] <- paste0(rownames(results[[i]])[keys<0],"-") #put them in

 if(!is.null(cors)) { item.cors[[i]] <- round(cors[select[1:(min(n,length(select)))],i],digits=digits)
 result.df[[i]] <- data.frame(results[[i]],cors=item.cors[[i]])
 if(sort) {
      ord <- order(abs(item.cors[[i]]),decreasing=TRUE)
      result.df[[i]] <- result.df[[i]][ord,]
 }
 } else {result.df[[i]] <- data.frame(results[[i]])} #results[[i]] <- c(results[[i]],cors= round(cors[select[1:n],i],digits=digits))
  if(suppress.names) names(results[[i]]) <- ""
 
 # names(results[i]) <- list.name
}
names(result.df) <- names(keys.list)
return(result.df)} 


#this function also calls lookup

  #lookup which x's are found in y[c],return matches for y[]
 
 "lookup" <- 
function(x,y,criteria=NULL,keep.na=FALSE) {
if (is.null(criteria)) {temp <- match(x,rownames(y))} else {
     temp <- match(x,y[,criteria])}
     if(any(!is.na(temp))) {	
 y <- (y[temp[!is.na(temp)],,drop=FALSE]) } else {y <- NA}
return(y)}
```

## We notice that `lookupFromKeys` calls some other functions -- how do they work?

`grep` is one of the `base` functions for *Pattern Matching and Replacement*. (grep or *global regular expression* search ) According to its help page "grep, grepl, regexpr, gregexpr and regexec search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.

`sub` and `gsub` perform replacement of the first and all matches respectively."

The call to `grep` is:

     grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)
     
     grep(value = FALSE) returns a vector of the indices of the elements of x that yielded a match (or not, for invert = TRUE). This will be an integer vector unless the input is a long vector, when it will be a double vector.
 
 That is, `grep` searches the object `x` for the pattern `pattern` and returns the value TRUE if it finds it, FALSE if it does not.  If x is a vector, then `grep` returns a vector of where in x it found the pattern.  
 
 ### demonstrate `grep`
 
```{r grep}
 set.seed(42)  #so you get the same results
  x <- sample(10,20,replace=TRUE)
  x  #show the values
  grep(10,x)    #3 locations have value 10
  grep(9,x)     #and 4 location have value 9
```
 
and the help menu for `sub` says 

     `sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)'
    
    `sub` and `gsub` return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding). If useBytes = FALSE a non-ASCII substituted result will often be in UTF-8 with a marked encoding (e.g., if there is a UTF-8 input, and in a multibyte locale unless fixed = TRUE). Such strings can be re-encoded by enc2native.

### show the use of `sub` and `gsub`
```{r sub}
x
y <- sub(9,"hi",x)  #converts to character string
y   #but why did it replace all of them?
#x was a vector
x.string <- "This is a very long and convoluted sentence.  We want to search it for all occasions of `a` and then change them"
sub('a','A', x.string)  #just the first one is changed
gsub('a','A', x.string) #they are all changed (*global sub*)
```

## We also use the `lookup` function from `psych`

```{r lookup}
lookup
```

This calls `match`, another `base` function.  According to its help page, 

     `match` match returns a vector of the positions of (first) matches of its first argument in its second.

     %in% is a more intuitive interface as a binary operator, which returns a logical vector indicating if there is a match or not for its left operand.
     
Try it

```{r match}
x
match(10,x)   #yes, 10 is in x (The 5th position)
10 %in% x     #it is still in x
x %in% 10   #and now we get a vector of where in x it is
```


## Here is the function with comments

```{r eval=FALSE}

#adjusted 11/15/20 to add correlations if provided
 "lookupFromKeys" <-   #the function name
function(keys.list,dictionary,n=1,cors = NULL,suppress.names=FALSE,digits=2){ #the parameters
n.scales <- length(keys.list)  #dynamically find out how many scales to work on
results <- item.cors <- result.df <- list()  #make up some lists to store the output
for(i in 1:n.scales) {   #a for loop will repeat this next section 
  list.name <- names(keys.list[i])  #get the name of the next list
  list.i <- keys.list[[i]]  #grab the next set of keys
   keys <- rep(1,length(list.i))[1:(min(n,length(list.i)))]  #create a dummy key of 1s
    neg <- grep("-", list.i[1:n])  #find which items are negatively keyed
    keys[neg] <- -1    #change those that are negative to -1
  select <- sub("-", "", list.i)   #drop all - signs from the list.i
  results[[i]] <- lookup(select[1:n],dictionary)  #find the n items in the dictionary
 if(!is.null(rownames(results[[i]])[keys < 0]))  rownames(results[[i]])[keys < 0] <- paste0(rownames(results[[i]])[keys<0],"-")  #put the negative sign back in
#these next few lines see if we want to add the correlations
 if(!is.null(cors)) { item.cors[[i]] <- round(cors[select[1:n],i],digits=digits)
 result.df[[i]] <- data.frame(results[[i]],cors=item.cors[[i]])} else {result.df[[i]] <- data.frame(results[[i]])} #results[[i]] <- c(results[[i]],cors= round(cors[select[1:n],i],digits=digits))
  if(suppress.names) names(results[[i]]) <- ""   #do we want the names of the keys
 
 # names(results[i]) <- list.name
}    #end of our loop
names(result.df) <- names(keys.list)  #name the resulting data frame
return(result.df)}   #return the values found

```

## Show this function with the `bfi` items and dictionary

```{r}
lookupFromKeys(bfi.keys, bfi.dictionary[1:3],n=2)#only show the first 3 columns
```
## An improvement added two years ago (suggested by See Young Myaeng)

It would be nice to show the correlations of these items with their associated scale.
This is done by combining the output from the `scoreItems` or `scoreOverlap` function.
This was the example shown earlier.

```{r}
lookupFromKeys(bfi.keys,dictionary=bfi.dictionary[1:3],cors=bfi.overlap$item.cor,n=5)
```

## Another way to improve the function

Now that we have seen the output of `lookupFromKeys` with the correlation information, it is appealling to change the function so that it will sort the items in each scale by the absolute value of the correlations.  To do this, we take our function and change it a little bit.  We will use the `order` function.

     `order` returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments. sort.list does the same, using only one argument.
     
 ```{r order}
 y <- order(x)   #find the order
 y   #show them
 x[y]  #reorganize x in terms of this order
 z <- order(x,decreasing=TRUE)
 x[z]  #show them in decreasing order
 ```
 
See the examples for how to use these functions to sort data frames, etc. 
Also see dfOrder in `psychTools`


```{r better lookup}
#adjusted 11/15/20 to add correlations if provided
 "lookupFromKeys2" <-   #the function name  (we use this name while  testing it)
function(keys.list,dictionary,n=1,cors = NULL,sort=TRUE, suppress.names=FALSE,digits=2){ # add sort to option in the parameter list
n.scales <- length(keys.list)  #dynamically find out how many scales to work on
results <- item.cors <- result.df <- list()  #make up some lists to store the output
for(i in 1:n.scales) {   #a for loop will repeat this next section 
  list.name <- names(keys.list[i])  #get the name of the next list
  list.i <- keys.list[[i]]  #grab the next set of keys
   keys <- rep(1,length(list.i))[1:(min(n,length(list.i)))]  #create a dummy key of 1s
    neg <- grep("-", list.i[1:n])  #find which items are negatively keyed
    keys[neg] <- -1    #change those that are negative to -1
  select <- sub("-", "", list.i)   #drop all - signs from the list.i
  results[[i]] <- lookup(select[1:n],dictionary)  #find the n items in the dictionary
 if(!is.null(rownames(results[[i]])[keys < 0]))  rownames(results[[i]])[keys < 0] <- paste0(rownames(results[[i]])[keys<0],"-")  #put the negative sign back in

  #this next few lines sees if we want to add the correlations
 if(!is.null(cors)) { item.cors[[i]] <- round(cors[select[1:n],i],digits=digits)
 result.df[[i]] <- data.frame(results[[i]],cors=item.cors[[i]])
 if(sort) {
      ord <- order(abs(item.cors[[i]]),decreasing=TRUE)
      result.df[[i]] <- result.df[[i]][ord,]
 }
 
 } else {result.df[[i]] <- data.frame(results[[i]])} #results[[i]] <- c(results[[i]],cors= round(cors[select[1:n],i],digits=digits))
  if(suppress.names) names(results[[i]]) <- ""   #do we want the names of the keys
 
 # names(results[i]) <- list.name
}    #end of our loop
names(result.df) <- names(keys.list)  #name the resulting data frame
return(result.df)}   #return the values found
```

## Now use this improved function on our example

```{r}
lookupFromKeys2(bfi.keys, dictionary=bfi.dictionary[1:3], cors=bfi.scores$item.cor,n=5)
```

## Are there other ways we can improve this function?

Once you get a basic function to work, it is tempting (addictively?) to keep making it better. This is partly in response to user requests.

In fact, the current version of `psychTools` has include the improved version since 2020.

# More useful functions for data presentation

## matSort will organize a correlation matrix by some structural variable

```{r}
R <- lowerCor(sai[4:23])
corPlot(R,main="Alabama need not come first",cex=.4) #doesn't have any clear order 
f2 <- fa(R,2)
Rs <- matSort(R,f2)
corPlot(Rs,xlas=3,main='Sorted by factor loadings', cex=.4)
```

Lets look at how `matSort` works.

```{r}
#Two alternative names, matSort is the prefered one.
"mat.sort" <- "matSort" <- 
function(m,f=NULL) {
if (is.null(f) ) {f <- fa(m) } #we gave it a factor matrix
if(is.list(f) && (!is.null(loadings(f)))) {load <- loadings(f)}  else {load <- f}
    load  <- as.matrix(load)
 	nitems <- NROW(load)
 	nfactors <-NCOL(load)
  	loads <- data.frame(item=seq(1:nitems),cluster=rep(0,nitems),unclass(load))
#first sort them into clusters
      
  		#first find the maximum for each row and assign it to that cluster
  		 loads$cluster <- apply(abs(load),1,which.max)
 		 ord <- sort(loads$cluster,index.return=TRUE)
  		loads[1:nitems,] <- loads[ord$ix,]
 		rownames(loads)[1:nitems] <- rownames(loads)[ord$ix]
 		 
  #now sort column wise
  #now sort the loadings that have their highest loading on each cluster
  
  		items <- table(loads$cluster)   #how many items are in each cluster?
  		first <- 1
  		item <- loads$item
		for (i in 1:length(items)) {
		if(items[i] > 0 ) {
				last <- first + items[i]- 1
				ord <- sort(abs(loads[first:last,i+2]),decreasing=TRUE,index.return=TRUE)
   				loads[first:last,3:(nfactors+2)] <- load[item[ord$ix+first-1],]
   				loads[first:last,1] <- item[ord$ix+first-1]
   				rownames(loads)[first:last] <- rownames(loads)[ord$ix+first-1]
   		 		first <- first + items[i]  }
          		 }  
      item.order <- loads[,1]    		 
   m  <- m[item.order,item.order]
   return(m)
}
```

## Two new functions that continue showing programming

`selectBy` and `splitBy`

```{r}
dim(bfi)
small <- selectBy(bfi, 'gender=2 & age < 35 & education > 3')
dim(small)
headTail(small, from=21) #just show the last 8 columns
```

How does this function work?  Look at it. It uses `grepl` and other functions.

```{r}
#written May 20, 2023
#copied the code from source
"selectBy" <- function(x,by) {#use a quasi formula input

   by <- gsub(" ","", by)   #this removes the spaces
     
   if(grepl("\\|",by)) {  AND <- FALSE
                    bb <- unlist(strsplit(by,"\\|"))} else {  #note to search for a | we have to escape it!
     
                    AND  <- TRUE
                     if(grepl(",",by)) {
                       bb <- unlist(strsplit(by,","))} else {
                       bb <- unlist(strsplit(by,"&"))} }
                      
    n.by <- length(bb)
    by <- isvalue <- notvalue <- lessthan <- morethan <- matrix(NA,ncol= n.by)
   
    eq <- grep("=",bb)   #find which operators were used
    lt <- grep("<",bb)   #returns a vector
    gr <- grep(">",bb)
    ne  <- grep("!=",bb)
   #prepare the relevant search parameters
    if(length(eq ) >0)  {temp <- unlist(strsplit(bb[eq],"="))
               by[eq] <- temp[1]
               isvalue[eq] <-as.numeric( temp[2])
               } 
    if( length(lt)  >0)  {temp <- unlist(strsplit(bb[lt],"<"))
               by[lt] <- temp[1]
               lessthan[lt] <- as.numeric( temp[2])
               } 
     if(length(gr) >0)  {temp <- unlist(strsplit(bb[gr],">"))
               by[gr] <- temp[1]
               morethan[gr] <- as.numeric( temp[2])
               } 
      if(length(ne) >0)  {temp <- unlist(strsplit(bb[eq],"!="))
               by[ne] <- temp[1]
               notvalue[ne] <-as.numeric( temp[2])
               } 

#make sure that the variable names are correct
 if(!all(by %in% colnames(x)))  {
                                 cat("\n Offending variables are ",by[!by %in% colnames(x) ],"\n")
                                 stop("Variables specified do not match the variables in the data. \nFix the names and try again")}  
#do this on y which serves as  pointers to x, rather than x  #then combine the pointers for & (and) | (or)
y <- matrix(TRUE,nrow=NROW(x), ncol=n.by)

for(i in 1:length(by)) {
	if(!is.na(isvalue[,i])) y[,i] <-  x[,by[i]]==isvalue[i]
	if(!is.na(notvalue[,i])) y[,i] <- (x[,by[i]]!= notvalue[i])
	if(!is.na(lessthan[,i])) y[,i] <- (x[,by[i]]< lessthan[i])
	if(!is.na(morethan[,i])) y[,i] <- (x[,by[i]] >  morethan[i])
}
if(AND ) {y <-apply(y,1,all)} else {y <- apply(y,1,any)} 
y[is.na(y) ] <- FALSE
return(x[y,])
}

```

Another function, using much of the same logic as `selectBy` is to convert specified variables into dichotomous (0,1) codes.  This is `splitBy`.  Note that the programming took advantage of the code in `selectBy`.

```{r}
#written May 20, 2023
"splitBy" <- function(x,by,new=FALSE) {#use a quasi formula input

   by <- gsub(" ","", by)   #this removes the spaces
     
   bb <- unlist(strsplit(by,",")) 
                     
                      
    n.by <- length(bb)
    by <- isvalue <- notvalue <- lessthan <- morethan <- matrix(NA,ncol= n.by)
   
    eq <- grep("=",bb)   #find which operators were used
    lt <- grep("<",bb)   #returns a vector
    gr <- grep(">",bb)
    ne  <- grep("!=",bb)
   #prepare the relevant search parameters
    if(length(eq ) >0)  {temp <- unlist(strsplit(bb[eq],"="))
               by[eq] <- temp[1]
               isvalue[eq] <-as.numeric( temp[2])
               } 
    if( length(lt)  >0)  {temp <- unlist(strsplit(bb[lt],"<"))
               by[lt] <- temp[1]
               lessthan[lt] <- as.numeric( temp[2])
               } 
     if(length(gr) >0)  {temp <- unlist(strsplit(bb[gr],">"))
               by[gr] <- temp[1]
               morethan[gr] <- as.numeric( temp[2])
               } 
      if(length(ne) >0)  {temp <- unlist(strsplit(bb[eq],"!="))
               by[ne] <- temp[1]
               notvalue[ne] <-as.numeric( temp[2])
               } 
     
#do this on y which serves as  pointers to x, rather than x  #then combine the pointers for & (and) | (or)
if(!all(by %in% colnames(x)))  {
                                 cat("\n Offending variables are ",by[!by %in% colnames(x) ],"\n")
                                 stop("Variables specified do not match the variables in the data. \nFix the names and try again")}  

y <- matrix(TRUE,nrow=NROW(x), ncol=n.by)
colnames(y) <- paste0(c(by),"2")
for(i in 1:length(by)) { #check each value as true or false
	if(!is.na(isvalue[,i])) y[,i] <-  x[,by[i]]==isvalue[i]  
	if(!is.na(notvalue[,i])) y[,i] <- (x[,by[i]]!= notvalue[i])
	if(!is.na(lessthan[,i])) y[,i] <- (x[,by[i]]< lessthan[i])
	if(!is.na(morethan[,i])) y[,i] <- (x[,by[i]] >  morethan[i])
}
#convert to numeric
y <- y +0   #convert TRUE FALSE to numeric
if(new){return(y)} else {return(cbind(x,y))}
}

```

Try this function, on the `bfi` data set again.
Dichotomize the age variable into high and low, and education as high and low.

```{r}
bfi2 <- splitBy(bfi,'age < 25, education > 3')
lowerCor(bfi2[26:30]) #note that age2 1 means 1
describe(bfi2[26:30],skew=FALSE,range=FALSE)
```

% df2latex % dd 
\begin{table}[htpb]\caption{df2latex}
\begin{center}
\begin{scriptsize} 
\begin{tabular} {l r r r r r r r r r r r r r }
 \multicolumn{ 13 }{l}{ A table from the psych package in R } \cr 
 \hline Variable  &   {vars} &  {n} &  {mean} &  {sd} &  {medin} &  {trmmd} &  {mad} &  {min} &  {max} &  {range} &  {skew} &  {krtss} &  {se}\cr 
  \hline 
gender   &  1  &  700  &    1.65  &    0.48  &    2  &    1.68  &    0.00  &    1  &    2  &    1  &  -0.61  &  -1.62  &  0.02 \cr 
 education   &  2  &  700  &    3.16  &    1.43  &    3  &    3.31  &    1.48  &    0  &    5  &    5  &  -0.68  &  -0.07  &  0.05 \cr 
 age   &  3  &  700  &   25.59  &    9.50  &   22  &   23.86  &    5.93  &   13  &   65  &   52  &   1.64  &   2.42  &  0.36 \cr 
 ACT   &  4  &  700  &   28.55  &    4.82  &   29  &   28.84  &    4.45  &    3  &   36  &   33  &  -0.66  &   0.53  &  0.18 \cr 
 SATV   &  5  &  700  &  612.23  &  112.90  &  620  &  619.45  &  118.61  &  200  &  800  &  600  &  -0.64  &   0.33  &  4.27 \cr 
 SATQ   &  6  &  687  &  610.22  &  115.64  &  620  &  617.25  &  118.61  &  200  &  800  &  600  &  -0.59  &  -0.02  &  4.41 \cr 
 \hline 
\end{tabular}
\end{scriptsize}
\end{center}
\label{default}
\end{table}