--- title: '350 Week 8: Writing Functions' author: "William Revelle" date: "05/08/2023" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) options(width=110) ``` See the accompanying slides at https://personality-project.org/courses/350/350.manipulating.data.pdf and https://personality-project.org/courses/350/350.functions.pdf # psych and psychTools have been updated (to 2.3.5) Get the most recent version The next lines (install.packages) is commented out, but you need to do it at least once. The ```{r} #install.packages("psych",repos="http://personality-project.org/r", type="source") #install.packages("psychTools",repos="http://personality-project.org/r", type="source") library(psych) #make it active library(psychTools) sessionInfo() #psych should be 2.3.5 and psychTools should be 2.3.5 ``` # Two functions have been added to psychTools `vJoin` and `recode` ## vJoin (to join two objects by variable names) ```{r vJoin} x <- matrix(1:30, ncol=5) y <- matrix(1:40, ncol=8) x y xy <- vJoin(x,y) xy headTail(xy, top=2, bottom=2) XY <- vJoin(x,y, cnames=FALSE) XY ``` ## Examine `vJoin` ```{r} ?vJoin #to see the help file vJoin #to see the code ``` ## Another example (with variable names already) ```{r} temp <- bfi[1:10,1:6] #a small subset of bfi temp2 <- bfi[5:15,4:8] #an overlapping subset temp12 <- vJoin(temp,temp2) temp12 ``` # Recoding data If some data are character with levels that do not make sense in terms of some ordinal variable, we can recode them. First we create a file with levels of education that we think make sense ```{r} education <- cs(high_school, some_college, college, in_graduate_school, graduate_degree) temp <- recode(bfi,where=27, isvalue=1:5,newvalue=education) bfi[50:70,27] #show the education levels temp[50:70,27] #they correspond table(bfi[27]) table(temp[27]) #but these are not ordered by the numeric level #convert to numeric ``` ### When we describe the data, it is even worse ```{r} describe(bfi[26:28]) describe(temp[26:28]) temp2 <- char2numeric(temp) describe(temp2[26:28]) ``` How can recode these to make sense? ```{r} new.education <- c(3,5,1,4,2) temp3 <- recode(temp2,27,isvalue=1:6, newvalue=new.education) describe(temp3[26:28]) ``` # Writing functions Three basic approaches to using `R` 1) Do everything from console -- run each line as you think about it. Basic step through your operations, use `R` like a desk calculator Try a command (calling some functions from a package), 2) Use the text window and write out a few lines at a time and then run them (ala what we have been doing in class) Keep a list of your most useful set of lines Annotate them to remind you what you have done Use RMarkdown to annotate your scripts and to show your output 3) Write little (initially) functions to which you can pass parameters and make into your own library of useful functions You can then `source` these functions when you do your work,. ## Some basic functions to help you get and move around your data `fn <- file.choose()` #dynamically open your directory to find a file `file.choose(new=TRUE)` create a new file. Uses the normal computer search functions, but the creates a new file. This might be just available in regular R on a Mac, but not RStudio. `read.table(filename)` Read tabular input from the file location \texttt{fn}. `read.csv(filename)` reads a comma separated file `read.delim(filename)` reads a tab delimited file `load(filename)` "loads" objects from an .Rds file `read.file(filename)`* combines \pfun{file.choose} and appropriate read command (e.g., \pfun{read.table}, \pfun{read.csv}, \pfun{load}, etc.). ### Learn how functions work by examining the code in functions One way to learn coding is to read functions. Try to understand how they work. As an example of reading files and using functions, lets look at the `read.file` function. Note that default values are specified but could be changed. ## Examine the effect of two functions (file_ext, switch) There are a number of different ways to read files, these depend upon the specific type of file. We look for many possible suffixes. Others can easily be added if we think we need to add them. We use a function (`file_ext` ) from the `tools' package. This finds out what the suffix (.csv, .sav, .txt, ...) is and then is used in the `switch' function. ```{r} read.file #lets look at what it does ``` `read.file` can be used for local files or remote locations. `file.choose` will search your directory for a file. file_ext then examines that file name to identify what it is. We do not run this function because it is dynamic and requires user input. ```{r} library(tools) file.name <- "/Users/WR/Box Sync/pmc_folder/Courses.18/350/datasets/cars.rda" file.name2 <- "/Users/WR/Box Sync/pmc_folder/Courses.18/350/datasets/glbwarm.sav" file.name3 <- "/Users/WR/Box Sync/pmc_folder/Courses.18/350/datasets/simulation.txt" suffix <- file_ext(file.name) suffix2 <- file_ext(file.name2) suffix3 <- file_ext(file.name3) suffix; suffix2; suffix3 #we can string several lines together using ; # now read these files cars <- read.file(file.name) dim(cars) #glb <- read.file(file.name2) #dim(glb) ``` # Basic descriptives of a file ```{r} ls() #show the objects in my workspace #rm(list=ls()) #CAREFUL, this will remove all the objects in your workspace! dim(sat.act) #how many rows and columns in the sat.act object names(sat.act) colnames(sat.act) head(sat.act) #show the first 5 lines tail(sat.act) #show the last lines headTail(sat.act) #combine head and tail mod <- lm(SATV ~ gender + age + education, data =sat.act) length(mod) #how many elements are in mod? names(mod) #what are their names str(mod) #what is the structure of mod summary(mod) #summary returns summary statistics appropriate for the object summary(sat.act) #but they are not very useful summaries for data describe(sat.act) #more useful summary statistics table(sat.act$education) #break this down for one column table(sat.act[c("education","gender")] ) #do the two way table t <- table(sat.act[c("education","gender")] ) #do it again, but save it as an object chisq.test(t) #do the chi square on this table ``` ## Creating matrices and data frames Two of the most useful data structures are matrices and data.frames. Matrices are all of one type (logical, character, numeric), while data frames can be mixtures of all data types. Create a small matrix with the first 1,000 digits. ```{r} x <- matrix(1:1000,ncol=10) y <- matrix(1:1000,nrow=100,byrow=TRUE) #specify the number of rows, headTail(x) headTail(y) #note how the elements are in a different order if we specify byrow xy <- cbind(x,y) #column bind them dim(xy) #what are the dimensions yx <- rbind(x,y) #row bind the data sets dim(yx) headTail(xy) headTail(yx) ``` Matrices can be turned into data frames if we want ```{r} x.df <- data.frame(x) headTail(x.df) ``` But we can do things to dataframes we can not do to matrices ```{r} condition <- gl(2, 8, labels = c("Control", "Treat")) x.df <- data.frame(condition,results=rnorm(16)) x.df describe(x.df) nrow(x.df) #how many rows ncol(x.df) #how many columns ``` ## Some basic statistics ```{r} mean(x) mean(y) colMeans(x) colMeans(y) #use the apply function, applied to the columns of the x cm <- apply(x,2,mean) #same as column mean cm ssd <- apply(x,2,sd) #column standard deviations round(ssd,digits=2) smed <- apply(x,2,median) smed #show the column medians ``` What is the structure of the result of an apply operation? ```{r} x <- matrix(1:1000,ncol=10) "dumbfunction" <- function(x) { #this one gives bad results? why? nvar <- ncol(x) nsub <- nrow(x) mean <- rep(NA,nvar) sum <- 0 for(j in 1:nvar) { for (i in 1:nsub) { sum <- sum+ x[i,j] } mean[j] <- sum/nsub } print(mean) } dumbfunction(x) #why are these results non-sensical? ``` ## Examine the results. Do they pass a sanity test. The previous results don't make any sense. For how can the means exceed the values. We need to move where we zero out the counting. ```{r} "notquitsodumb" <- function(x) { #this one gives better results? why? nvar <- ncol(x) nsub <- nrow(x) mean <- rep(NA,nvar) for(j in 1:nvar) { sum <- 0 #we set the counter to 0 for each variable for (i in 1:nsub) { sum <- sum+ x[i,j] } mean[j] <- sum/nsub } print(mean) } notquitsodumb (x) ``` # Simulation study We want to compare the cross validities of using unit weighted scales (that is to say, just add them up) versus factor weighted scales (requires factor analysis and then uses the weights from the factor loadings to find the scores.) We are interested in what happens if we apply the same weights to a different sample. We do this 100 times! 1) the \pfun{spi} data set has 4,000 observations with 10 criteria and 135 personality items 2) The Neuroticism scale of 14 items can be used as an example 3) For each of 100 random samples of size 200 from the 4000, Form unit weighted scores Form factor scores (based upon a one dimensional factor analysis) Find the validities of these scores for the 10 criteria Cross validate on the other 3800 subjects 4) Pool these 100 samples 5) Graph the results ## Simulation of 100 replications of scoring ```{r} #use the spi data set and do it for neurocitsm N.items <- selectFromKeys(spi.keys["Neuro"]) spi.N <- spi[N.items] N.keys <- spi.keys["Neuro"] #spi.ext.crit <- data.frame(spi[,1:10],spi.E) n.obs <- 200 results <- list() for(i in 1:100) { ss <- sample (4000,n.obs,replace=FALSE) f.ss <- fa(spi.N [ss,]) #find a one factor solution and score it scores.ss <- data.frame(unit=scoreFast(N.keys,spi.N[ss,]),factor=f.ss$scores) validities <- cor(scores.ss,spi[ss,1:10],use="pairwise") cross <- data.frame(unit.cross=scoreFast(N.keys,spi.N[-ss,]),factor.cross = predict(f.ss,spi.N[-ss,]) ) cross.validities <- cor(cross,spi[-ss,1:10],use="pairwise" ) results[[i]]<-c(validities,cross.validities) } summary <- matrix(unlist(results),ncol=40,byrow=TRUE) colnames(summary) <- c(paste0("unit",colnames(spi)[1:10]) , paste0("factor",colnames(spi)[1:10]),paste0("unit.cross",colnames(spi)[1:10]), paste0("factor.cross",colnames(spi)[1:10])) describe(summary) #for textual output error.dots(summary,head=20,tail=20,main ="Unit weighted and factor scores derivation and cross validation for Neuroticism (with 95% confidence )") #show it abline(v=0) ```