--- title: "350.wk.6b Steps towards programming" author: "William Revelle" date: "05/01/24" output: html_document: default pdf_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) options(width=110) ``` ## Scripting procedures to do basic analysis Probably the most common use of R is just to do a set of analyses using a number of different functions. For instance, in analysing the data from a psychology experiment in psych 205, I prepared a few scripts that students could use. Similarly, to teach personality research, I could write out the scripts for simple scale construction. In ouline form these all follow the same concept get your data (using read.file or read.clipboard) describe the data (in terms of describe and perhaps pairs.panels) If forming scales, then create a set of scoring keys and then use scoreItems. If just one scale is needed, the alpha fuction will produce scores as well. Proceed to basic data analysis. ## When working on your own project, you could script the work you are planning on doing Even as you are collecting data for a study you are running, you can write the scripts ahead of time and test them out. That way, when your data are all collected, you are ready to go. ## Another procedure is to modify a known function to do what you want to do. Here I consider the `map` function from the package `maps`. Built into the package are the commands to draw maps of the world, the US, or the US by state and by county. ### Lets just try the example from the ?map help menu You must first install the `maps` package. Use the tools menu or enter the command `install.packages('maps')`. At that point you need to restart `R studio` ```{r map example} library(maps) #requires having the maps package map('usa') # national boundaries map('county', 'new jersey') # county map of New Jersey map('state', region = c('new york', 'new jersey', 'penn')) # map of three states map("state", ".*dakota", myborder = 0) # map of the dakotas map.axes() # show the effect of myborder = 0 ``` Here is an example of 2009 unemployment statistics ```{r} data(unemp) data(county.fips) # define color buckets colors = c("#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043") unemp$colorBuckets <- as.numeric(cut(unemp$unemp, c(0, 2, 4, 6, 8, 10, 100))) leg.txt <- c("<2%", "2-4%", "4-6%", "6-8%", "8-10%", ">10%") # align data with map definitions by (partial) matching state,county # names, which include multiple polygons for some counties cnty.fips <- county.fips$fips[match(map("county", plot=FALSE)$names, county.fips$polyname)] colorsmatched <- unemp$colorBuckets [match(cnty.fips, unemp$fips)] # draw map map("county", col = colors[colorsmatched], fill = TRUE, resolution = 0, lty = 0, projection = "polyconic") map("state", col = "white", fill = FALSE, add = TRUE, lty = 1, lwd = 0.2, projection="polyconic") title("unemployment by county, 2009") legend("topright", leg.txt, horiz = TRUE, fill = colors) ``` The basic help file gives a nice example, which we can then modify to examine something more timely. Lets examine COVID-19 cases by county. To do this, we take the example from the `map` help file and modify if to treat our data. ## First, a small function to do the graphic by state or county ```{r} require(maps) # you need to install this first library(psych) library(psychTools) #adapted from the help file for the map function but set up to do more current data map.data <- function(data,var="cases",select.date="2020-10-19", n=10, by.state=TRUE, fips = "fips",state=NULL) { require(maps) #make sure that we have the maps package installed and loaded gr <- colorRampPalette(c("red", "blue")) colramp <- gr(n) #set the color range #get the data if(!is.null(select.date)) { x <- subset(data,date ==select.date) #the name of the data file we are using } else {x <- data} qx <- quantile(x[,var],probs=(seq(0,1,1/n)),na.rm=TRUE) #figure out the quantiles for the data if(by.state){ values <- x[match(state.fips$fips,x$fips),var] } else { values <- x[match(county.fips$fips,x$fips),var] } values <- as.numeric(cut(values,qx,include.lowest=TRUE)) #convert to quantiles if(!by.state){map("county",state, col = colramp[values], fill = TRUE, resolution = 0, lty = 1, projection = "polyconic", main=paste(var), border="gray", lwd=0.3) map("state", state, col = "black", fill = FALSE, add = TRUE, lty = 1, lwd = 1.0, projection="polyconic") title(main=c(paste(var,"by county on "),select.date) ) } else { map("state",state, col = colramp[values], fill = TRUE, resolution = 0, lty = 0, projection = "polyconic") title(main=c(paste(var,"by state on "), select.date)) } } ``` # Try running this function with some data downloaded from the New York Times. ```{r} fn <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv" state.data <- read.file(fn) dim(state.data) map.data(state.data) #using the default values ``` ### How to improve this plot? Clearly, the more populated states have more cases. We need to correct the state wide data by state population to get a rate per 100,000. This will require getting population by state. We can get this from the unemp data set in the maps package. We can also plot at the county level, if we can get the data. The NY Times has the data, but part of the file is corrupted. I put a cleaned version on the class server. https://personality-project.org/courses/350/covid.county.csv ```{r county} fn <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv" county <- read.file(fn) map.data(county,by.state=FALSE,select.date= "2020-10-19") ``` ## An example of population by county from the help page ```{r} data(unemp) dim(unemp) describe(unemp) map.data(unemp,var="pop",select.date=NULL,by.state=FALSE) #now do it for unemployment map.data(unemp,var="unemp",select.date=NULL,by.state=FALSE) ``` ## Use the population data by county to correct the COVID plot In the previous description file, we saw that one of the variables is `FIPS`. FIPS are 'Federal Information Processing Standards'. 'Federal Information Processing Standards (FIPS), now known as Federal Information Processing Series, are numeric codes assigned by the National Institute of Standards and Technology (NIST). Typically, FIPS codes deal with US states and counties. US states are identified by a 2-digit number, while US counties are identified by a 3-digit number. For example, a FIPS code of 06071, represents California -06 and San Bernardino County -071. We need to `merge` the county level covid data with the county level population data. We do the merge by matching on the FIPS codes and using the `merge` function. See the help page for `merge` for many options. Merging of data frames is a very important operation. Here we merge population values for 3,218 counties into a data frame that has 250K observations. The merge column is the `fips` code. ```{r} dim(county) dim(unemp) colnames(county) colnames(unemp) county.pop <- merge(county,unemp,by="fips") dim(county.pop) colnames(county.pop) county.pop$casesbypop <- county.pop$cases/county.pop$pop dim(county.pop) #examine the merged result headTail(county) headTail(unemp) headTail(county.pop) ``` We now used the merged data to plot population adjust trends. ```{r} map.data(county.pop,var="casesbypop",by.state=FALSE,select.date= "2021-10-19") ``` ### lets show the time trends by showing the map for multiple months The nice thing about functions, instead of scripts, is that we can easily adjust them to make multiple calls. Here we show rates for multiple months. ```{r} dates <- c("2020-02-01","2020-03-01","2020-04-01","2020-05-01","2020-06-01","2020-08-01","2020-10-01","2020-12-01","2021-02-01","2021-04-01","2021-06-01","2021-08-01","2021-10-01","2021-12-01","2022-02-01","2022-04-01") for (i in 1:length(dates) ) { map.data(county.pop,var="casesbypop",by.state=FALSE,select.date= dates[i]) } ``` ### Do it again, but with finer resolution for the first few months ```{r} dates <- c("2020-02-01","2020-03-01","2020-03-08","2020-03-15","2020-03-22","2020-03-29","2020-04-01","2020-04-15","2020-05-01","2020-06-01","2020-08-01","2020-10-01","2020-12-01","2022-04-01") for (i in 1:length(dates) ) { map.data(county.pop,var="casesbypop",by.state=FALSE,select.date= dates[i]) } ``` ### One final tweak - show just Illinois ```{r} for (i in 1:length(dates) ) { map.data(county.pop,var="casesbypop",by.state=FALSE,select.date= dates[i], state="Illinois") } ``` # Steps towards writing a function Although one can write little scripts to do processes, it is better to write (and save) short functions to do operations that you find useful. Here we consider the development and testing of a function. (Although based upon the m2t function, we will go through the steps of development, and then add some improvements.) ## Identify a need Just write functions that don't already exist and that do things that you find useful. ## Finding t-tests values from means, standard deviations, and sample sizes. The conventional t.test function will perform t-tests on data and then reports the mean. But sometimes we have reports of results where we are told the mean(s) and the standard deviation(s) and we want to calculate the t statistic. William Gosset, publishing under the name `student' developed test for the likelyhood of a particular mean given the hypothesis of a mean of zero. $$t = \frac{\bar{x}}{se_x}$$ where the standard error of x, $se_x$, varied by sample size and the observed standard deviation, $sd_x$. $$se_x = sd_x/\sqrt{(n)}$$ Can we write a function to find t given $\bar{x}$, $sd_x$ and n. ## Writing the function ### First name the function and specify what inputs it will take. Make sure to add comments so that you will know what the function is doing. ```{r} #A function to find the t statistic given a mean, sd and sample size mean2t <- function(mean, sd, n) { #all functions start and finish with {} t <- mean / (sd/sqrt(n)) print(t) } ``` Test the function against other functions. We will see if we get the same values as we get from the t.test function on the *Cushny* data set. We first will describe the Cushny data set and then test several values. ```{r} library(psych) library(psychTools) describe(cushny) print(describe(cushny),digits=4) #we need more accuracy t.test(cushny[2]) mean2t(mean=4.00,sd=2.1024,n=10) t.test(cushny[6]) mean2t(mean=2.33,sd= 2.0022,n=10) #try another one ``` ## Improve the function to treat 2 means #A function to find the t statistic given a mean, sd and sample size ```{r} means2t <- function(mean, sd, n, mean2,sd2,n2) { #all functions start and finish with {} if(missing(mean2)) {t <- mean / (sd/sqrt(n)) #the case of just one mean df <- n-1 } else { #the normal case pooled <- (sd^2 * n + sd2^2*n2)/(n+n2) sepooled <- sqrt(pooled/n + pooled/n2) t <- (mean - mean2)/sepooled df <- n + n2 -2 } cat("\n t = ",t, "df =", df, "\n") } ``` ## test the function against the sleep data ```{r} data(sleep) with(sleep, t.test(extra[group == 1], extra[group == 2],equal.var=TRUE)) # print(describeBy(sleep,"group"),digits=5) #we need this precision ``` ## test the function ```{r} means2t(.75,1.78901,10,2.33,2.00225,10) #we add the extra precision to match ``` ## Make it more general # A function to find the t statistic given a mean, sd and sample size ```{r} means2t <- function(mean, sd, n, mean2,sd2,n2,pooled=TRUE) { #all functions start and finish with {} if(missing(mean2)) {t <- mean / (sd/sqrt(n)) #the case of just one mean df <- n-1 } else { #the normal case if(pooled ) { pooled <- (sd^2 * (n-1) + sd2^2*(n2-1))/(n+n2-2) se <- sqrt(pooled/n + pooled/n2)} else {se <- sqrt(sd^2/n + sd2^2/n2) } t <- (mean - mean2)/se df <- n + n2 -2 } } #the function ends here ``` # Now run the function and compare it to the known value from the t-test function ```{r} means2t(.75,1.78901,10,2.33,2.00225,10) ``` # Something is missing We need to report the answer ```{r} means2t <- function(mean, sd, n, mean2,sd2,n2,pooled=TRUE) { #all functions start and finish with {} if(missing(mean2)) {t <- mean / (sd/sqrt(n)) #the case of just one mean df <- n-1 } else { #the normal case if(pooled ) { pooled <- (sd^2 * (n-1) + sd2^2*(n2-1))/(n+n2-2) se <- sqrt(pooled/n + pooled/n2)} else {se <- sqrt(sd^2/n + sd2^2/n2) } t <- (mean - mean2)/se df <- n + n2 -2 } cat("\n t = ",t, "df =", df) } #the function ends here # The function also should return the probability of the t ```{r} means2t <- function(mean, sd, n, mean2,sd2,n2,pooled=TRUE) { #all functions start and finish with {} if(missing(mean2)) {t <- mean / (sd/sqrt(n)) #the case of just one mean df <- n-1 } else { #the normal case if(pooled ) { pooled <- (sd^2 * (n-1) + sd2^2*(n2-1))/(n+n2-2) se <- sqrt(pooled/n + pooled/n2)} else {se <- sqrt(sd^2/n + sd2^2/n2) } t <- (mean - mean2)/se df <- n + n2 -2 } prob1 <- 2*pt(abs(t),df,lower=FALSE) result <- list(t=t,df=df,prob1=prob1) cat("\n t = ",t, "df =", df, " with probability = ",prob1,"\n") invisible(result) #return the values as well } ```