Scripting procedures to do basic analysis

Probably the most common use of R is just to do a set of analyses using a number of different functions.

For instance, in analysing the data from a psychology experiment in psych 205, I prepared a few scripts that students could use. Similarly, to teach personality research, I could write out the scripts for simple scale construction.

In ouline form these all follow the same concept

get your data (using read.file or read.clipboard)

describe the data (in terms of describe and perhaps pairs.panels)

If forming scales, then create a set of scoring keys and then use scoreItems.

If just one scale is needed, the alpha fuction will produce scores as well.

Proceed to basic data analysis.

When working on your own project, you could script the work you are planning on doing

Even as you are collecting data for a study you are running, you can write the scripts ahead of time and test them out. That way, when your data are all collected, you are ready to go.

Another procedure is to modify a known function to do what you want to do.

Here I consider the map function from the package maps. Built into the package are the commands to draw maps of the world, the US, or the US by state and by county.
### Lets just try the example from the ?map help menu

You must first install the maps package. Use the tools menu or enter the command install.packages('maps'). At that point you need to restart R studio

library(maps)   #requires having the maps package
map('usa')  # national boundaries

map('county', 'new jersey') # county map of New Jersey

map('state', region = c('new york', 'new jersey', 'penn'))  # map of three states

map("state", ".*dakota", myborder = 0)  # map of the dakotas
map.axes()              # show the effect of myborder = 0

Here is an example of 2009 unemployment statistics

 data(unemp)
  data(county.fips)

  # define color buckets
  colors = c("#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043")
  unemp$colorBuckets <- as.numeric(cut(unemp$unemp, c(0, 2, 4, 6, 8, 10, 100)))
  leg.txt <- c("<2%", "2-4%", "4-6%", "6-8%", "8-10%", ">10%")

  # align data with map definitions by (partial) matching state,county
  # names, which include multiple polygons for some counties
  cnty.fips <- county.fips$fips[match(map("county", plot=FALSE)$names,
    county.fips$polyname)]
  colorsmatched <- unemp$colorBuckets [match(cnty.fips, unemp$fips)]

  # draw map
  map("county", col = colors[colorsmatched], fill = TRUE, resolution = 0,
    lty = 0, projection = "polyconic")
  map("state", col = "white", fill = FALSE, add = TRUE, lty = 1, lwd = 0.2,
    projection="polyconic")
  title("unemployment by county, 2009")
  legend("topright", leg.txt, horiz = TRUE, fill = colors)

The basic help file gives a nice example, which we can then modify to examine something more timely. Lets examine COVID-19 cases by county. To do this, we take the example from the map help file and modify if to treat our data.

First, a small function to do the graphic by state or county

require(maps)  # you need to install this first
library(psych)
library(psychTools)
#adapted from the help file for the map function but set up to do more current data

map.data <- function(data,var="cases",select.date="2020-10-19", n=10, by.state=TRUE, fips = "fips",state=NULL) {
require(maps)    #make sure that we have the maps package installed and loaded
 gr <- colorRampPalette(c("red", "blue"))
 colramp <- gr(n)   #set the color range
 #get the data
 if(!is.null(select.date)) {
  x <- subset(data,date ==select.date) #the name of the data file we are using
  } else {x <- data}
  
 qx <- quantile(x[,var],probs=(seq(0,1,1/n)),na.rm=TRUE) #figure out the quantiles for the data
 
  if(by.state){
     values <- x[match(state.fips$fips,x$fips),var]
    }  else {
    values <- x[match(county.fips$fips,x$fips),var]  
 }

values <- as.numeric(cut(values,qx,include.lowest=TRUE)) #convert to quantiles

  if(!by.state){map("county",state, col = colramp[values], fill = TRUE, resolution = 0,
     lty = 1, projection = "polyconic", main=paste(var), border="gray", lwd=0.3)
      map("state", state, col = "black", fill = FALSE, add = TRUE, lty = 1, lwd = 1.0, projection="polyconic")
      title(main=c(paste(var,"by county on "),select.date)     
      )
} else {
  map("state",state, col = colramp[values], fill = TRUE, resolution = 0,
     lty = 0, projection = "polyconic")
     title(main=c(paste(var,"by state on "), select.date))
     }
}

Try running this function with some data downloaded from the New York Times.

fn <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"

state.data <- read.file(fn)
## Data from the .csv file https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv has been loaded.
dim(state.data)
## [1] 61942     5
map.data(state.data)  #using the default values

How to improve this plot?

Clearly, the more populated states have more cases. We need to correct the state wide data by state population to get a rate per 100,000. This will require getting population by state. We can get this from the unemp data set in the maps package.

We can also plot at the county level, if we can get the data. The NY Times has the data, but part of the file is corrupted. I put a cleaned version on the class server.

https://personality-project.org/courses/350/covid.county.csv

fn <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
county <- read.file(fn)
## Data from the .csv file https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv has been loaded.
map.data(county,by.state=FALSE,select.date= "2020-10-19")

An example of population by county from the help page

data(unemp)
dim(unemp)
## [1] 3218    3
describe(unemp)
##       vars    n     mean        sd  median  trimmed      mad    min       max     range  skew kurtosis
## fips     1 3218 31423.17  16282.03 30028.0 31379.13 17933.53 1001.0   72153.0   71152.0  0.16    -0.63
## pop      2 3218 48146.95 155721.05 12211.0 20008.35 12943.84   52.0 4923821.0 4923769.0 14.41   348.17
## unemp    3 3218     8.99      3.64     8.5     8.70     3.26    1.2      30.1      28.9  1.03     2.07
##            se
## fips   287.02
## pop   2745.08
## unemp    0.06
map.data(unemp,var="pop",select.date=NULL,by.state=FALSE)

#now do it for unemployment
map.data(unemp,var="unemp",select.date=NULL,by.state=FALSE)

## Use the population data by county to correct the COVID plot

In the previous description file, we saw that one of the variables is FIPS.

FIPS are ‘Federal Information Processing Standards’. ’Federal Information Processing Standards (FIPS), now known as Federal Information Processing Series, are numeric codes assigned by the National Institute of Standards and Technology (NIST). Typically, FIPS codes deal with US states and counties. US states are identified by a 2-digit number, while US counties are identified by a 3-digit number. For example, a FIPS code of 06071, represents California -06 and San Bernardino County -071.

We need to merge the county level covid data with the county level population data. We do the merge by matching on the FIPS codes and using the merge function. See the help page for merge for many options.

Merging of data frames is a very important operation. Here we merge population values for 3,218 counties into a data frame that has 250K observations. The merge column is the fips code.

dim(county)
## [1] 2502832       6
dim(unemp)
## [1] 3218    3
colnames(county)
## [1] "date"   "county" "state"  "fips"   "cases"  "deaths"
colnames(unemp)
## [1] "fips"  "pop"   "unemp"
county.pop <- merge(county,unemp,by="fips")
dim(county.pop)
## [1] 2469233       8
colnames(county.pop)
## [1] "fips"   "date"   "county" "state"  "cases"  "deaths" "pop"    "unemp"
county.pop$casesbypop <- county.pop$cases/county.pop$pop
dim(county.pop)
## [1] 2469233       9
#examine the merged result
headTail(county)
##               date    county      state  fips cases deaths
## 1       2020-01-21 Snohomish Washington 53061     1      0
## 2       2020-01-22 Snohomish Washington 53061     1      0
## 3       2020-01-23 Snohomish Washington 53061     1      0
## 4       2020-01-24      Cook   Illinois 17031     1      0
## ...           <NA>      <NA>       <NA>   ...   ...    ...
## 2502829 2022-05-13     Teton    Wyoming 56039 10074     16
## 2502830 2022-05-13     Uinta    Wyoming 56041  5643     39
## 2502831 2022-05-13  Washakie    Wyoming 56043  2358     44
## 2502832 2022-05-13    Weston    Wyoming 56045  1588     18
headTail(unemp)
##       fips   pop unemp
## 1     1001 23288   9.7
## 2     1003 81706   9.1
## 3     1005  9703  13.4
## 4     1007  8475  12.1
## ...    ...   ...   ...
## 3215 72147  3053  27.7
## 3216 72149  9141  19.8
## 3217 72151 11066  24.1
## 3218 72153 16275    16
headTail(county.pop)
##          fips       date  county       state cases deaths   pop unemp casesbypop
## 1        1001 2021-12-20 Autauga     Alabama 10679    159 23288   9.7       0.46
## 2        1001 2021-01-08 Autauga     Alabama  4770     50 23288   9.7        0.2
## 3        1001 2020-05-06 Autauga     Alabama    58      3 23288   9.7          0
## 4        1001 2021-05-16 Autauga     Alabama  7005    108 23288   9.7        0.3
## ...       ...       <NA>    <NA>        <NA>   ...    ...   ...   ...        ...
## 2469230 72153 2022-03-26   Yauco Puerto Rico  4207   <NA> 16275    16       0.26
## 2469231 72153 2021-07-20   Yauco Puerto Rico  1271   <NA> 16275    16       0.08
## 2469232 72153 2021-05-03   Yauco Puerto Rico  1192   <NA> 16275    16       0.07
## 2469233 72153 2022-04-04   Yauco Puerto Rico  4232   <NA> 16275    16       0.26

We now used the merged data to plot population adjust trends.

map.data(county.pop,var="casesbypop",by.state=FALSE,select.date= "2021-10-19")

Do it again, but with finer resolution for the first few months

dates <- c("2020-02-01","2020-03-01","2020-03-08","2020-03-15","2020-03-22","2020-03-29","2020-04-01","2020-04-15","2020-05-01","2020-06-01","2020-08-01","2020-10-01","2020-12-01","2022-04-01")
for (i in 1:length(dates) ) {
map.data(county.pop,var="casesbypop",by.state=FALSE,select.date= dates[i])
}

One final tweak - show just Illinois

for (i in 1:length(dates) ) {
map.data(county.pop,var="casesbypop",by.state=FALSE,select.date= dates[i], state="Illinois")
}

Steps towards writing a function

Although one can write little scripts to do processes, it is better to write (and save) short functions to do operations that you find useful. Here we consider the development and testing of a function. (Although based upon the m2t function, we will go through the steps of development, and then add some improvements.)

Identify a need

Just write functions that don’t already exist and that do things that you find useful.

Finding t-tests values from means, standard deviations, and sample sizes.

The conventional t.test function will perform t-tests on data and then reports the mean. But sometimes we have reports of results where we are told the mean(s) and the standard deviation(s) and we want to calculate the t statistic.

William Gosset, publishing under the name `student’ developed test for the likelyhood of a particular mean given the hypothesis of a mean of zero.

\[t = \frac{\bar{x}}{se_x}\]

where the standard error of x, \(se_x\), varied by sample size and the observed standard deviation, \(sd_x\).

\[se_x = sd_x/\sqrt{(n)}\]

Can we write a function to find t given \(\bar{x}\), \(sd_x\) and n.

Writing the function

First name the function and specify what inputs it will take.

Make sure to add comments so that you will know what the function is doing.

#A function to find the t statistic given a mean, sd and sample size
mean2t <- function(mean, sd, n) {  #all functions start and finish with {}
t <- mean / (sd/sqrt(n))
print(t)
}

Test the function against other functions. We will see if we get the same values as we get from the t.test function on the Cushny data set. We first will describe the Cushny data set and then test several values.

library(psych)
library(psychTools)
describe(cushny)
##         vars  n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## Control    1 10 3.25 1.78   2.95    3.21 1.63  0.6 6.2   5.6  0.20    -1.23 0.56
## drug1      2 10 4.00 2.10   4.40    4.04 2.59  1.1 6.6   5.5 -0.25    -1.67 0.66
## drug2L     3 10 5.58 1.66   5.75    5.66 1.41  2.5 8.0   5.5 -0.30    -0.99 0.53
## drug2R     4 10 5.57 1.91   5.30    5.66 1.56  2.1 8.3   6.2 -0.09    -1.07 0.60
## delta1     5 10 0.75 1.79   0.35    0.67 1.56 -1.6 3.7   5.3  0.42    -1.30 0.57
## delta2L    6 10 2.33 2.00   1.75    2.24 2.45 -0.1 5.5   5.6  0.28    -1.66 0.63
## delta2R    7 10 2.32 2.27   1.50    2.28 2.59 -0.7 5.7   6.4  0.23    -1.68 0.72
print(describe(cushny),digits=4) #we need more accuracy
##         vars  n mean     sd median trimmed    mad  min max range    skew kurtosis     se
## Control    1 10 3.25 1.7784   2.95  3.2125 1.6309  0.6 6.2   5.6  0.1954  -1.2321 0.5624
## drug1      2 10 4.00 2.1024   4.40  4.0375 2.5945  1.1 6.6   5.5 -0.2512  -1.6748 0.6648
## drug2L     3 10 5.58 1.6612   5.75  5.6625 1.4085  2.5 8.0   5.5 -0.2958  -0.9874 0.5253
## drug2R     4 10 5.57 1.9079   5.30  5.6625 1.5567  2.1 8.3   6.2 -0.0861  -1.0689 0.6033
## delta1     5 10 0.75 1.7890   0.35  0.6750 1.5567 -1.6 3.7   5.3  0.4183  -1.3004 0.5657
## delta2L    6 10 2.33 2.0022   1.75  2.2375 2.4463 -0.1 5.5   5.6  0.2778  -1.6630 0.6332
## delta2R    7 10 2.32 2.2661   1.50  2.2750 2.5946 -0.7 5.7   6.4  0.2345  -1.6764 0.7166
t.test(cushny[2])
## 
##  One Sample t-test
## 
## data:  cushny[2]
## t = 6.0166, df = 9, p-value = 0.0001984
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  2.496048 5.503952
## sample estimates:
## mean of x 
##         4
mean2t(mean=4.00,sd=2.1024,n=10)
## [1] 6.01651
t.test(cushny[6])
## 
##  One Sample t-test
## 
## data:  cushny[6]
## t = 3.6799, df = 9, p-value = 0.005076
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.8976775 3.7623225
## sample estimates:
## mean of x 
##      2.33
mean2t(mean=2.33,sd= 2.0022,n=10) #try another one
## [1] 3.680005

Improve the function to treat 2 means

#A function to find the t statistic given a mean, sd and sample size

means2t <- function(mean, sd, n, mean2,sd2,n2) {  #all functions start and finish with {}
if(missing(mean2)) {t <- mean / (sd/sqrt(n))  #the case of just one mean
df <- n-1

} else { #the normal case
 pooled <- (sd^2 * n + sd2^2*n2)/(n+n2)
 sepooled <- sqrt(pooled/n + pooled/n2)
 t <- (mean - mean2)/sepooled
 df <- n + n2 -2
}
cat("\n t = ",t, "df =", df,  "\n")  

}

test the function against the sleep data

data(sleep)
with(sleep,
     t.test(extra[group == 1],
            extra[group == 2],equal.var=TRUE))
## 
##  Welch Two Sample t-test
## 
## data:  extra[group == 1] and extra[group == 2]
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean of x mean of y 
##      0.75      2.33
# print(describeBy(sleep,"group"),digits=5) #we need this precision

test the function

means2t(.75,1.78901,10,2.33,2.00225,10) #we add the extra precision to match
## 
##  t =  -1.860813 df = 18

Make it more general

A function to find the t statistic given a mean, sd and sample size

means2t <- function(mean, sd, n, mean2,sd2,n2,pooled=TRUE) {  #all functions start and finish with {}
if(missing(mean2)) {t <- mean / (sd/sqrt(n))  #the case of just one mean
df <- n-1

} else { #the normal case
if(pooled  ) {
 pooled <- (sd^2 * (n-1) + sd2^2*(n2-1))/(n+n2-2)
 se <- sqrt(pooled/n + pooled/n2)} else {se <- sqrt(sd^2/n + sd2^2/n2) }
 t <- (mean - mean2)/se
 df <- n + n2 -2
}
 
}   #the function ends here

Now run the function and compare it to the known value from the t-test function

means2t(.75,1.78901,10,2.33,2.00225,10)

Something is missing

We need to report the answer

means2t <- function(mean, sd, n, mean2,sd2,n2,pooled=TRUE) {  #all functions start and finish with {}
if(missing(mean2)) {t <- mean / (sd/sqrt(n))  #the case of just one mean
df <- n-1

} else { #the normal case
if(pooled  ) {
 pooled <- (sd^2 * (n-1) + sd2^2*(n2-1))/(n+n2-2)
 se <- sqrt(pooled/n + pooled/n2)} else {se <- sqrt(sd^2/n + sd2^2/n2) }
 t <- (mean - mean2)/se
 df <- n + n2 -2
}
cat("\n t = ",t, "df =", df)
}   #the function ends here



# The function also should return the probability of the t
means2t <- function(mean, sd, n, mean2,sd2,n2,pooled=TRUE) {  #all functions start and finish with {}
if(missing(mean2)) {t <- mean / (sd/sqrt(n))  #the case of just one mean
df <- n-1

} else { #the normal case
if(pooled  ) {
 pooled <- (sd^2 * (n-1) + sd2^2*(n2-1))/(n+n2-2)
 se <- sqrt(pooled/n + pooled/n2)} else {se <- sqrt(sd^2/n + sd2^2/n2) }
 t <- (mean - mean2)/se
 df <- n + n2 -2
}
  prob1 <- 2*pt(abs(t),df,lower=FALSE)
result <- list(t=t,df=df,prob1=prob1)
cat("\n t = ",t, "df =", df, " with probability = 
",prob1,"\n")
invisible(result) #return the values as well
}