Probably the most common use of R is just to do a set of analyses using a number of different functions.
For instance, in analysing the data from a psychology experiment in psych 205, I prepared a few scripts that students could use. Similarly, to teach personality research, I could write out the scripts for simple scale construction.
In ouline form these all follow the same concept
get your data (using read.file or read.clipboard)
describe the data (in terms of describe and perhaps pairs.panels)
If forming scales, then create a set of scoring keys and then use scoreItems.
If just one scale is needed, the alpha fuction will produce scores as well.
Proceed to basic data analysis.
Even as you are collecting data for a study you are running, you can write the scripts ahead of time and test them out. That way, when your data are all collected, you are ready to go.
Here I consider the map
function from the package
maps
. Built into the package are the commands to draw maps
of the world, the US, or the US by state and by county.
### Lets just try the example from the ?map help menu
You must first install the maps
package. Use the tools
menu or enter the command install.packages('maps')
. At that
point you need to restart R studio
library(maps) #requires having the maps package
map('usa') # national boundaries
map('county', 'new jersey') # county map of New Jersey
map('state', region = c('new york', 'new jersey', 'penn')) # map of three states
map("state", ".*dakota", myborder = 0) # map of the dakotas
map.axes() # show the effect of myborder = 0
Here is an example of 2009 unemployment statistics
data(unemp)
data(county.fips)
# define color buckets
colors = c("#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043")
unemp$colorBuckets <- as.numeric(cut(unemp$unemp, c(0, 2, 4, 6, 8, 10, 100)))
leg.txt <- c("<2%", "2-4%", "4-6%", "6-8%", "8-10%", ">10%")
# align data with map definitions by (partial) matching state,county
# names, which include multiple polygons for some counties
cnty.fips <- county.fips$fips[match(map("county", plot=FALSE)$names,
county.fips$polyname)]
colorsmatched <- unemp$colorBuckets [match(cnty.fips, unemp$fips)]
# draw map
map("county", col = colors[colorsmatched], fill = TRUE, resolution = 0,
lty = 0, projection = "polyconic")
map("state", col = "white", fill = FALSE, add = TRUE, lty = 1, lwd = 0.2,
projection="polyconic")
title("unemployment by county, 2009")
legend("topright", leg.txt, horiz = TRUE, fill = colors)
The basic help file gives a nice example, which we can then modify to
examine something more timely. Lets examine COVID-19 cases by county. To
do this, we take the example from the map
help file and
modify if to treat our data.
require(maps) # you need to install this first
library(psych)
library(psychTools)
#adapted from the help file for the map function but set up to do more current data
map.data <- function(data,var="cases",select.date="2020-10-19", n=10, by.state=TRUE, fips = "fips",state=NULL) {
require(maps) #make sure that we have the maps package installed and loaded
gr <- colorRampPalette(c("red", "blue"))
colramp <- gr(n) #set the color range
#get the data
if(!is.null(select.date)) {
x <- subset(data,date ==select.date) #the name of the data file we are using
} else {x <- data}
qx <- quantile(x[,var],probs=(seq(0,1,1/n)),na.rm=TRUE) #figure out the quantiles for the data
if(by.state){
values <- x[match(state.fips$fips,x$fips),var]
} else {
values <- x[match(county.fips$fips,x$fips),var]
}
values <- as.numeric(cut(values,qx,include.lowest=TRUE)) #convert to quantiles
if(!by.state){map("county",state, col = colramp[values], fill = TRUE, resolution = 0,
lty = 1, projection = "polyconic", main=paste(var), border="gray", lwd=0.3)
map("state", state, col = "black", fill = FALSE, add = TRUE, lty = 1, lwd = 1.0, projection="polyconic")
title(main=c(paste(var,"by county on "),select.date)
)
} else {
map("state",state, col = colramp[values], fill = TRUE, resolution = 0,
lty = 0, projection = "polyconic")
title(main=c(paste(var,"by state on "), select.date))
}
}
fn <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"
state.data <- read.file(fn)
## Data from the .csv file https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv has been loaded.
dim(state.data)
## [1] 61942 5
map.data(state.data) #using the default values
Clearly, the more populated states have more cases. We need to correct the state wide data by state population to get a rate per 100,000. This will require getting population by state. We can get this from the unemp data set in the maps package.
We can also plot at the county level, if we can get the data. The NY Times has the data, but part of the file is corrupted. I put a cleaned version on the class server.
https://personality-project.org/courses/350/covid.county.csv
fn <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
county <- read.file(fn)
## Data from the .csv file https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv has been loaded.
map.data(county,by.state=FALSE,select.date= "2020-10-19")
data(unemp)
dim(unemp)
## [1] 3218 3
describe(unemp)
## vars n mean sd median trimmed mad min max range skew kurtosis
## fips 1 3218 31423.17 16282.03 30028.0 31379.13 17933.53 1001.0 72153.0 71152.0 0.16 -0.63
## pop 2 3218 48146.95 155721.05 12211.0 20008.35 12943.84 52.0 4923821.0 4923769.0 14.41 348.17
## unemp 3 3218 8.99 3.64 8.5 8.70 3.26 1.2 30.1 28.9 1.03 2.07
## se
## fips 287.02
## pop 2745.08
## unemp 0.06
map.data(unemp,var="pop",select.date=NULL,by.state=FALSE)
#now do it for unemployment
map.data(unemp,var="unemp",select.date=NULL,by.state=FALSE)
## Use the population data by county to correct the COVID plot
In the previous description file, we saw that one of the variables is
FIPS
.
FIPS are ‘Federal Information Processing Standards’. ’Federal Information Processing Standards (FIPS), now known as Federal Information Processing Series, are numeric codes assigned by the National Institute of Standards and Technology (NIST). Typically, FIPS codes deal with US states and counties. US states are identified by a 2-digit number, while US counties are identified by a 3-digit number. For example, a FIPS code of 06071, represents California -06 and San Bernardino County -071.
We need to merge
the county level covid data with the
county level population data. We do the merge by matching on the FIPS
codes and using the merge
function. See the help page for
merge
for many options.
Merging of data frames is a very important operation. Here we merge
population values for 3,218 counties into a data frame that has 250K
observations. The merge column is the fips
code.
dim(county)
## [1] 2502832 6
dim(unemp)
## [1] 3218 3
colnames(county)
## [1] "date" "county" "state" "fips" "cases" "deaths"
colnames(unemp)
## [1] "fips" "pop" "unemp"
county.pop <- merge(county,unemp,by="fips")
dim(county.pop)
## [1] 2469233 8
colnames(county.pop)
## [1] "fips" "date" "county" "state" "cases" "deaths" "pop" "unemp"
county.pop$casesbypop <- county.pop$cases/county.pop$pop
dim(county.pop)
## [1] 2469233 9
#examine the merged result
headTail(county)
## date county state fips cases deaths
## 1 2020-01-21 Snohomish Washington 53061 1 0
## 2 2020-01-22 Snohomish Washington 53061 1 0
## 3 2020-01-23 Snohomish Washington 53061 1 0
## 4 2020-01-24 Cook Illinois 17031 1 0
## ... <NA> <NA> <NA> ... ... ...
## 2502829 2022-05-13 Teton Wyoming 56039 10074 16
## 2502830 2022-05-13 Uinta Wyoming 56041 5643 39
## 2502831 2022-05-13 Washakie Wyoming 56043 2358 44
## 2502832 2022-05-13 Weston Wyoming 56045 1588 18
headTail(unemp)
## fips pop unemp
## 1 1001 23288 9.7
## 2 1003 81706 9.1
## 3 1005 9703 13.4
## 4 1007 8475 12.1
## ... ... ... ...
## 3215 72147 3053 27.7
## 3216 72149 9141 19.8
## 3217 72151 11066 24.1
## 3218 72153 16275 16
headTail(county.pop)
## fips date county state cases deaths pop unemp casesbypop
## 1 1001 2021-12-20 Autauga Alabama 10679 159 23288 9.7 0.46
## 2 1001 2021-01-08 Autauga Alabama 4770 50 23288 9.7 0.2
## 3 1001 2020-05-06 Autauga Alabama 58 3 23288 9.7 0
## 4 1001 2021-05-16 Autauga Alabama 7005 108 23288 9.7 0.3
## ... ... <NA> <NA> <NA> ... ... ... ... ...
## 2469230 72153 2022-03-26 Yauco Puerto Rico 4207 <NA> 16275 16 0.26
## 2469231 72153 2021-07-20 Yauco Puerto Rico 1271 <NA> 16275 16 0.08
## 2469232 72153 2021-05-03 Yauco Puerto Rico 1192 <NA> 16275 16 0.07
## 2469233 72153 2022-04-04 Yauco Puerto Rico 4232 <NA> 16275 16 0.26
We now used the merged data to plot population adjust trends.
map.data(county.pop,var="casesbypop",by.state=FALSE,select.date= "2021-10-19")
The nice thing about functions, instead of scripts, is that we can easily adjust them to make multiple calls. Here we show rates for multiple months.
dates <- c("2020-02-01","2020-03-01","2020-04-01","2020-05-01","2020-06-01","2020-08-01","2020-10-01","2020-12-01","2021-02-01","2021-04-01","2021-06-01","2021-08-01","2021-10-01","2021-12-01","2022-02-01","2022-04-01")
for (i in 1:length(dates) ) {
map.data(county.pop,var="casesbypop",by.state=FALSE,select.date= dates[i])
}
dates <- c("2020-02-01","2020-03-01","2020-03-08","2020-03-15","2020-03-22","2020-03-29","2020-04-01","2020-04-15","2020-05-01","2020-06-01","2020-08-01","2020-10-01","2020-12-01","2022-04-01")
for (i in 1:length(dates) ) {
map.data(county.pop,var="casesbypop",by.state=FALSE,select.date= dates[i])
}