---
title: "350:week 2 correlation"
author: "William Revelle"
date: "4/1/2024"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=100)
```

## Correlation using the Galton data set

Make the 'psych' and 'psychTools' packages active.

```{r startup}
library(psych)
library(psychTools)
sessionInfo()
```

## Get the data

We can get the 'galton' data set by just calling it by name. It is a built-in data set. It is important to find the dimensions of the data and perhaps to describe the data.

```{r galton}
dim(galton)       #what are the dimensions
names(galton)     #what are the variable names
describe(galton)  #basic descriptives
```

## Tabulate the data

First form the table using 'table', then sort it using the 'order' function.

```{r tabulate}
galton.tab <- table(galton)
galton.tab   #this table is ordered from short parents to tall parents
rownames(galton.tab)
rank(rownames(galton.tab))
#string commands together
ord <- order(rank(rownames(galton.tab)),decreasing=TRUE)
ord
galton.tab[ord,]   #this table is now ordered from tall parents to short parents
```

```{r}
plot(galton)
```

That plot is not very helpful, because it does not show how many people are at each point. Let's use the 'jitter' function. First set the random seed to a fixed value so that the figures will agree from run to run.

```{r}
set.seed(42)   #cite Adams, 1979
plot(galton,pch=20,col="blue")   #show the original data points
points(jitter(galton[,1]),jitter(galton[,2]))   #add a little jitter
#note we used 'points' to add to the plot

set.seed(42)   #cite Adams, 1979
plot(galton,pch=20,col="blue")   #show the original data points
points(jitter(galton[,1],2),jitter(galton[,2],2))   #add a little more jitter

set.seed(42)   #cite Adams, 1979
plot(galton,pch=20,col="blue")   #show the original data points
points(jitter(galton[,1],5),jitter(galton[,2],5))   #add even more jitter
```

### We can also display the means and error bars

```{r}
error.bars.by(child ~ parent,data=galton,eyes=FALSE,v.labels=63:73,main="Galton's Height data")
scatterHist(child ~ parent,data=galton)   #normal formula input
#but if the variables are jittered, you need this alternative style
#x (parent) comes first, then y (child), so that the axis labels match
scatterHist(jitter(galton$parent,5),jitter(galton$child,5),xlab="Parent",ylab="Child",main="Galton height data")
```

## Yet one more display: the 'pairs.panels' function

```{r}
pairs.panels(galton)
#but "jiggle" (aka 'jitter') the points
pairs.panels(galton,jiggle=TRUE)
```

## Create a function

```{r}
# default values may be specified
small <- function(data=NULL,sample.size=20, n.iter=1000) {
  nsub <- nrow(data)           #find the number of cases dynamically
  result <- rep(NA,n.iter)     #create a vector to hold the results
  #use a for loop to repeat the code inside the { }
  for(i in 1:n.iter) {         #repeat some code
    samp <- sample(nsub,sample.size,replace=TRUE)   #bootstrap resampling
    result[i] <- cor(data[samp,])[1,2]   #find the correlation for this sample and save it
  }                            #end of the loop
  return(result)               #return the values we found
}                              #end of function
```

Use this function to generate some data.

```{r}
test <- small(galton)   #this uses the default values
describe(test)
hist(test,breaks=21)    #draw a histogram of the results
```

## Now do it for a bunch of cases

```{r}
samp20 <- small(galton,20)
samp40 <- small(galton,40)
samp80 <- small(galton,80)
samp160 <- small(galton,160)
samp320 <- small(galton,320)
samp640 <- small(galton,640)
sample.df <- data.frame(samp20, samp40, samp80, samp160, samp320, samp640)
describe(sample.df)
```

## Now, we try showing these results

We show them several different ways and slowly make the figure better.

```{r violin}
violin(sample.df)
```

Add error bars to the violin plot.

```{r}
violin(sample.df)
error.bars(sample.df,add=TRUE)
```

Just show the error bars. Note that the current version of 'psych' shows just 3 colors by default. This has been fixed in the most recent version. (Coming soon.)

```{r}
error.bars(sample.df)                      #this only has 3 colors
error.bars(sample.df, col=rainbow(6))      #six colors
#combine the violin and error bars
violin(sample.df, main="Density of bootstrap resamples from Galton")
error.bars(sample.df, col=rainbow(ncol(sample.df)),add=TRUE)   #six colors
```
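
The figures above show the bootstrapped correlations clustering more tightly as the sample size grows. As a rough check, the sketch below compares the standard deviation of the resampled correlations at each sample size with the usual large-sample approximation for the standard error of a correlation, (1 - r^2)/sqrt(n - 1). This comparison is an addition to the handout, not part of the original analysis; the names `sizes`, `r.full`, `boot.sd`, and `approx.se` are made up for the illustration, and the approximation assumes roughly bivariate normal data, which the rounded Galton heights only loosely satisfy.

```{r}
library(psychTools)   #provides the galton data set

set.seed(42)          #so the resamples are reproducible
sizes <- c(20, 40, 80, 160, 320, 640)             #same sample sizes as above
r.full <- cor(galton$parent, galton$child)        #correlation in the full data set

#for each sample size, draw 1000 bootstrap resamples and
#find the standard deviation of the resulting correlations
boot.sd <- sapply(sizes, function(n) {
  r <- replicate(1000, {
    samp <- sample(nrow(galton), n, replace = TRUE)
    cor(galton$parent[samp], galton$child[samp])
  })
  sd(r)
})

#large-sample approximation to the standard error of r (an assumption here)
approx.se <- (1 - r.full^2) / sqrt(sizes - 1)

round(data.frame(n = sizes, boot.sd, approx.se), 3)
```

If the two columns track each other, quadrupling the sample size should roughly halve the spread of the resampled correlations.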