---
title: "350:Week 2c"
output:
html_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=120) #This sets the width of the output, 80 seems to be the default and is too narrow
```
# Updating packages and graphic presentations
R packages are always under development. The version of `psych` on CRAN is 2.3.3; the development version on my server is 2.3.4. Rather than push every small update to CRAN (and add to the demands on the CRAN system), it is better to post small changes to a local 'repository' which has not gone through all the overhead of CRAN. The local repository for the `psych` package is at http://personality-project.org/r . To install from this repository for the first time we use the `install.packages` command. Or, we can check the repository to see whether we need to update a package we already have by using the `update.packages` command.
\newpage
```{r}
#install.packages("psych", repos = "http://personality-project.org/r", type = "source") # or
update.packages("psych", repos = "http://personality-project.org/r", type = "source") # if we already have it does not need to download a new version.
```
Let's make sure we succeeded. Check the version number of the package.
```{r}
library(psych) #remember to make this active
library(psychTools) #needed for some of the data sets and convenience functions
sessionInfo() #see what version we have (psych should be 2.3.4)
```
\newpage
### Seeing what has changed.
To see what has been added to a package you can read the `NEWS` file for that package. You can read all of the news for a package, or select just the most recent entries. Here we read the news for the most recent release of `psych`. This produces a fair amount of output that you probably only need to look at once, so you can comment the line out if you are making successive runs.
```{r}
#news(Version > "2.3.3", package = "psych") #this can be commented out
```
\newpage
# Graphic displays
One of the great strengths of R is its ability to display data graphically. The human eye is a very good pattern detector, and interactive graphics are useful for exploratory data analysis. Here we show a number of ways to display data; some use core R, some use the `psych` package. We will use the `spi` data set for our examples. But, because the age scores (variable 1) are on a much larger scale than the others, we plot just the 2nd through 10th variables.
The `spi` data set is an example of the SAPA project (Synthetic Aperture Personality Assessment), which uses a web based data collection system to collect massively missing data sets. Each participant is given a random subset of temperament, ability, and interest items taken from a very large (> 6,000 items) item pool. From this larger set of items, David Condon formed a shorter set of 135 items (the SPI, or SAPA Personality Inventory) that does a good job of spanning the major dimensions of personality. We will use this set for examples of graphics (below) as well as scale construction (in a few weeks).
\newpage
## First a boxplot
We will draw a Tukey boxplot for variables 2-10 of the spi data set. We add a main title to the graph.
Note that we do not set the `echo = FALSE` parameter because we want to show the R code that generated the plot.
```{r}
data(spi) #the spi data set is in the psychTools package
boxplot(spi[2:10],main="A box plot from the spi data set",las=3) #las=3 draws all axis labels vertically
```
\newpage
A box plot does an important job of summarizing the range of variation, as well as showing the medians and the interquartile ranges. However, it does not really show the distribution of the scores. To see the distributions, we draw a "violin" plot using `violinBy`.
### Create a violin plot to show the density of the data.
```{r}
violinBy(spi[2:10],main="A violin plot from the spi data set")
```
\newpage
## Does subject gender make a difference?
We can do the violin plot again, this time breaking it down by subject gender. We specify the orientation of the x and y axis labels as perpendicular to the axes by setting las=2 as a graphics parameter.
This introduces the concept of passing parameters to functions. As we discussed last week, these parameters may be thought of as adverbs, in that they modify our verb (in this case `violinBy`).
Examine the parameters (for a full list, ask for help for the function).
```{r}
violinBy(spi[2:10],grp="sex",las=2, main="A violin plot from the spi data set, broken down by gender")
```
We see that there is a weakness in the `violinBy` output: we really want to see the variable names as well as the group names. After exploring the function, I found that this is possible by specifying another parameter.
```{r}
violinBy(spi[2:10],grp="sex",grp.name="sex",las=2, main="A violin plot from the spi data set, broken down by gender")
```
\newpage
## Back to back histograms allow us to compare possible differences by gender
We use the `bi.bars` function and then add additional text to the figure using the `text` function.
The `text` function adds text at the x, y locations given in a graph. First you draw the graph, then add the text. Although the axis labels show positive numbers, the x locations for the male side of the plot are actually negative values.
```{r}
#The data come from the spi data set in psychTools
bi.bars(spi,"age",grp = "sex", main="Back to back histograms of age by sex")
text(-150,70,'Male')
text(150,70,'Female')
```
\newpage
## What about the correlations between these demographic variables?
We can examine the bivariate scatter plots of the demographic variables from the spi data set using the `pairs.panels` function. This is normally done for more continuous data, but it will show patterns of relationships even for these categorical data. A preliminary view of the data suggests that we should plot age, p1edu, p2edu, and education against each other, so we select those variables to plot. Because we have 4,000 data points, we will smooth the data so that we can see the results more clearly. By default, `pairs.panels` does not smooth the data, so we show it both ways.
```{r}
pairs.panels(spi[c(1,4:6)],main= "Age, parental education, and education") #the default does not smooth
pairs.panels(spi[c(1,4:6)],smoother=TRUE,main= "Age, parental education, and education") #smoother=TRUE draws a smoothed density scatter plot
```
The red lines in each panel are "lowess" (locally smoothed) regressions. These give a better feeling for the results.
Examining these correlations, we find that the education levels of the two parents tend to correlate with each other, and that the participant's own education is highly correlated with age. Does this mean that education makes you older? No.
\newpage
# Confidence intervals for the mean
Any data set will have variation. The box plot and `violinBy` plots show this variation. What if we took another set of observations? How much would the mean vary from sample to sample? Just as a sample has a mean and a variance, so do samples of samples. Although the underlying distribution for any variable might not be normal, the distribution of the means of these samples will tend towards normality. This is known as the "central limit theorem". The distribution of these sample means will vary by the standard error of the mean, where $se_{\bar{x}}= \frac{\sigma_{x}}{\sqrt{n}}$.
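As a quick numerical check of this formula, here is a minimal sketch (the seed and number of replications are arbitrary choices) comparing the standard deviation of many sample means with the theoretical value $\frac{\sigma_{x}}{\sqrt{n}}$.
```{r}
set.seed(1) #an arbitrary seed, just for reproducibility
sample.means <- replicate(10000, mean(rnorm(50))) #10,000 sample means, each based upon n = 50
sd(sample.means) #the empirical standard error of the mean
1/sqrt(50) #the theoretical value, sigma/sqrt(n), with sigma = 1
```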
The help pages for the `error.bars` function show this effect. We create 20 random samples of size 50 taken from a normal distribution.
\newpage
```{r}
set.seed(42) #to get the same result every time
x <- replicate(20,rnorm(50)) #create 20 sets of 50 random normal numbers
boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,add=TRUE) #overlay the box plots with standard error plots
abline(h=0) #put in a horizontal line at the expected value (0).
```
When we examine this plot, we notice that 19 of the 20 groups have 95\% confidence intervals that include the population value (0). However, one of the groups (18) has a confidence interval that does not include 0. This is just what we would expect: the 95\% confidence intervals of the sample means should miss the population value about 5\% of the time.
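To check that 5\% figure more directly, here is a minimal simulation sketch (the seed and number of replications are arbitrary choices) that counts how often a normal theory confidence interval for the mean misses the true value of 0.
```{r}
set.seed(123) #an arbitrary seed, just for reproducibility
crit <- qt(.975, df = 49) #the t critical value for a 95% interval with n = 50
misses <- replicate(10000, {
  x <- rnorm(50) #a sample of 50 from a normal with true mean 0
  ci <- mean(x) + c(-1, 1) * crit * sd(x)/sqrt(50) #the 95% confidence interval for the mean
  ci[1] > 0 | ci[2] < 0 #TRUE when the interval misses the true value
})
mean(misses) #the proportion of misses -- should be close to .05
```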
\newpage
Let's try this again with a different seed. What happens?
```{r}
set.seed(17) #to get the same result every time
x <- replicate(20,rnorm(50)) #create 20 sets of 50 random normal numbers
boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,add=TRUE) #overlay the box plots with standard error plots
abline(h=0) #put in a horizontal line at the expected value (0).
```
One more time, but this time do not draw the box plots. We want to scale the y axis from -1 to 1 and add labels for the x and y axes. Notice that once again, 19 out of 20 confidence intervals include the true value (0).
```{r}
set.seed(47) #to get the same result every time
x <- replicate(20,rnorm(50)) #create 20 sets of 50 random normal numbers
#boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,ylim=c(-1,1),xlab="Random Groups",ylab="Means and confidence intervals",main="95% confidence intervals with cats eyes")
abline(h=0) #put in a horizontal line at the expected value (0).
```
```{r}
set.seed(71) #to get the same result every time
x <- replicate(20,rnorm(50)) #create 20 sets of 50 random normal numbers
#boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,ylim=c(-1,1),xlab="Random Groups",ylab="Means and confidence intervals",main="95% confidence intervals without cats eyes", eyes=FALSE)
abline(h=0) #put in a horizontal line at the expected value (0).
```
## Confidence intervals for correlations
Just as sample means differ from the population mean, so will sample correlations differ from the true population value. Unless we are doing simulations, we do not know the population values, but we can still find confidence intervals for the correlations.
The basic logic of confidence intervals is to find the estimate (in this case the correlation) and then the confidence interval of the estimate. Normal theory tells us that the standard error of a correlation is $\sqrt{\frac{1-r^2}{N-2}}$.
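As a quick check of this formula, here is a minimal sketch using simulated data (the sample size, seed, and effect size are arbitrary choices): compute a correlation, its normal theory standard error, and an approximate 95\% confidence interval, and compare the result with `cor.test`, which uses the Fisher z transformation.
```{r}
set.seed(7) #an arbitrary seed, just for reproducibility
N <- 100
x <- rnorm(N)
y <- .5 * x + rnorm(N) #two simulated variables with a moderate correlation
r <- cor(x, y)
se.r <- sqrt((1 - r^2)/(N - 2)) #the normal theory standard error given above
r + c(-1.96, 1.96) * se.r #an approximate 95% confidence interval
cor.test(x, y)$conf.int #cor.test uses the Fisher z transformation, so the values are close but not identical
```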
All of this is done in the `corr.test` function. (`cor.test` in core R will do one correlation at a time; `corr.test` in the `psych` package will do many correlations at once.)
Let's use the `epi.bfi` data set, which has 231 observations on 11 variables.
First describe them, then show all the correlations using `pairs.panels`, then find one correlation using `cor.test`, and then find the correlations of the first 5 variables using `corr.test`. We assign the output of `corr.test` to an object (result) because we want to print it several different ways.
\newpage
```{r}
describe(epi.bfi) #first get the descriptives
pairs.panels(epi.bfi,gap=0) #a graphical display, bunch the plots up
cor.test(x = epi.bfi[,1],y=epi.bfi[,2]) #one correlation, with its confidence interval
result <- corr.test(epi.bfi[1:5]) #all the correlations among the first 5 variables
result #this just prints the result
```
\newpage
The correlation "p values" are given below the diagonal. But because we are doing many correlations, we should adjust these for how many correlations we are examing. Many people use the "Bonferroni" adjustment, which is basically the p value / number of correlations. A more powerful adjustment is the "Holm" correction which rank orders the p value from smallest to largest, and then does a "Bonferroni" correction on the smallest, and then adjusts the remaining p values according to how many are left. See the help menu for `p.adjust` which explains this.
As is true of most R functions, the basic (default) output hides many other objects. We can find the names of these objects by asking for names.
```{r}
names(result)
```
In addition, many functions will give somewhat longer output if you ask for it. Here we get the confidence intervals for each correlation, as well as confidence intervals that have been 'adjusted' for the fact that we did multiple comparisons.
```{r}
print(result,short=FALSE)
```
### Showing the confidence intervals (normal theory)
We can create a corPlot of the correlations, showing the confidence ranges:
```{r}
corPlotUpperLowerCi(result)
```
We have (by default) scaled the correlations by their "significance".
## Confidence intervals using the bootstrap
Perhaps one of the most important contributions to modern statistics is the concept of the "bootstrap". This allows us to find confidence intervals even if the underlying distributions are not normal. The basic logic is to resample (with replacement) from our observed data. This treats our initial sample as if it were the population, takes many (e.g., 100 to 10,000) random resamples of the data, and then considers the distribution of the statistic across these resamples.
The bootstrap was developed by Bradley Efron as a generalization of John Tukey's 'jackknife'. (Tukey thought it should be called the shotgun, because it blew up the data!) The bootstrap may be used to find the distribution of any parameter and does not require normal theory. We will use it again when we look at the `mediate` function, as well as the confidence intervals of factor loadings.
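To see the resampling logic itself, here is a bare-bones sketch (the seed and the 1000 resamples are arbitrary choices) using the first two variables of the `epi.bfi` data set: resample the rows with replacement, recompute the correlation each time, and take the middle 95\% of the resulting distribution as the confidence interval.
```{r}
set.seed(101) #an arbitrary seed, just for reproducibility
n.obs <- nrow(epi.bfi)
boot.r <- replicate(1000, {
  rows <- sample(n.obs, replace = TRUE) #resample the rows with replacement
  cor(epi.bfi[rows, 1], epi.bfi[rows, 2], use = "pairwise") #recompute the correlation on the resampled data
})
quantile(boot.r, c(.025, .975)) #the middle 95% of the bootstrapped correlations
```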
The `cor.ci` function will do the bootstrap for correlations. By default, it also draws a heat map of the correlations, scaling the correlations by "significance". We can also draw this plot with the lower and upper bounds of the correlations by passing the output of `cor.ci` to the `corPlotUpperLowerCi` function.
```{r}
boot.result <- cor.ci(epi.bfi[1:5],n.iter=1000) #the same data as before
boot.result #show the results
names(boot.result) #what else is there?
describe(boot.result$replicates)
multi.hist(boot.result$replicates[,1:9]) #show a histogram of some of the bootstrapped correlations
corPlotUpperLowerCi(boot.result) #plot the correlations with their bootstrapped confidence bounds
```
Compare the empirical confidence regions (lower.emp and upper.emp) with the normal theory estimates (lower.norm and upper.norm): they are very similar, but not identical.
We will see the use of the bootstrap in many of the analyses we will do this quarter. It is a powerful use of cross validation.
# Homework for week 2: (called 1a in the canvas assignments)
Using your own data, or a data set chosen from the class data sets, redo these analyses.
To find example data sets, using your browser, go to \href{https://personality-project.org/courses/350/datasets}{https://personality-project.org/courses/350/datasets} and then add the name of the file to load it,
e.g.
```{r}
fn <- "https://personality-project.org/courses/350/datasets/hp.csv"
example <- read.file(fn) #read.file (from psychTools) chooses the appropriate reading function based upon the file suffix
```