# Updating packages and graphic presentations

R packages are always under development. The version of psych on CRAN is 2.3.3; the development version on my server is 2.3.4. Rather than push every small change to CRAN (and add to the demands on the CRAN system), it is better to post small changes to a local ‘repository’ which has not gone through all of the overhead of CRAN. The local repository for the psych package is at http://personality-project.org/r . To install from this repository for the first time we use the install.packages command; if we already have the package, we can check the repository for a newer version with the update.packages command.

#install.packages("psych", repos = "http://personality-project.org/r", type = "source") # or
update.packages(oldPkgs = "psych", repos = "http://personality-project.org/r", type = "source") # if we already have it, this only downloads a newer version

Let's make sure we succeeded by checking the version number of the package.

library(psych)   #remember to make this active
library(psychTools) #needed for some of the data sets and convenience functions
sessionInfo()   #see what version we have  (psych should be 2.3.4)
## R Under development (unstable) (2023-03-17 r83997)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.2.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## 
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] psychTools_2.3.3 psych_2.3.4     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.31   R6_2.5.1        fastmap_1.1.1   xfun_0.38       lattice_0.20-45 cachem_1.0.7    parallel_4.3.0 
##  [8] knitr_1.42      foreign_0.8-84  htmltools_0.5.5 rmarkdown_2.21  cli_3.6.1       grid_4.3.0      sass_0.4.5     
## [15] jquerylib_0.1.4 mnormt_2.1.1    compiler_4.3.0  rstudioapi_0.14 tools_4.3.0     nlme_3.1-162    evaluate_0.20  
## [22] bslib_0.4.2     yaml_2.3.7      rlang_1.1.0     jsonlite_1.8.4
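If we only want the version number of a single package, packageVersion (in core R) is a quicker check than the full sessionInfo output:

packageVersion("psych")       #just the version of one package
packageVersion("psychTools")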

Seeing what has changed.

To see what has been added to a package you can read the news file for that package. You can read all of the news for a package, or select just the most recent entries. Here we read the news for the most recent release of psych. This produces a fair amount of output that you probably do not need to look at more than once, so you can comment the line out if you are making successive runs.

#news(Version > "2.3.3", package="psych") #this can be commented out

Graphic displays

One of the great powers of R is its ability to display data graphically. The human eye is a very good pattern detector, and interactive graphics are useful for exploratory data analysis. Here we show a number of ways to display data; some use core R, some use the psych package. We will use the spi data set for our examples. But because the age scores (spi[1]) are on a much larger scale than the other variables, we plot just the 2nd through 10th variables.

The spi data set is an example of the SAPA project (Synthetic Aperture Personality Assessment), which uses a web-based data collection system to collect massively missing data sets. Each participant is given a random subset of temperament, ability, and interest items taken from a very large (> 6,000 item) item pool. From this larger set of items, David Condon formed a shorter set of 135 items (the SPI, or SAPA Personality Inventory) that does a good job of spanning the major dimensions of personality. We will use this set for examples of graphics (below) as well as scale construction (in a few weeks).
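Before plotting, it is worth getting a quick sense of how big the data set is and what the first few variables are called (dim and colnames are core-R functions; as the examples below assume, the first columns of spi hold the demographic variables):

dim(spi)             #the number of participants and the number of variables
colnames(spi)[1:10]  #the names of the first ten (demographic) variables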

First, a boxplot

We will draw a Tukey boxplot for variables 2-10 of the spi data set and add a main title to the graph. Note that we do not set the echo = FALSE parameter because we want to show the R code that generated the plot.

data(spi)
boxplot(spi[2:10],main="A boxplot from the spi data set",las=3)  #las=3 makes the axis labels vertical

A box plot does an important job of summarizing the range of variation, as well as showing the medians and the interquartile range. However, we do not really see the distribution of the scores. To do this, we do a “violin” plot using violinBy.

Create a violin plot to show the density of the data.

violinBy(spi[2:10],main="A violin plot from the spi data set")

Does subject gender make a difference?

We can do the violin plot again, this time breaking it down by subject gender. We set the orientation of the x and y axis labels to be perpendicular to the axes with the graphics parameter las=2.

This introduces the concept of passing parameters to functions. As we discussed last week, these parameters may be thought of as adverbs, in that they modify our verb (in this case violinBy).

Examine the parameters (for a full list, ask for help on the function).
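A quick way to see the full argument list (and the default values) without opening the help page is the core-R args function:

args(violinBy)  #the arguments and their default values
#?violinBy      #the full help page, with explanations of each argument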

violinBy(spi[2:10],grp="sex",las=2, main="A violin plot from the spi data set, broken down by gender")

We see that there is a weakness in the ‘violinBy’ function: you really want to see the variable names as well as the group names. After exploring the function, I found that this is possible by specifying another parameter.

violinBy(spi[2:10],grp="sex",grp.name="sex",las=2, main="A violin plot from the spi data set, broken down by gender")

Back to back histograms allow us to compare possible differences by gender

We use the bi.bars function and then add additional text to the figure using the text function.

The text function adds text at the specified x, y locations in a graph. First you draw the graph, then add the text. Although the counts are displayed as positive numbers, the x locations for the male data are actually negative values.

#The data come from the spi data set in psychTools
bi.bars(spi,"age",grp = "sex", main="Back to back histograms of age by sex")
text(-150,70,'Male')
text(150,70,'Female')

What about the correlations between these demographic variables?

We can examine the bivariate scatter plots of the demographic variables from the spi data set using the pairs.panels function. This is normally done for more continuous data, but it will still show patterns of relationships for these categorical data. A preliminary view of the data suggests that we should plot age, p1edu, p2edu, and education against each other, so we select those variables to plot. Because we have 4,000 data points, we will smooth the data so that we can see the results more clearly. By default, pairs.panels does not smooth the data; we will show it both ways.

pairs.panels(spi[c(1,4:6)],main= "Age, parental education, and education")

pairs.panels(spi[c(1,4:6)],smoother=TRUE,main= "Age, parental education, and education")

The red lines in each graph are “lowess” (locally smoothed) regressions. These give a better feeling for the results.
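To see what a single lowess fit looks like in core R, here is a minimal sketch for education as a function of age (pairs.panels adds one of these to every panel automatically; lowess itself does not accept missing values, so we keep only the complete cases):

ok <- complete.cases(spi$age,spi$education)  #lowess does not handle missing values
plot(spi$age[ok],spi$education[ok],pch=20,col=rgb(0,0,0,.2),xlab="Age",ylab="Education",
     main="Education by age with a lowess line")
lines(lowess(spi$age[ok],spi$education[ok]),col="red",lwd=2)  #add the locally smoothed regression line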

Examining these correlations, we find that the education levels of the two parents tend to correlate, and that education is highly correlated with age. Does this mean that education makes you older? No.

Confidence intervals for the mean

Any data set will have variation. The box plot and violinBy plots show this variation. What if we took another set of observations? How much would the mean vary from sample to sample? Just as a sample has a mean and a variance, so do samples of samples. Although the underlying distribution for any variable might not be normal, the distribution of the means of repeated samples will tend towards normality. This is known as the “central limit theorem”. These sample means will vary by the standard error of the mean, \(se_{\bar{x}}= \frac{\sigma_{x}}{\sqrt{n}}\)
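As a quick numerical check of this formula, the standard deviation of many sample means should be close to \(\sigma_{x}/\sqrt{n}\) (a small simulation sketch; with \(\sigma_{x} = 1\) and \(n = 50\) the expected value is about 0.141):

set.seed(42)
sample.means <- replicate(10000, mean(rnorm(50)))  #the means of 10,000 samples of size 50
sd(sample.means)   #the empirical standard deviation of the sample means
1/sqrt(50)         #the theoretical standard error, sigma/sqrt(n), with sigma = 1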

The help pages for the error.bars function show this effect. We create 20 random samples of size 50 taken from a normal distribution.

set.seed(42)  #to get the same result every time
x <- replicate(20,rnorm(50))  #create 20 sets of 50 random normal numbers
boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,add=TRUE)   #overlay the box plots with standard error plots
abline(h=0)  #put in a horizontal line at the expected value (0).

When we examine this plot, we notice that 19 of the 20 groups have 95% confidence intervals that include the population value (0). However, one of the groups (18) has a confidence interval that does not include 0. This is what we would expect: the 95% confidence intervals of the sample means should miss the population value about 5% of the time. Try this again with a different seed. What happens?

set.seed(17)  #to get the same result every time
x <- replicate(20,rnorm(50))  #create 20 sets of 50 random normal numbers
boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,add=TRUE)   #overlay the box plots with standard error plots
abline(h=0)  #put in a horizontal line at the expected value (0).

One more time, but this time do not draw the box plots. We want to scale the y axis from -1 to 1 and add labels for the x and y axes. Notice that once again, 19 out of 20 confidence intervals include the true value (0).

set.seed(47)  #to get the same result every time
x <- replicate(20,rnorm(50))  #create 20 sets of 50 random normal numbers
#boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,ylim=c(-1,1),xlab="Random Groups",ylab="Means and confidence intervals",main="95% confidence intervals with cats eyes")
abline(h=0)  #put in a horizontal line at the expected value (0).

set.seed(71)  #to get the same result every time
x <- replicate(20,rnorm(50))  #create 20 sets of 50 random normal numbers
#boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,ylim=c(-1,1),xlab="Random Groups",ylab="Means and confidence intervals",main="95% confidence intervals without cats eyes", eyes=FALSE)
abline(h=0)  #put in a horizontal line at the expected value (0).
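Rather than judging the coverage rate by eye, we can check it with a small simulation (a core-R sketch): over many replications, the proportion of 95% confidence intervals that contain the true value of 0 should be close to .95.

set.seed(42)
contains.zero <- replicate(1000, {
  x <- rnorm(50)                                       #one sample of size 50
  ci <- mean(x) + c(-1,1)*qt(.975,49)*sd(x)/sqrt(50)   #its 95% confidence interval
  ci[1] < 0 & 0 < ci[2] })                             #does the interval include 0?
mean(contains.zero)   #the proportion of intervals that include the population value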

Confidence intervals for correlations

Just as sample means differ from the population mean, so will sample correlations differ from the true population value. Unless we are doing simulations we do not know the population values, but we can still find the confidence intervals for the correlations.

The basic logic of confidence intervals is to find the estimate (in this case the correlation) and then the confidence interval around that estimate. Normal theory tells us that the standard error of a correlation is \(\sqrt{\frac{1-r^2}{N-2}}\). All of this is done in the corr.test function. (cor.test in core R will do one correlation at a time; corr.test in the psych package will do many correlations at once.)
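To sketch where such an interval comes from, consider the usual normal theory recipe (the one cor.test follows): transform r to Fisher's z, find the interval on the z scale, and transform back. Here x and y are two made-up variables, used only for illustration.

set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)              #two made-up, correlated variables
r <- cor(x,y)
z <- atanh(r)                    #Fisher's r to z transformation
se.z <- 1/sqrt(length(x) - 3)    #the standard error of z
tanh(z + c(-1.96,1.96)*se.z)     #back transform to get the 95% confidence interval for r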

Let's use the epi.bfi data set, which has 231 observations on 13 variables.

First describe them, then show all the correlations using pairs.panels, then find one correlation using cor.test, and then find the correlations of the first 5 variables using corr.test. We assign the output of corr.test to an object (result) because we want to print it several different ways.

describe(epi.bfi) #first get the descriptives
##          vars   n   mean    sd median trimmed   mad min max range  skew kurtosis   se
## epiE        1 231  13.33  4.14     14   13.49  4.45   1  22    21 -0.33    -0.06 0.27
## epiS        2 231   7.58  2.69      8    7.77  2.97   0  13    13 -0.57    -0.02 0.18
## epiImp      3 231   4.37  1.88      4    4.36  1.48   0   9     9  0.06    -0.62 0.12
## epilie      4 231   2.38  1.50      2    2.27  1.48   0   7     7  0.66     0.24 0.10
## epiNeur     5 231  10.41  4.90     10   10.39  4.45   0  23    23  0.06    -0.50 0.32
## bfagree     6 231 125.00 18.14    126  125.26 17.79  74 167    93 -0.21    -0.27 1.19
## bfcon       7 231 113.25 21.88    114  113.42 22.24  53 178   125 -0.02     0.23 1.44
## bfext       8 231 102.18 26.45    104  102.99 22.24   8 168   160 -0.41     0.51 1.74
## bfneur      9 231  87.97 23.34     90   87.70 23.72  34 152   118  0.07    -0.55 1.54
## bfopen     10 231 123.43 20.51    125  123.78 20.76  73 173   100 -0.16    -0.16 1.35
## bdi        11 231   6.78  5.78      6    5.97  4.45   0  27    27  1.29     1.50 0.38
## traitanx   12 231  39.01  9.52     38   38.36  8.90  22  71    49  0.67     0.47 0.63
## stateanx   13 231  39.85 11.48     38   38.92 10.38  21  79    58  0.72    -0.01 0.76
pairs.panels(epi.bfi,gap=0)  #a graphical display, bunch the plots up

cor.test(x = epi.bfi[,1],y=epi.bfi[,2])
## 
##  Pearson's product-moment correlation
## 
## data:  epi.bfi[, 1] and epi.bfi[, 2]
## t = 24.203, df = 229, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8071747 0.8806085
## sample estimates:
##       cor 
## 0.8479101
result <- corr.test(epi.bfi[1:5])
result #this just prints the result
## Call:corr.test(x = epi.bfi[1:5])
## Correlation matrix 
##          epiE  epiS epiImp epilie epiNeur
## epiE     1.00  0.85   0.80  -0.22   -0.18
## epiS     0.85  1.00   0.43  -0.05   -0.22
## epiImp   0.80  0.43   1.00  -0.24   -0.07
## epilie  -0.22 -0.05  -0.24   1.00   -0.25
## epiNeur -0.18 -0.22  -0.07  -0.25    1.00
## Sample Size 
## [1] 231
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##         epiE epiS epiImp epilie epiNeur
## epiE    0.00 0.00   0.00   0.00    0.02
## epiS    0.00 0.00   0.00   0.53    0.00
## epiImp  0.00 0.00   0.00   0.00    0.53
## epilie  0.00 0.43   0.00   0.00    0.00
## epiNeur 0.01 0.00   0.26   0.00    0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

The correlation “p values” are given below the diagonal. But because we are doing many correlations, we should adjust these for how many correlations we are examining. Many people use the “Bonferroni” adjustment, which effectively multiplies each p value by the number of correlations (equivalently, it divides the alpha criterion by the number of tests). A more powerful adjustment is the “Holm” correction, which rank orders the p values from smallest to largest, does a “Bonferroni” correction on the smallest, and then adjusts the remaining p values according to how many are left. See the help page for p.adjust, which explains this.
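A tiny illustration with five made-up p values shows the difference between the two corrections (p.adjust is part of core R):

p <- c(.001,.01,.02,.04,.20)          #five made-up p values
p.adjust(p, method="bonferroni")      #each p is multiplied by the number of tests (and capped at 1)
p.adjust(p, method="holm")            #the step-down Holm correction is less severe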

As is true of most R functions, the basic (default) output hides many other objects. We can find the names of these objects by asking for names.

names(result) 
##  [1] "r"      "n"      "t"      "p"      "p.adj"  "se"     "sef"    "adjust" "sym"    "ci"     "ci2"    "ci.adj"
## [13] "stars"  "Call"

In addition, many functions will give somewhat longer output if you ask for it. Here we get the confidence intervals for each correlation, as well as the ‘adjusted’ confidence intervals, which have been widened to allow for the fact that we did multiple comparisons.

print(result,short=FALSE)
## Call:corr.test(x = epi.bfi[1:5])
## Correlation matrix 
##          epiE  epiS epiImp epilie epiNeur
## epiE     1.00  0.85   0.80  -0.22   -0.18
## epiS     0.85  1.00   0.43  -0.05   -0.22
## epiImp   0.80  0.43   1.00  -0.24   -0.07
## epilie  -0.22 -0.05  -0.24   1.00   -0.25
## epiNeur -0.18 -0.22  -0.07  -0.25    1.00
## Sample Size 
## [1] 231
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##         epiE epiS epiImp epilie epiNeur
## epiE    0.00 0.00   0.00   0.00    0.02
## epiS    0.00 0.00   0.00   0.53    0.00
## epiImp  0.00 0.00   0.00   0.00    0.53
## epilie  0.00 0.43   0.00   0.00    0.00
## epiNeur 0.01 0.00   0.26   0.00    0.00
## 
##  Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
##             raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## epiE-epiS        0.81  0.85      0.88  0.00      0.79      0.89
## epiE-epImp       0.75  0.80      0.84  0.00      0.73      0.86
## epiE-epili      -0.34 -0.22     -0.10  0.00     -0.38     -0.06
## epiE-epiNr      -0.30 -0.18     -0.05  0.01     -0.32     -0.02
## epiS-epImp       0.31  0.43      0.53  0.00      0.27      0.56
## epiS-epili      -0.18 -0.05      0.08  0.43     -0.18      0.08
## epiS-epiNr      -0.34 -0.22     -0.09  0.00     -0.37     -0.05
## epImp-epili     -0.36 -0.24     -0.12  0.00     -0.40     -0.07
## epImp-epiNr     -0.20 -0.07      0.06  0.26     -0.22      0.07
## epili-epiNr     -0.37 -0.25     -0.13  0.00     -0.41     -0.08

Showing the confidence intervals (normal theory)

We can create a corPlot of the correlations, showing the confidence ranges:

corPlotUpperLowerCi(result)

We have (by default) scaled the correlations by their “significance”.

Confidence intervals using the bootstrap

Perhaps one of the most important contributions to modern statistics is the concept of the “bootstrap”. This allows us to find confidence intervals even if the underlying distributions are not normal. The basic logic is to resample (with replacement) from our observed data. This treats our initial data set as the population, and then takes many (e.g., 100 to 10,000) random resamples of it. We then examine the distribution of the statistic across these resamples.
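Here is a minimal core-R sketch of the idea, bootstrapping the mean of a deliberately skewed (exponential) variable; the 2.5% and 97.5% quantiles of the resampled means give a 95% bootstrap confidence interval:

set.seed(42)
x <- rexp(100)                        #a decidedly non-normal sample
boot.means <- replicate(1000, mean(sample(x, replace=TRUE)))  #resample with replacement and find each mean
quantile(boot.means, c(.025,.975))    #a 95% bootstrap confidence interval for the mean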

The bootstrap was developed by Bradley Efron as a generalization of John Tukey's ‘jackknife’. Tukey thought it should be called the shotgun because it blew up the data! The bootstrap may be used to find the distribution of any parameter and does not require normal theory. We will use it again when we look at the mediate function, as well as the confidence intervals of factor loadings.

The cor.ci function will do the bootstrap for correlations. By default, it also draws a heat map of the correlations, scaling the correlations by “significance”. We can also draw this plot with the lower and upper bounds of the correlations using the corPlotUpperLowerCi function with the output from the cor.ci function.

boot.result <- cor.ci(epi.bfi[1:5],n.iter=1000) #the same data as before

boot.result   #show the results
## Call:corCi(x = x, keys = keys, n.iter = n.iter, p = p, overlap = overlap, 
##     poly = poly, method = method, plot = plot, minlength = minlength, 
##     n = n)
## 
##  Coefficients and bootstrapped confidence intervals 
##         epiE  epiS  epImp epili epiNr
## epiE     1.00                        
## epiS     0.85  1.00                  
## epiImp   0.80  0.43  1.00            
## epilie  -0.22 -0.05 -0.24  1.00      
## epiNeur -0.18 -0.22 -0.07 -0.25  1.00
## 
##  scale correlations and bootstrapped confidence intervals 
##             lower.emp lower.norm estimate upper.norm upper.emp    p
## epiE-epiS        0.81       0.81     0.85       0.88      0.88 0.00
## epiE-epImp       0.75       0.75     0.80       0.85      0.85 0.00
## epiE-epili      -0.33      -0.33    -0.22      -0.11     -0.10 0.00
## epiE-epiNr      -0.29      -0.29    -0.18      -0.06     -0.06 0.00
## epiS-epImp       0.31       0.31     0.43       0.53      0.53 0.00
## epiS-epili      -0.19      -0.18    -0.05       0.08      0.08 0.42
## epiS-epiNr      -0.34      -0.35    -0.22      -0.08     -0.08 0.00
## epImp-epili     -0.35      -0.36    -0.24      -0.12     -0.11 0.00
## epImp-epiNr     -0.20      -0.20    -0.07       0.05      0.04 0.26
## epili-epiNr     -0.37      -0.37    -0.25      -0.12     -0.12 0.00
names(boot.result)  #what else is there?
## [1] "rho"        "means"      "sds"        "tci"        "ptci"       "ci"         "Call"       "replicates"
describe(boot.result$replicates)
##     vars    n  mean   sd median trimmed  mad   min   max range  skew kurtosis se
## X1     1 1000  0.85 0.02   0.85    0.85 0.02  0.77  0.90  0.12 -0.28     0.08  0
## X2     2 1000  0.80 0.03   0.80    0.80 0.02  0.71  0.87  0.16 -0.41     0.51  0
## X3     3 1000 -0.22 0.06  -0.22   -0.22 0.06 -0.40 -0.04  0.36  0.15    -0.05  0
## X4     4 1000 -0.18 0.06  -0.18   -0.18 0.06 -0.34  0.01  0.35  0.10    -0.14  0
## X5     5 1000  0.42 0.06   0.43    0.43 0.06  0.21  0.62  0.40 -0.14     0.05  0
## X6     6 1000 -0.05 0.07  -0.05   -0.05 0.06 -0.26  0.18  0.45  0.09     0.36  0
## X7     7 1000 -0.22 0.07  -0.22   -0.22 0.07 -0.42  0.00  0.42  0.10    -0.28  0
## X8     8 1000 -0.24 0.06  -0.24   -0.24 0.06 -0.43  0.02  0.45  0.14     0.19  0
## X9     9 1000 -0.07 0.06  -0.07   -0.07 0.06 -0.25  0.14  0.38 -0.01    -0.17  0
## X10   10 1000 -0.25 0.07  -0.25   -0.25 0.07 -0.44 -0.04  0.40  0.15    -0.20  0
multi.hist(boot.result$replicates[,1:9]) #show a histogram of some of the bootstrapped correlations

corPlotUpperLowerCi(boot.result)

Compare the empirical confidence limits (lower.emp and upper.emp) with the normal theory estimates (lower.norm and upper.norm); they are very similar, but not identical.

We will see the bootstrap used in many of the analyses we do this quarter. It is a powerful resampling technique.

Homework for week 2 (called 1a in the Canvas assignments)

Using your own data, or a data set chosen from the class data sets, redo these analyses.

To find example data sets, point your browser at https://personality-project.org/courses/350/datasets/ and then add the name of the file you want to load.

e.g.

fn <- "https://personality-project.org/courses/350/datasets/hp.csv"
example <- read.file(fn)
## Data from the .csv file https://personality-project.org/courses/350/datasets/hp.csv has been loaded.