# Updating packages and Graphic presentations
R packages are always under development. The version of psych on CRAN is 2.3.3; the development version on my server is 2.3.4.
Rather than push every small update to CRAN, and to reduce the demands on the CRAN system, it is better to post small changes on local ‘repositories’ which have not gone through all the overhead of CRAN. The local repository for the psych package is at http://personality-project.org/r. To install from this repository for the first time we use the install.packages command. Or, we can check the repository to see if we need to update our package using the update.packages command.
#install.packages("psych", repos = "http://personality-project.org/r", type = "source") # or
update.packages("psych", repos = "http://personality-project.org/r", type = "source") # if we already have psych, this downloads a new version only when one is needed
Let's make sure we succeeded. Check the version number of the package.
library(psych) #remember to make this active
library(psychTools) #needed for some of the data sets and convenience functions
sessionInfo() #see what version we have (psych should be 2.3.4)
## R Under development (unstable) (2023-03-17 r83997)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.2.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
##
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] psychTools_2.3.3 psych_2.3.4
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 fastmap_1.1.1 xfun_0.38 lattice_0.20-45 cachem_1.0.7 parallel_4.3.0
## [8] knitr_1.42 foreign_0.8-84 htmltools_0.5.5 rmarkdown_2.21 cli_3.6.1 grid_4.3.0 sass_0.4.5
## [15] jquerylib_0.1.4 mnormt_2.1.1 compiler_4.3.0 rstudioapi_0.14 tools_4.3.0 nlme_3.1-162 evaluate_0.20
## [22] bslib_0.4.2 yaml_2.3.7 rlang_1.1.0 jsonlite_1.8.4
To see what has been added to a package you can read the news file for that package. You can read all of the news for a package, or select just the most recent. Here we read the news for the most recent release of psych. This produces a fair amount of output that you probably only need to look at once, so you can comment the line out if you are making successive runs.
#news(Version > "2.3.3", package = "psych") # this can be commented out
One of the great powers of R is its ability to display data graphically. The human eye is a very good pattern detector, and interactive graphics are useful for exploratory data analysis. Here we show a number of ways to display data; some use core R, some use the psych package. We will use the spi data set for our examples. But because the age scores (spi[1]) are on a much larger scale than the others, we plot just the 2nd to 10th variables.
The spi
data set is an example of the
SAPA project (Synthetic Aperture
Personality Assessment) which uses a web based data collection system to
collect massively missing data sets. Each participant is given a random
subset of temperament, ability and interest items taken from a very
large (> 6,000 items) item pool. From this larger set of items, David Condon has formed a shorter set of 135 items (the SPI, or SAPA Personality Inventory) that do a good job of spanning the major dimensions of personality. We will use this set for examples of graphics (below) as well as scale construction (in a few weeks).
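Before plotting, it helps to confirm what we have. A quick sketch (assuming, as the text implies, that spi lives in psychTools, has 4000 participants, and starts with age followed by the other demographic variables):

```r
library(psychTools)   # the spi data set is in psychTools
data(spi)
dim(spi)              # 4000 participants; 135 items plus the demographic variables
colnames(spi)[1:10]   # the variables we will plot below (age is the first)
```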
We will draw a Tukey BoxPlot for variables 2-10 of the spi data set.
We add a main title to the graph. Note that we do not set the echo = FALSE parameter because we want to show the R code that generated the plot.
data(spi)
boxplot(spi[2:10],main="A box plot from the spi data set",las=3)
A box plot does an important job of summarizing the range of
variation, as well as showing the medians and the interquartile range.
However, we do not really see the distribution of the scores. To do
this, we do a “violin” plot using violinBy
.
violinBy(spi[2:10],main="A violin plot from the spi data set")
We can do the violin plot again, this time breaking it down by subject gender. We specify the orientation of the x and y axis labels as being perpendicular to the axis by specifying las=2 as a graphics parameter.
This introduces the concept of passing parameters to functions. As we discussed last week, these parameters may be thought of as adverbs, in that they modify our verb (in this case violinBy).
Examine the parameters (for a full list, ask for help for the function).
violinBy(spi[2:10],grp="sex",las=2, main="A violin plot from the spi data set, broken down by gender")
We see that there is a weakness in the ‘violinBy’ function, in that you really want to have the variable names as well as the group names. After exploring the function, I found that this is possible by specifying another parameter.
violinBy(spi[2:10],grp="sex",grp.name="sex",las=2, main="A violin plot from the spi data set, broken down by gender")
We use the bi.bars function and then add additional text to the figure using the text function.
The text function adds text at the specified x, y locations in a graph. First you draw the graph, then add the text. Although shown as positive numbers, the locations for the male data are actually negative values.
#The data come from the spi data set in psychTools
bi.bars(spi,"age",grp = "sex", main="Back to back histograms of age by sex")
text(-150,70,'Male')
text(150,70,'Female')
We can examine the bivariate scatter plots of the demographic
variables from the spi data set using the pairs.panels
function. This is normally done for more continuous data, but it will
show patterns of relationships for these categorical data. A preliminary
view of the data suggests that we should plot age, p1edu, p2edu, and
education against each other. Thus, we select those variables to plot.
Because we have 4000 data points, we will smooth the data so that we can
see the results more clearly. By default, pairs.panels
does
not smooth the data. We will show it both ways.
pairs.panels(spi[c(1,4:6)],main= "Age, parental education, and education")
pairs.panels(spi[c(1,4:6)],smoother=TRUE,main= "Age, parental education, and education")
The red lines in each graph are “lowess” (locally smoothed) regressions. These give a better feeling for the results.
Examining these correlations, we find that the educational levels of the two parents tend to correlate, and that education is highly correlated with age. Does this mean that education makes you older? No.
Any data set will have variation. The box plot and violinBy plots are showing this variation. What if we took another set of observations? How much will the mean vary from sample to sample? Just as a sample has a mean and a variance, so do samples of samples. Although the underlying distribution for any variable might not be normal, the distribution of means of these samples will tend towards normality. This is known as the “central limit theorem”. The distribution of these sample means will vary by the standard error of the mean, where \(se_{\bar{x}} = \frac{\sigma_{x}}{\sqrt{n}}\).
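We can check this formula by simulation; a minimal sketch using only base R, drawing from a standard normal so that \(\sigma = 1\) and the theoretical standard error is \(1/\sqrt{n}\):

```r
set.seed(42)
n <- 50
sample.means <- replicate(10000, mean(rnorm(n))) # means of 10,000 samples of size n
sd(sample.means)   # empirical standard error of the mean
1/sqrt(n)          # theoretical value: sigma/sqrt(n) with sigma = 1
```

The two numbers should agree to two decimal places.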
The help pages for the error.bars
function show this
effect. We create 20 random samples of size 50 taken from a normal
distribution.
set.seed(42) #to get the same result every time
x <- replicate(20,rnorm(50)) #create 20 sets of 50 random normal numbers
boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,add=TRUE) #overlay the box plots with standard error plots
abline(h=0) #put in a horizontal line at the expected value (0).
When we examine this plot, we notice that 19 of the 20 groups have 95% confidence intervals that include the population value (0). However, one of the groups (18) has a confidence interval that does not include 0. This is indeed what we would expect: the 95% confidence intervals of the sample means fail to overlap the population value 5% of the time. Try this again with a different seed. What happens?
set.seed(17) #to get the same result every time
x <- replicate(20,rnorm(50)) #create 20 sets of 50 random normal numbers
boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,add=TRUE) #overlay the box plots with standard error plots
abline(h=0) #put in a horizontal line at the expected value (0).
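Rather than reading the coverage off the plot, we can count it directly. A sketch using the core-R t.test function (not part of the psych output) on the same 20 groups:

```r
set.seed(17)
x <- replicate(20, rnorm(50))        # the same 20 groups as above
covers <- apply(x, 2, function(grp) {
  ci <- t.test(grp)$conf.int         # 95% confidence interval for this group's mean
  ci[1] <= 0 && 0 <= ci[2]           # does the interval include the true mean (0)?
})
sum(covers)                          # how many of the 20 intervals cover 0
```

With 20 intervals at the 95% level we expect about 19 to cover the true value, but any particular seed may give a few more or fewer misses.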
One more time, but this time, do not draw the box plots. We want to scale the y axis from -1 to 1 and add a label for the x and y axes. Notice that once again, 19 out of 20 confidence intervals include the true value (0).
set.seed(47) #to get the same result every time
x <- replicate(20,rnorm(50)) #create 20 sets of 50 random normal numbers
#boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,ylim=c(-1,1),xlab="Random Groups",ylab="Means and confidence intervals",main="95% confidence intervals with cats eyes")
abline(h=0) #put in a horizontal line at the expected value (0).
set.seed(71) #to get the same result every time
x <- replicate(20,rnorm(50)) #create 20 sets of 50 random normal numbers
#boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,ylim=c(-1,1),xlab="Random Groups",ylab="Means and confidence intervals",main="95% confidence intervals without cats eyes", eyes=FALSE)
abline(h=0) #put in a horizontal line at the expected value (0).
Just as sample means differ from the population mean, so will sample correlations differ from the true population value. Unless we are doing simulations we do not know the population values, but we can still find the confidence intervals for the correlations.
The basic logic of confidence intervals is to find the estimate (in
this case the correlation) and then the confidence interval of the
estimate. Normal theory tells us that the standard error of a
correlation is \(\sqrt{\frac{1-r^2}{N-2}}\).
All of this is done in the corr.test function. (cor.test in core R will do one correlation at a time; corr.test in the psych package will do many correlations at once.)
Let's use the epi.bfi data set, which has 231 observations on 13 variables.
First describe them, then show all the correlations using pairs.panels, then find one correlation using cor.test, and then find the correlations of the first 5 variables using corr.test. We assign the results of corr.test to an object (result) because we want to print it several different ways.
describe(epi.bfi) #first get the descriptives
## vars n mean sd median trimmed mad min max range skew kurtosis se
## epiE 1 231 13.33 4.14 14 13.49 4.45 1 22 21 -0.33 -0.06 0.27
## epiS 2 231 7.58 2.69 8 7.77 2.97 0 13 13 -0.57 -0.02 0.18
## epiImp 3 231 4.37 1.88 4 4.36 1.48 0 9 9 0.06 -0.62 0.12
## epilie 4 231 2.38 1.50 2 2.27 1.48 0 7 7 0.66 0.24 0.10
## epiNeur 5 231 10.41 4.90 10 10.39 4.45 0 23 23 0.06 -0.50 0.32
## bfagree 6 231 125.00 18.14 126 125.26 17.79 74 167 93 -0.21 -0.27 1.19
## bfcon 7 231 113.25 21.88 114 113.42 22.24 53 178 125 -0.02 0.23 1.44
## bfext 8 231 102.18 26.45 104 102.99 22.24 8 168 160 -0.41 0.51 1.74
## bfneur 9 231 87.97 23.34 90 87.70 23.72 34 152 118 0.07 -0.55 1.54
## bfopen 10 231 123.43 20.51 125 123.78 20.76 73 173 100 -0.16 -0.16 1.35
## bdi 11 231 6.78 5.78 6 5.97 4.45 0 27 27 1.29 1.50 0.38
## traitanx 12 231 39.01 9.52 38 38.36 8.90 22 71 49 0.67 0.47 0.63
## stateanx 13 231 39.85 11.48 38 38.92 10.38 21 79 58 0.72 -0.01 0.76
pairs.panels(epi.bfi,gap=0) #a graphical display, bunch the plots up
cor.test(x = epi.bfi[,1],y=epi.bfi[,2])
##
## Pearson's product-moment correlation
##
## data: epi.bfi[, 1] and epi.bfi[, 2]
## t = 24.203, df = 229, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8071747 0.8806085
## sample estimates:
## cor
## 0.8479101
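To see where these numbers come from, we can compute the pieces of the cor.test output by hand, using the standard error formula given above; a sketch assuming epi.bfi has been loaded from psychTools:

```r
library(psychTools)                  # for the epi.bfi data set
data(epi.bfi)
r <- cor(epi.bfi$epiE, epi.bfi$epiS) # the observed correlation
n <- nrow(epi.bfi)                   # 231 observations
se <- sqrt((1 - r^2)/(n - 2))        # normal-theory standard error of r
t.val <- r/se                        # this is the t statistic cor.test reports
round(c(r = r, se = se, t = t.val), 3)
```

The r and t values should match the cor.test output shown above.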
result <- corr.test(epi.bfi[1:5])
result #this just prints the result
## Call:corr.test(x = epi.bfi[1:5])
## Correlation matrix
## epiE epiS epiImp epilie epiNeur
## epiE 1.00 0.85 0.80 -0.22 -0.18
## epiS 0.85 1.00 0.43 -0.05 -0.22
## epiImp 0.80 0.43 1.00 -0.24 -0.07
## epilie -0.22 -0.05 -0.24 1.00 -0.25
## epiNeur -0.18 -0.22 -0.07 -0.25 1.00
## Sample Size
## [1] 231
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## epiE epiS epiImp epilie epiNeur
## epiE 0.00 0.00 0.00 0.00 0.02
## epiS 0.00 0.00 0.00 0.53 0.00
## epiImp 0.00 0.00 0.00 0.00 0.53
## epilie 0.00 0.43 0.00 0.00 0.00
## epiNeur 0.01 0.00 0.26 0.00 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
The correlation “p values” are given below the diagonal. But because we are doing many correlations, we should adjust these for how many correlations we are examining. Many people use the “Bonferroni” adjustment, which basically multiplies each p value by the number of correlations (equivalently, divides the alpha criterion by that number). A more powerful adjustment is the “Holm” correction, which rank orders the p values from smallest to largest, does a “Bonferroni” correction on the smallest, and then adjusts the remaining p values according to how many are left. See the help page for p.adjust, which explains this.
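The two corrections are easy to compare directly with the core-R p.adjust function. A small sketch with four made-up p values (purely for illustration):

```r
p <- c(0.001, 0.01, 0.02, 0.04)      # four hypothetical raw p values
p.adjust(p, method = "bonferroni")   # each multiplied by 4: 0.004 0.04 0.08 0.16
p.adjust(p, method = "holm")         # step-down: 0.004 0.03 0.04 0.04
```

Note that Holm never gives larger adjusted values than Bonferroni, which is why it is the more powerful of the two.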
As is true of most R functions, the basic (default) output hides many other objects. We can find the names of these objects by asking for names.
names(result)
## [1] "r" "n" "t" "p" "p.adj" "se" "sef" "adjust" "sym" "ci" "ci2" "ci.adj"
## [13] "stars" "Call"
In addition, many functions will give slightly longer output if you ask for it. Here we get the confidence intervals for each correlation, as well as the ‘adjusted’ correlations. These have been adjusted for the fact that we did multiple comparisons.
print(result,short=FALSE)
## Call:corr.test(x = epi.bfi[1:5])
## Correlation matrix
## epiE epiS epiImp epilie epiNeur
## epiE 1.00 0.85 0.80 -0.22 -0.18
## epiS 0.85 1.00 0.43 -0.05 -0.22
## epiImp 0.80 0.43 1.00 -0.24 -0.07
## epilie -0.22 -0.05 -0.24 1.00 -0.25
## epiNeur -0.18 -0.22 -0.07 -0.25 1.00
## Sample Size
## [1] 231
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## epiE epiS epiImp epilie epiNeur
## epiE 0.00 0.00 0.00 0.00 0.02
## epiS 0.00 0.00 0.00 0.53 0.00
## epiImp 0.00 0.00 0.00 0.00 0.53
## epilie 0.00 0.43 0.00 0.00 0.00
## epiNeur 0.01 0.00 0.26 0.00 0.00
##
## Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
## raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## epiE-epiS 0.81 0.85 0.88 0.00 0.79 0.89
## epiE-epImp 0.75 0.80 0.84 0.00 0.73 0.86
## epiE-epili -0.34 -0.22 -0.10 0.00 -0.38 -0.06
## epiE-epiNr -0.30 -0.18 -0.05 0.01 -0.32 -0.02
## epiS-epImp 0.31 0.43 0.53 0.00 0.27 0.56
## epiS-epili -0.18 -0.05 0.08 0.43 -0.18 0.08
## epiS-epiNr -0.34 -0.22 -0.09 0.00 -0.37 -0.05
## epImp-epili -0.36 -0.24 -0.12 0.00 -0.40 -0.07
## epImp-epiNr -0.20 -0.07 0.06 0.26 -0.22 0.07
## epili-epiNr -0.37 -0.25 -0.13 0.00 -0.41 -0.08
We can create a corPlot of the correlations, showing the confidence ranges:
corPlotUpperLowerCi(result)
We have (by default) scaled the correlations by their “significance”.
Perhaps one of the most important contributions to modern statistics is the concept of the “bootstrap”. This allows us to find confidence intervals even if the underlying distributions are not normal. The basic logic is to resample (with replacement) from our observed data. This treats our initial data set as the population, and then takes many (e.g., 100 to 10,000) random resamples of the data. We then consider the distribution of these resamples.
The bootstrap was developed by Bradley Efron as a generalization of John Tukey’s ‘jackknife’. Tukey thought it should be called the shotgun because it blew up the data! The bootstrap may be used to find the distribution of any parameter and does not require normal theory. We will use it again when we look at the mediate function, as well as the confidence intervals of factor loadings.
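The resampling idea itself takes only a few lines of base R. Here is a sketch that bootstraps the mean of a deliberately non-normal (exponential) variable; this is hypothetical data to show the logic, not the psych implementation:

```r
set.seed(42)
x <- rexp(100, rate = 1)                  # a skewed, non-normal sample (true mean = 1)
boot.means <- replicate(1000,
  mean(sample(x, replace = TRUE)))        # resample with replacement, recompute the mean
quantile(boot.means, c(0.025, 0.975))     # 95% bootstrap confidence interval for the mean
```

No normality assumption was needed: the interval comes straight from the empirical distribution of the resampled means.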
The cor.ci
function will do the bootstrap for
correlations. By default, it also draws a heat map of the correlations,
scaling the correlations by “significance”. We can also draw this plot
with the lower and upper bounds of the correlations using the
corPlotUpperLowerCi
function with the output from the
cor.ci
function.
boot.result <- cor.ci(epi.bfi[1:5],n.iter=1000) #the same data as before
boot.result #show the results
## Call:corCi(x = x, keys = keys, n.iter = n.iter, p = p, overlap = overlap,
## poly = poly, method = method, plot = plot, minlength = minlength,
## n = n)
##
## Coefficients and bootstrapped confidence intervals
## epiE epiS epImp epili epiNr
## epiE 1.00
## epiS 0.85 1.00
## epiImp 0.80 0.43 1.00
## epilie -0.22 -0.05 -0.24 1.00
## epiNeur -0.18 -0.22 -0.07 -0.25 1.00
##
## scale correlations and bootstrapped confidence intervals
## lower.emp lower.norm estimate upper.norm upper.emp p
## epiE-epiS 0.81 0.81 0.85 0.88 0.88 0.00
## epiE-epImp 0.75 0.75 0.80 0.85 0.85 0.00
## epiE-epili -0.33 -0.33 -0.22 -0.11 -0.10 0.00
## epiE-epiNr -0.29 -0.29 -0.18 -0.06 -0.06 0.00
## epiS-epImp 0.31 0.31 0.43 0.53 0.53 0.00
## epiS-epili -0.19 -0.18 -0.05 0.08 0.08 0.42
## epiS-epiNr -0.34 -0.35 -0.22 -0.08 -0.08 0.00
## epImp-epili -0.35 -0.36 -0.24 -0.12 -0.11 0.00
## epImp-epiNr -0.20 -0.20 -0.07 0.05 0.04 0.26
## epili-epiNr -0.37 -0.37 -0.25 -0.12 -0.12 0.00
names(boot.result) #what else is there?
## [1] "rho" "means" "sds" "tci" "ptci" "ci" "Call" "replicates"
describe(boot.result$replicates)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1000 0.85 0.02 0.85 0.85 0.02 0.77 0.90 0.12 -0.28 0.08 0
## X2 2 1000 0.80 0.03 0.80 0.80 0.02 0.71 0.87 0.16 -0.41 0.51 0
## X3 3 1000 -0.22 0.06 -0.22 -0.22 0.06 -0.40 -0.04 0.36 0.15 -0.05 0
## X4 4 1000 -0.18 0.06 -0.18 -0.18 0.06 -0.34 0.01 0.35 0.10 -0.14 0
## X5 5 1000 0.42 0.06 0.43 0.43 0.06 0.21 0.62 0.40 -0.14 0.05 0
## X6 6 1000 -0.05 0.07 -0.05 -0.05 0.06 -0.26 0.18 0.45 0.09 0.36 0
## X7 7 1000 -0.22 0.07 -0.22 -0.22 0.07 -0.42 0.00 0.42 0.10 -0.28 0
## X8 8 1000 -0.24 0.06 -0.24 -0.24 0.06 -0.43 0.02 0.45 0.14 0.19 0
## X9 9 1000 -0.07 0.06 -0.07 -0.07 0.06 -0.25 0.14 0.38 -0.01 -0.17 0
## X10 10 1000 -0.25 0.07 -0.25 -0.25 0.07 -0.44 -0.04 0.40 0.15 -0.20 0
multi.hist(boot.result$replicates[,1:9]) #show a histogram of some of the bootstrapped correlations
corPlotUpperLowerCi(boot.result)
Compare the empirical confidence regions (lower.emp and upper.emp) with the normal theory estimates (lower.norm and upper.norm): they are very similar, but not identical.
We will see the use of the bootstrap in many of the analyses we will do this quarter. It is a powerful use of cross validation.
Using your own data, or a data set chosen from the class data sets, redo these analyses.
To find example data sets, go to the class data sets directory in your browser and then add the name of the file to load it, e.g.
fn <- "https://personality-project.org/courses/350/datasets/hp.csv"
example <- read.file(fn)
## Data from the .csv file https://personality-project.org/courses/350/datasets/hp.csv has been loaded.