---
title: '350 Week 3: Estimating Reliability'
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=100) #this sets the width of the output; the default of 80 is too narrow
```

# Estimating internal consistency reliability

(Adapted from Revelle, W., & Condon, D. M. (2019). Reliability from α to ω: A tutorial. Psychological Assessment, 31(12), 1395–1411. https://doi.org/10.1037/pas0000754. For an open access preprint, see https://www.personality-project.org/revelle/publications/rc.pa.19.pdf)

Reliability is a fundamental problem for measurement in all of science, for "(a)ll measurement is befuddled by error" \citep[p. 294]{mcnemar:46}. Perhaps because psychological measures are more befuddled than those of the other natural sciences, psychologists have long studied the problem of reliability \citep{spearman:rho,kuder:37,guttman:45,lord:55,cronbach:51,mcdonald:tt}, and this fascination continues to this day \citep{sijtsma:09,rz:09,bentler:09,mcneish:17}. Unfortunately, although recent advances in reliability theory have gone far beyond the earlier contributions, much of this literature is more technical than readable and is aimed at the specialist rather than the practitioner. This is unfortunate, for all scientists need to appreciate the problems and importance of reliability.

Issues of reliability are fundamental to understanding how correlations between observed variables are (attenuated) underestimates of the relationships between the underlying constructs, how observed estimates of a person's score are overestimates of their latent score, and how to estimate the confidence intervals around any particular measurement. Reliability theory is not just for the psychometrician estimating latent variables; it is for the baseball manager trying to predict how well a high performing player will perform next year, for the clinician asking how well one rater agrees with another in the diagnosis of a patient, and for the investor wondering why stock market advisors underperform the market. Understanding the many ways to estimate reliability, as well as the ways to use these estimates, allows one to better assess individuals and to evaluate selection and prediction techniques.

The fundamental question in reliability is to what extent scores measured at one time and one place with one instrument predict scores at another time, in another place, and perhaps with a different instrument. That is, given a person's score on test 1 at time 1, what do we expect that person to do on another test at time 2? The naive belief that they will do just as well on the second measure as they did on the first leads to such mistakes as the belief that punishment improves and rewards diminish subsequent performance \citep{kahneman}, or the belief in the "Sports Illustrated jinx". More formally, the expectation for the second measure is just the regression of the observations at time 2 on the observations at time 1. If both the time 1 and time 2 measures are equally "befuddled by error", that regression is less than perfect, and extreme scores at time 1 are expected to be closer to the mean at time 2.
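To make the regression effect concrete, here is a minimal simulation sketch (the object names, the reliability of roughly .5, and the number of simulated cases are arbitrary choices for illustration, not part of the tutorial data): two scores that mix the same trait with independent noise correlate about .5, and the apparently best performers at time 1 score much closer to the mean at time 2.

```{r}
#a minimal sketch of regression to the mean with two equally "befuddled" measures
set.seed(42)                          #arbitrary seed so the sketch is reproducible
trait <- rnorm(10000)                 #the latent signal
time1 <- trait + rnorm(10000)         #observed score at time 1 = signal + noise
time2 <- trait + rnorm(10000)         #observed score at time 2 = same signal, new noise
cor(time1, time2)                     #about .5, the reliability of these observed scores
top <- time1 > quantile(time1, .95)   #select the apparently best performers at time 1
round(c(time1 = mean(time1[top]), time2 = mean(time2[top])), 2)  #time 2 mean is far less extreme
```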
The basic concept of reliability seems to be very simple: observed scores reflect an unknown mixture of signal and noise. To detect the signal, we need to reduce the noise. Reliability thus defined is just the ratio of signal to noise. The signal might be something as esoteric as a gravity wave produced by a collision of two black holes, or as prosaic as the batting average of a baseball player. The noise in gravity wave detectors includes the seismic effects of cows wandering in fields near the detector as well as passing ice cream trucks. The noise in batting averages includes the effects of opposing pitchers, variations in wind direction, and the effects of jet lag and sleep deprivation. We can enhance the signal/noise ratio by either increasing the signal or reducing the noise. Unfortunately, this classic statement of reliability ignores the need for unidimensionality of our measures and equates expected scores with construct scores, a relationship that needs to be tested rather than assumed \citep{borsboom:02}.

## Multiple forms of reliability

All tests are assumed to measure something stable or Trait-like (T), something that varies over time and reflects State (S), some specific variance that is stable but does not measure our trait of interest (s), and some residual, random error (E). Reliability is then said to be the ratio of trait variance ($V_T$) to total test variance ($V_x$). That is,
$$r_{xx} = \frac{V_T}{V_x} = \frac{V_T}{V_T + V_S + V_s + V_e}$$
The problem becomes how to find $V_T$.

### Test-retest

What is the correlation of a test with the same test given at some later time? If the delay is long enough that the State effect is minimal, and if we have no specific variance, then the correlation of the test with itself later is what we want:
$$r_{xx} = \frac{V_T}{V_x} = \frac{V_T}{V_T + V_S + V_s + V_e}$$

### Alternate Forms

If we do not want to wait for a long time, we can estimate reliability by creating another test (an alternate form) that is very similar to our first test but uses different items. By doing this, we can eliminate the effect of the specific item variance:
$$r_{xx} = \frac{V_T + V_S}{V_x} = \frac{V_T + V_S}{V_T + V_S + V_s + V_e}$$
This will not remove any state effect on our test, but it does eliminate specific and residual variance.

A perhaps crude demonstration of this is to consider the correlation between the odd and the even items of the `ability` data set. This is a set of scores on 16 ability items given as part of the SAPA project to 1525 participants. The items are of four types: verbal reasoning, numeric reasoning, matrix reasoning, and spatial rotations. These items were developed by [Condon and Revelle](http://dx.doi.org/10.1016/j.intell.2014.01.004) (2014) and discussed further in [Revelle, Dworak and Condon](https://journals.sagepub.com/doi/10.1177/0963721420922178) (2021) and [Dworak, Revelle, Doebler and Condon](https://www.sciencedirect.com/science/article/pii/S0191886920300957?via%3Dihub) (2020).

As with all data sets, we should first find its dimensions (using `dim`) and then `describe` it. We use the `rowSums` function to get our scores. Because we have missing data, we need to set `na.rm=TRUE`. Note that we should use *rowMeans* rather than *rowSums* so that scores are not confounded with how many items each person answered. This leads to subtle differences. Consider the following two response patterns, where 0 means wrong, 1 means right, and NA means did not answer:

    1 1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Both have a sum score of 2 correct out of 16, but finding the means yields .125 versus 1.0.
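As a quick check of that arithmetic, here is a minimal sketch using a small hypothetical response matrix built from the two patterns above (the object and row names are just for illustration):

```{r}
#a two person toy example: the same sum score, very different mean scores
toy <- rbind(person1 = c(1, 1, rep(0, 14)),
             person2 = c(1, 1, rep(NA, 14)))
rowSums(toy, na.rm=TRUE)   #both rows sum to 2
rowMeans(toy, na.rm=TRUE)  #.125 for person1, 1.0 for person2
```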
```{r}
library(psych)
library(psychTools)
dim(ability)
describe(ability)
score1 <- rowSums(ability[,c(1,3,5,7,9,11,13,15)],na.rm=TRUE)  #the sum scores for the odd items
score2 <- rowSums(ability[,seq(from=2, to=16, by=2)],na.rm=TRUE)  #the sum scores for the even items
scorem1 <- rowMeans(ability[,c(1,3,5,7,9,11,13,15)],na.rm=TRUE)  #the mean scores for the odd items
scorem2 <- rowMeans(ability[,seq(from=2, to=16, by=2)],na.rm=TRUE)  #the mean scores for the even items
scores.df <- data.frame(score1,score2,scorem1,scorem2)  #form the data.frame
pairs.panels(scores.df)  #show it graphically
lowerCor(scores.df)  #or show the correlations in a neat array, rounded to 2 decimals
#or, just use the traditional cor function
cor(scores.df)  #does not work because of the missing values
rxx <- cor(scores.df,use="pairwise")  #this does
rodd.even <- cor(score1,score2)
cor2(scorem1,scorem2)  #this finds the correlations using pairwise deletion and rounds them
```

This is the correlation we would expect between two 8 item tests. That is to say, if we had used just 8 items, this is how highly we would expect them to correlate with another 8 item test.

### Split half (adjusted for test length)

But since we created 16 items, what would the correlation be between two 16 item tests? With a bit of algebra, we can predict it using a formula developed by Spearman and Brown:
$$r_{xx} = \frac{2r}{1+r}$$
or, in our case

```{r}
r_split <- 2*rodd.even/(1 + rodd.even)
round(r_split,2)
```

But our split was the odd versus the even items. What happens if we consider all possible splits and then plot the histogram of those split half correlations? For 16 items or fewer, we can do this quite rapidly. For larger tests, the number of possible splits increases drastically, and we instead sample 10,000 random splits. The help page for `splitHalf` shows how to draw the output.

```{r}
#show the histogram of all possible splits of the ability test
sp <- splitHalf(ability,raw=TRUE)  #this saves the results
hist(sp$raw,breaks=101,xlab="Split Half Reliability",ylab="Frequency of split half reliability",
     main="SplitHalf reliabilities of a test with 16 ability items")
sp  #show the results
names(sp)  #what else is here?
sp$minAB  #show the items for the worst split
sp$maxAB  #show the items for the best split
```

# Why do the splits differ in their correlation?

The items came in four groups of item types. If we group the first two item types and compare them to the last two item types, we have the most dissimilar pair of tests. Alternatively, if we group two items from each item type and compare them to the other two items from each type, we have the most similar pair of tests.

```{r}
firstm <- rowMeans(ability[,1:8],na.rm=TRUE)
secondm <- rowMeans(ability[,9:16],na.rm=TRUE)
new.scores.df <- data.frame(scores.df,firstm,secondm)
pairs.panels(new.scores.df)
#or, if we don't want to graph it
lowerCor(new.scores.df)
```

### Internal consistency (with the assumption of tau equivalence)

A more conventional procedure, and one that does not show us the range of possible splits, is to use the `alpha` function to find a coefficient that has come to be known as coefficient $\alpha$ (Cronbach, 1951). The `alpha` function will also score our items for us to form a linear composite of the items.
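For reference, under the assumption of tau equivalence coefficient $\alpha$ is a simple function of the number of items $k$, the item variances $V_i$, and the total test variance $V_x$; the raw $\alpha$ reported below is an estimate of this ratio:
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} V_i}{V_x}\right)$$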
```{r}
al <- alpha(ability)
summary(al)  #see the values of the alpha output (note that this summarizes just some of the objects)
```

Although `summary(al)` shows the alpha statistic, we can get more information if we print the object:

```{r}
al  #this shows more of the output
```

And we can get even more information if we examine the names of what is in the `al` object, and then `describe` the `scores` object.

```{r}
names(al)  #what are the possible objects?
describe(al$scores)  #descriptive statistics of the resulting scores
```

### The problem of reverse keying

The ability items were all keyed in the same direction (correct vs. incorrect). Some personality items need to be reverse scored ("do you like to go to lively parties" is the opposite of "do you tend to keep quiet at parties"). By default, `alpha` will not reverse key items, but you can ask it to do so automatically, or you can specify which items to reverse key. Consider the first 5 items of the `bfi`, which are five items measuring "Agreeableness". If we do not reverse score them, we find a very poor alpha (and a warning that we might want to do so). If we turn on the `check.keys` option, `alpha` will automatically reverse items that correlate negatively with the total score.

The `bfi` data set has 25 "Big 5" personality items given as part of the SAPA project. To see the characteristics of this file, we can `dim` it, `describe` it, and then look up the item names and content from an associated dictionary (we use the `lookupFromKeys` function from `psychTools`; it requires a `dictionary` of the items).

```{r}
#first show the correlations and the item contents
dim(bfi)
describe(bfi[1:5])
lowerCor(bfi[1:5])
lookupFromKeys(bfi.keys[1],dictionary=bfi.dictionary[1:4],n=5)
```

We now compare not reversing some items with reversing them. To reverse an item is the same as subtracting it from the maximum response + the minimum response (in this case, because the items range from 1 to 6, we subtract from 7).

```{r}
alpha(bfi[1:5])  #this will produce a terrible result
alpha(bfi[1:5], check.keys = TRUE)  #a much better result
```

# Scoring multiple sets of scales at the same time

Much more typical is the investigator with multiple scales to score from multiple items. With the use of a scoring `key` you can tell `R` which items go in which direction on which scale. This is detailed in the help page for the `scoreItems` function:

Given a data.frame or matrix of n items and N observations and a list of the directions to score them (a keys.list with k keys), `scoreItems` finds the sum scores or average scores for each person on each scale. In addition, it reports Cronbach's alpha, Guttman's Lambda 6, the average r, the scale intercorrelations, and the item by scale correlations (raw and corrected for item overlap). Missing values may be replaced with the item median or mean if desired. Items may be keyed 1 (score it), -1 (reverse score it), or 0 (do not score it); negatively keyed items will be reverse scored. Although prior versions used a keys matrix, it is now recommended to just use a list of scoring keys (see `make.keys` for a convenient way to make the keys). If the input is a square matrix, then it is assumed to be a covariance or correlation matrix; scores are not found, but the item statistics are reported (similar functionality to `cluster.cor`). `response.frequencies` reports the frequency of endorsement of each response category for polytomous or multiple choice items.

The parameters for the `scoreItems` function are laid out below; the example uses the `bfi` data set.
`scoreItems(keys, items, totals = FALSE, ilabels = NULL, missing=TRUE, impute="median", delete=TRUE, min = NULL, max = NULL, digits = 2, n.obs=NULL, select=TRUE)`

First we need to see what the items in this data set are; then we need to create a set of keys that tell us which way to score the items.

```{r}
data(bfi)
names(bfi)
keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
   conscientious=c("C1","C2","C3","-C4","-C5"),
   extraversion=c("-E1","-E2","E3","E4","E5"),
   neuroticism=c("N1","N2","N3","N4","N5"),
   openness = c("O1","-O2","O3","O4","-O5"))
keys.list  #show this as a list
lookupFromKeys(keys.list,dictionary=bfi.dictionary[1:5],5)  #show the items and their content
```

Using this keys.list we can then call `scoreItems`:

```{r}
bfi.scores <- scoreItems(keys=keys.list,items=bfi,min=1,max=6)
bfi.scores  #show the summary statistics
names(bfi.scores)  #what other objects are in bfi.scores?
```

What does all this output mean? Most of it is obvious, except perhaps for the scale intercorrelations. Note how the matrix is not symmetric: the correlations above the diagonal have been 'corrected' for scale reliability. This is the correction for attenuation originally developed by Spearman in 1904, and is just the observed correlation divided by the square root of the product of the two scale reliabilities, $r_{xy}/\sqrt{r_{xx} r_{yy}}$.

To use these scores, we can define a new object `bfi.scales` and then `describe` it and do a `pairs.panels` of the results. Note how in the `pairs.panels` function we specify the plotting character (pch) to be `.`.

```{r}
bfi.scales <- bfi.scores$scores
describe(bfi.scales)
pairs.panels(bfi.scales,pch='.')
```

With this many data points, it is hard to see the density of the plot, so we do it again with the `smoother` option.

```{r}
pairs.panels(bfi.scales,smoother=TRUE)
```

### Reliability based upon just the correlation matrix

Some of the more important statistics about reliability are actually summaries of the relationships between the variables. That is, we can find them from the correlations rather than from the raw data. These estimates will be "standardized", because they treat the data as if they were standardized. These procedures are particularly relevant for those of us who have a great deal of missing data. First find the correlations of the items, and then run `alpha` or `scoreItems` on them.

```{r}
R.ab <- lowerCor(ability)  #this shows the correlations
R.bfi <- cor(bfi,use="pairwise")  #this does not show the bfi correlations
```

Then use `alpha` on the ability correlation matrix. Note that the standardized alpha and the raw alpha are now identical, because we are doing everything on standardized scores.

```{r}
ab.alpha <- alpha(R.ab)
ab.alpha  #show the results
```

Now do the same thing for the bfi correlations, but just show the summary.

```{r}
bfi.cor.scores <- scoreItems(keys.list,items=R.bfi)
summary(bfi.cor.scores)  #compare to the prior results
round(bfi.cor.scores$corrected - bfi.scores$corrected,2)
```

Perhaps obviously, working from the correlation matrix does not allow us to find scores, nor the descriptive statistics of the scores.

## Further adventures with multiple scales

We can see from the correlation matrix that extraversion, neuroticism and openness are barely correlated. What if we made up a scale of those 15 items and allowed the items to be flipped to be most consistent with the total score? (This is straightforward but dangerous.) The resulting $\alpha$ of this false composite appears to be fairly good, but when we examine all possible splits, we discover that we probably should not have done this.
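One reason such a composite can appear acceptable is that the standardized $\alpha$ depends only on the number of items $k$ and the average inter-item correlation $\bar{r}$:
$$\alpha_{std} = \frac{k \bar{r}}{1 + (k-1)\bar{r}}$$
With $k = 15$, even a modest average correlation produces a respectable $\alpha$, whether or not the items measure a single dimension.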
```{r}
alpha(bfi[11:25],check.keys=TRUE)  #this reverses some items
sp.bfi <- splitHalf(bfi[11:25],raw=TRUE)
sp.bfi  #show the results
hist(sp.bfi$raw,breaks=101)  #by specifying the breaks we can increase the resolution
```

By examining the split halves, we can see that some are as low as .36, which suggests that the scale is not measuring one thing. (It is not.) We will explore this problem of multidimensionality next.

\bibliography{../../all}