# Estimating internal consistency reliability

(Adapted from Revelle, W., & Condon, D. M. (2019). Reliability from α to ω: A tutorial. Psychological Assessment, 31(12), 1395–1411. https://doi.org/10.1037/pas0000754)

For an open access preprint, see https://www.personality-project.org/revelle/publications/rc.pa.19.pdf

Reliability is a fundamental problem for measurement in all of science, for “(a)ll measurement is befuddled by error”. Perhaps because psychological measures are more befuddled than those of the other natural sciences, psychologists have long studied the problem of reliability, and this fascination continues to this day. Unfortunately, although recent advances in reliability theory have gone far beyond the earlier contributions, much of this literature is more technical than readable and is aimed at the specialist rather than the practitioner. This is unfortunate, for all scientists need to appreciate the problems and importance of reliability.

Issues of reliability are fundamental to understanding how correlations between observed variables are (attenuated) underestimates of the relationships between the underlying constructs, how observed estimates of a person's score are overestimates of their latent score, and how to estimate the confidence intervals around any particular measurement. Reliability theory is not just for the psychometrician estimating latent variables; it is for the baseball manager trying to predict how well a high-performing player will perform next year, for the clinician asking how well one rater agrees with another in the diagnosis of a patient, and for explaining why stock market advisors underperform the market. Understanding the many ways to estimate reliability, as well as the ways to use these estimates, allows one to better assess individuals and to evaluate selection and prediction techniques.

The fundamental question in reliability is: to what extent do scores measured at one time and one place with one instrument predict scores at another time, in another place, and perhaps with a different instrument? That is, given a person's score on test 1 at time 1, what do we expect that person to do on another test at time 2? The naive belief that they will do just as well on the second measure as they did on the first leads to such mistakes as the belief that punishment improves and reward diminishes subsequent performance, or the belief in the “Sports Illustrated jinx”. More formally, the expected score on a second measure is just the regression of observations at time 2 on the observations at time 1. If both the time 1 and time 2 measures are equally “befuddled by error”, extreme scores at time 1 are expected to be less extreme at time 2.
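In standardized units this regression takes a particularly simple form. If \(z_1\) and \(z_2\) are the standardized scores at the two times and \(r_{12}\) is their correlation, the best (least squares) prediction is

\[\hat{z}_2 = r_{12} z_1 .\]

Because \(|r_{12}| \le 1\), predicted scores are always less extreme than observed scores. This is regression to the mean, and it is why outstanding performance at time 1 tends to be followed by merely good performance at time 2.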

The basic concept of reliability seems to be very simple: observed scores reflect an unknown mixture of signal and noise. To detect the signal, we need to reduce the noise. Reliability thus defined is just the ratio of signal to noise. The signal might be something as esoteric as a gravity wave produced by the collision of two black holes, or as prosaic as the batting average of a baseball player. The noise in gravity wave detectors includes the seismic effects of cows wandering in fields near the detector as well as passing ice cream trucks. The noise in batting averages includes the effects of opposing pitchers, variations in wind direction, and the effects of jet lag and sleep deprivation. We can enhance the signal/noise ratio by either increasing the signal or reducing the noise. Unfortunately, this classic statement of reliability ignores the need for unidimensionality of our measures and equates expected scores with construct scores, a relationship that needs to be tested rather than assumed.

## Multiple forms of reliability

All tests are assumed to measure something stable or Trait-like (T), something that varies over time and reflects State (S), some specific variance that is stable but does not measure our trait of interest (s), and some residual, random error (E). Reliability is then said to be the ratio of trait variance (\(V_T\)) to total test variance (\(V_x\)). That is,

\[r_{xx} = \frac{V_T}{V_x} = \frac{V_T}{V_T + V_S + V_s + V_e} \]

The problem becomes how to find \(V_T\).

## Test-retest

What is the correlation of a test with the same test given at some later time? If the delay is long enough that the State effect is minimal, and if we have no specific variance, then the correlation of a test with the same test later is what we want:

\[r_{xx} = \frac{V_T}{V_x}=\frac{V_T}{V_T + V_S + V_s + V_e} \]

## Alternate Forms

If we do not want to wait a long time, we can estimate reliability by creating another test (an alternate form) that is very similar to our first test but uses different items. By doing this, we can eliminate the effect of the specific item variance, and

\[r_{xx} = \frac{V_T}{V_x} = \frac{V_T+ V_S }{V_T + V_S + V_s + V_e} \]

This will not remove any state effect on our test, but it does eliminate specific and residual variance.

A perhaps crude demonstration of this is to consider the correlation between the odd and even items of the ability data set. This is a set of scores on 16 ability items given as part of the SAPA project to 1525 participants. The items are of four types: verbal reasoning, numeric reasoning, matrix reasoning, and spatial rotations. These items were developed by Condon and Revelle (2014) and discussed further in Revelle, Dworak and Condon (2021) and Dworak, Revelle, Doebler and Condon (2020).

As with all data sets, we should first find its dimensions (using dim) and then describe it.

We use the rowSums function to get our scores. Because we have missing data, we need to set na.rm=TRUE. Note that we should use rowMeans rather than rowSums so that scores are not confounded with the number of items each person answered.

This leads to subtle differences.

Consider the following response patterns where 0 means wrong, 1 means right, and NA means did not answer:

1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Both response patterns have a sum score of 2 correct out of 16, but finding the means yields .125 for the first versus 1.0 for the second.
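These two cases can be checked directly in base R (a minimal sketch; the two rows correspond to the response patterns above):

```r
# two response patterns on a 16 item test: same sum, different mean
x <- rbind(p1 = c(1, 1, rep(0, 14)),
           p2 = c(1, 1, rep(NA, 14)))
rowSums(x, na.rm = TRUE)    # both sums are 2
rowMeans(x, na.rm = TRUE)   # p1 = 0.125, p2 = 1
```

The sum scores treat the unanswered items of the second pattern as wrong, while the mean scores use only the items actually answered.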

library(psych)
library(psychTools)
dim(ability)
## [1] 1525   16
describe(ability)
##           vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
## reason.4     1 1442 0.68 0.47      1    0.72   0   0   1     1 -0.75    -1.44 0.01
## reason.16    2 1463 0.73 0.45      1    0.78   0   0   1     1 -1.02    -0.96 0.01
## reason.17    3 1440 0.74 0.44      1    0.80   0   0   1     1 -1.08    -0.84 0.01
## reason.19    4 1456 0.64 0.48      1    0.68   0   0   1     1 -0.60    -1.64 0.01
## letter.7     5 1441 0.63 0.48      1    0.67   0   0   1     1 -0.56    -1.69 0.01
## letter.33    6 1438 0.61 0.49      1    0.63   0   0   1     1 -0.43    -1.82 0.01
## letter.34    7 1455 0.64 0.48      1    0.68   0   0   1     1 -0.59    -1.65 0.01
## letter.58    8 1438 0.47 0.50      0    0.46   0   0   1     1  0.12    -1.99 0.01
## matrix.45    9 1458 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
## matrix.46   10 1470 0.57 0.50      1    0.59   0   0   1     1 -0.28    -1.92 0.01
## matrix.47   11 1465 0.64 0.48      1    0.67   0   0   1     1 -0.57    -1.67 0.01
## matrix.55   12 1459 0.39 0.49      0    0.36   0   0   1     1  0.45    -1.80 0.01
## rotate.3    13 1456 0.20 0.40      0    0.13   0   0   1     1  1.48     0.19 0.01
## rotate.4    14 1460 0.22 0.42      0    0.15   0   0   1     1  1.34    -0.21 0.01
## rotate.6    15 1456 0.31 0.46      0    0.27   0   0   1     1  0.80    -1.35 0.01
## rotate.8    16 1460 0.19 0.39      0    0.12   0   0   1     1  1.55     0.41 0.01
score1 <- rowSums(ability[,c(1,3,5,7,9,11,13,15)],na.rm=TRUE)  #the sum scores for the odds
score2 <- rowSums(ability[,seq(from=2, to = 16, by=2)],na.rm=TRUE)  # the sum scores for  the evens
scorem1 <- rowMeans(ability[,c(1,3,5,7,9,11,13,15)],na.rm=TRUE)  # the means scores for the odds
scorem2 <- rowMeans(ability[,seq(from=2, to = 16, by=2)],na.rm=TRUE)  # the mean scores for the evens
scores.df <- data.frame(score1,score2,scorem1, scorem2)  #form the data.frame
pairs.panels(scores.df)  #show it graphically

lowerCor(scores.df)  #or show them in a neat array, rounded to 2 decimals
##         scor1 scor2 scrm1 scrm2
## score1  1.00                   
## score2  0.76  1.00             
## scorem1 0.95  0.70  1.00       
## scorem2 0.71  0.96  0.73  1.00
#or, just use the traditional cor function
cor(scores.df)   #does not work because of missing values
##            score1    score2 scorem1 scorem2
## score1  1.0000000 0.7576345      NA      NA
## score2  0.7576345 1.0000000      NA      NA
## scorem1        NA        NA       1      NA
## scorem2        NA        NA      NA       1
rxx = cor(scores.df,use="pairwise")  #this does
rodd.even <- cor(score1,score2)
cor2(scorem1,scorem2) #this finds correlations using pairwise deletion and rounds
## [1] 0.73

This would be the correlation we would expect between two 8-item tests. That is to say, if we had used just 8 items, this is what we would expect them to correlate with another 8-item test.

## Split half (adjusted for test length)

But since we created 16 items, what would the correlation be between two 16-item tests? With a bit of algebra, we can predict it using a formula developed (independently) by Spearman and Brown:

\[r_{xx} = \frac{2r}{1+r}\]

or, in our case

r_split = 2*rodd.even/(1 +  rodd.even)
round(r_split,2)
## [1] 0.86
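Doubling the length is a special case of the more general Spearman–Brown prophecy formula. If a test is lengthened by a factor of k with comparable items, the predicted reliability is

\[r_{kk} = \frac{k\,r}{1 + (k-1)r}\]

where r is the reliability (here, the split-half correlation) of the original test; setting k = 2 recovers the doubling formula.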

But our split was the odd versus the even items. What happens if we consider all possible splits and then plot the histogram of those splits? For 16 items or fewer, we can do this quite rapidly. For larger tests, the number of possible splits increases drastically, so we instead sample 10,000 random splits. We follow the example in the splitHalf help page to draw the output.

#to show the histogram of all possible splits for the ability test
sp <- splitHalf(ability,raw=TRUE)  #this saves the results
hist(sp$raw, breaks=101, xlab="Split Half Reliability", ylab="Frequency of split half reliability",
     main="Split half reliabilities of a test with 16 ability items")

sp  #and show the results
## Split half reliabilities  
## Call: splitHalf(r = ability, raw = TRUE)
## 
## Maximum split half reliability (lambda 4) =  0.87
## Guttman lambda 6                          =  0.84
## Average split half reliability            =  0.83
## Guttman lambda 3 (alpha)                  =  0.83
## Guttman lambda 2                          =  0.83
## Minimum split half reliability  (beta)    =  0.73
## Average interitem r =  0.23  with median =  0.21
##                                              2.5% 50% 97.5%
##  Quantiles of split half reliability      =  0.77 0.83 0.86
names(sp)  #what else is here?
##  [1] "maxrb"   "minrb"   "maxAB"   "minAB"   "meanr"   "av.r"    "med.r"   "alpha"   "lambda2"
## [10] "lambda6" "raw"     "ci"      "covar"   "Call"
sp$minAB  #show the items for the worst split
## $A
## [1] "matrix.45" "matrix.46" "matrix.47" "matrix.55" "rotate.3"  "rotate.4"  "rotate.6"  "rotate.8" 
## 
## $B
## [1] "reason.4"  "reason.16" "reason.17" "reason.19" "letter.7"  "letter.33" "letter.34" "letter.58"
sp$maxAB  #show the items for the best split
## $A
## [1] "reason.4"  "reason.16" "letter.33" "letter.58" "matrix.45" "matrix.47" "rotate.4"  "rotate.6" 
## 
## $B
## [1] "reason.17" "reason.19" "letter.7"  "letter.34" "matrix.46" "matrix.55" "rotate.3"  "rotate.8"

## Why do the splits differ in their correlation?

The items come in four groups of item types. If we group the first two item types together and compare them to the last two item types, we have the most dissimilar pair of tests. Alternatively, if we take two items from each type and compare them to the remaining two items from each type, we have the most similar pair of tests.

firstm <- rowMeans(ability[,1:8],na.rm=TRUE)
secondm <- rowMeans(ability[,9:16],na.rm=TRUE)
new.scores.df <- data.frame(scores.df,firstm,secondm)
pairs.panels(new.scores.df)

#or, if we don't want to graph it
lowerCor(new.scores.df)
##         scor1 scor2 scrm1 scrm2 frstm scndm
## score1  1.00                               
## score2  0.76  1.00                         
## scorem1 0.95  0.70  1.00                   
## scorem2 0.71  0.96  0.73  1.00             
## firstm  0.80  0.78  0.84  0.81  1.00       
## secondm 0.77  0.78  0.80  0.82  0.56  1.00

## Internal consistency (with the assumption of tau equivalence)

A more conventional procedure, and one that does not show us the range of possible splits, is to use the alpha function to find a coefficient that has come to be known as coefficient \(\alpha\) (Cronbach, 1951). The alpha function will also score our items for us to form a linear composite of the items.

al <- alpha(ability)
summary(al)  #see the values of the alpha output (note this summarizes just some of the objects)
## 
## Reliability analysis   
##  raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.83      0.83    0.84      0.23 4.9 0.0064 0.51 0.25     0.21

Although the summary(al) shows the alpha statistic, we can get more information if we print the object:

al  #this shows more of the output
## 
## Reliability analysis   
## Call: alpha(x = ability)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.83      0.83    0.84      0.23 4.9 0.0064 0.51 0.25     0.21
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt     0.82  0.83  0.84
## Duhachek  0.82  0.83  0.84
## 
##  Reliability if an item is dropped:
##           raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## reason.4       0.82      0.82    0.82      0.23 4.5   0.0069 0.0060  0.21
## reason.16      0.82      0.82    0.83      0.24 4.7   0.0067 0.0061  0.22
## reason.17      0.82      0.82    0.82      0.23 4.5   0.0069 0.0058  0.21
## reason.19      0.82      0.82    0.83      0.24 4.6   0.0067 0.0061  0.21
## letter.7       0.82      0.82    0.82      0.23 4.5   0.0068 0.0057  0.21
## letter.33      0.82      0.82    0.83      0.24 4.6   0.0068 0.0060  0.21
## letter.34      0.82      0.82    0.82      0.23 4.5   0.0069 0.0056  0.21
## letter.58      0.82      0.82    0.82      0.23 4.5   0.0069 0.0062  0.21
## matrix.45      0.82      0.83    0.83      0.24 4.7   0.0066 0.0060  0.23
## matrix.46      0.82      0.82    0.83      0.24 4.7   0.0067 0.0060  0.22
## matrix.47      0.82      0.82    0.83      0.24 4.6   0.0067 0.0063  0.21
## matrix.55      0.83      0.83    0.83      0.24 4.8   0.0065 0.0058  0.23
## rotate.3       0.82      0.82    0.82      0.23 4.6   0.0067 0.0045  0.23
## rotate.4       0.82      0.82    0.82      0.23 4.5   0.0068 0.0046  0.22
## rotate.6       0.82      0.82    0.82      0.23 4.5   0.0068 0.0051  0.22
## rotate.8       0.82      0.82    0.83      0.24 4.6   0.0067 0.0048  0.23
## 
##  Item statistics 
##              n raw.r std.r r.cor r.drop mean   sd
## reason.4  1442  0.58  0.58  0.54   0.50 0.68 0.47
## reason.16 1463  0.50  0.50  0.44   0.41 0.73 0.45
## reason.17 1440  0.57  0.57  0.54   0.49 0.74 0.44
## reason.19 1456  0.53  0.52  0.47   0.43 0.64 0.48
## letter.7  1441  0.57  0.56  0.52   0.48 0.63 0.48
## letter.33 1438  0.54  0.53  0.48   0.44 0.61 0.49
## letter.34 1455  0.58  0.57  0.54   0.49 0.64 0.48
## letter.58 1438  0.58  0.57  0.52   0.48 0.47 0.50
## matrix.45 1458  0.49  0.48  0.42   0.38 0.55 0.50
## matrix.46 1470  0.50  0.49  0.43   0.40 0.57 0.50
## matrix.47 1465  0.53  0.52  0.47   0.43 0.64 0.48
## matrix.55 1459  0.43  0.42  0.35   0.32 0.39 0.49
## rotate.3  1456  0.52  0.54  0.51   0.44 0.20 0.40
## rotate.4  1460  0.56  0.58  0.56   0.48 0.22 0.42
## rotate.6  1456  0.55  0.56  0.53   0.46 0.31 0.46
## rotate.8  1460  0.49  0.51  0.47   0.41 0.19 0.39
## 
## Non missing response frequency for each item
##              0    1 miss
## reason.4  0.32 0.68 0.05
## reason.16 0.27 0.73 0.04
## reason.17 0.26 0.74 0.06
## reason.19 0.36 0.64 0.05
## letter.7  0.37 0.63 0.06
## letter.33 0.39 0.61 0.06
## letter.34 0.36 0.64 0.05
## letter.58 0.53 0.47 0.06
## matrix.45 0.45 0.55 0.04
## matrix.46 0.43 0.57 0.04
## matrix.47 0.36 0.64 0.04
## matrix.55 0.61 0.39 0.04
## rotate.3  0.80 0.20 0.05
## rotate.4  0.78 0.22 0.04
## rotate.6  0.69 0.31 0.05
## rotate.8  0.81 0.19 0.04

And we can get even more information if we examine the names of the objects within al, and then describe the scores object:

names(al)   #what are the possible objects?
##  [1] "total"         "alpha.drop"    "item.stats"    "response.freq" "keys"          "scores"       
##  [7] "nvar"          "boot.ci"       "boot"          "feldt"         "Unidim"        "var.r"        
## [13] "Fit"           "call"          "title"
describe(al$scores) #descriptive statistics of the results
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 1509 0.51 0.25    0.5    0.51 0.28   0   1     1 0.02    -0.83 0.01

## The problem of reverse keying

The ability items were all keyed in the same direction (correct vs. incorrect). Some personality items need to be reverse scored (“do you like to go to lively parties” is the opposite of “do you tend to keep quiet at parties”). By default, alpha will not reverse key items, but you can ask it to do so automatically, or you can specify which items to reverse key.

Consider the first 5 items of the bfi, which are five items measuring “Agreeableness”. If we do not reverse score them, we find a very poor alpha (and a warning that we might want to reverse some items). If we turn on the check.keys option, alpha will automatically reverse items that correlate negatively with the total score.

The bfi data set has 25 “Big 5” personality items given as part of the SAPA project. To see the characteristics of this file, we can find its dimensions, describe it, and then look up the item content from an associated dictionary (using the lookupFromKeys function from psychTools, which requires a dictionary of the items).

#first show the correlations and the item contents
dim(bfi)
## [1] 2800   28
describe(bfi[1:5])
##    vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1    1 2784 2.41 1.41      2    2.23 1.48   1   6     5  0.83    -0.31 0.03
## A2    2 2773 4.80 1.17      5    4.98 1.48   1   6     5 -1.12     1.05 0.02
## A3    3 2774 4.60 1.30      5    4.79 1.48   1   6     5 -1.00     0.44 0.02
## A4    4 2781 4.70 1.48      5    4.93 1.48   1   6     5 -1.03     0.04 0.03
## A5    5 2784 4.56 1.26      5    4.71 1.48   1   6     5 -0.85     0.16 0.02
lowerCor(bfi[1:5])
##    A1    A2    A3    A4    A5   
## A1  1.00                        
## A2 -0.34  1.00                  
## A3 -0.27  0.49  1.00            
## A4 -0.15  0.34  0.36  1.00      
## A5 -0.18  0.39  0.50  0.31  1.00
lookupFromKeys(bfi.keys[1],dictionary=bfi.dictionary[1:4],n=5)
## $agree
##     ItemLabel                                      Item   Giant3          Big6
## A1-     q_146 Am indifferent to the feelings of others. Cohesion Agreeableness
## A2     q_1162         Inquire about others' well-being. Cohesion Agreeableness
## A3     q_1206               Know how to comfort others. Cohesion Agreeableness
## A4     q_1364                            Love children. Cohesion Agreeableness
## A5     q_1419                 Make people feel at ease. Cohesion Agreeableness

We now compare not reversing any items with reversing the negatively keyed item.

To reverse an item, subtract it from the sum of the maximum and minimum possible responses (in this case, for a 1 to 6 scale, subtract it from 7).
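The arithmetic can be illustrated with a toy response vector (not the bfi data):

```r
x <- c(1, 2, 6)   # responses on a 1 to 6 scale
x.rev <- 7 - x    # subtract from max + min = 7
x.rev             # 6 5 1
cor(x, x.rev)     # exactly -1: reversal flips the direction of scoring
```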

alpha(bfi[1:5])  #this will produce a terrible result
## Warning in alpha(bfi[1:5]): Some items were negatively correlated with the first principal component and probably 
## should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( A1 ) were negatively correlated with the first principal component and 
## probably should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option
## 
## Reliability analysis   
## Call: alpha(x = bfi[1:5])
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.43      0.46    0.53      0.14 0.85 0.016  4.2 0.74     0.32
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt      0.4  0.43  0.46
## Duhachek   0.4  0.43  0.46
## 
##  Reliability if an item is dropped:
##    raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
## A1      0.72      0.72    0.67     0.397 2.63   0.0087 0.0065 0.375
## A2      0.28      0.30    0.39     0.096 0.43   0.0219 0.1095 0.081
## A3      0.18      0.21    0.31     0.061 0.26   0.0249 0.1014 0.081
## A4      0.25      0.30    0.44     0.099 0.44   0.0229 0.1604 0.104
## A5      0.21      0.24    0.36     0.071 0.31   0.0238 0.1309 0.094
## 
##  Item statistics 
##       n raw.r std.r r.cor r.drop mean  sd
## A1 2784 0.066 0.024 -0.39  -0.31  2.4 1.4
## A2 2773 0.630 0.665  0.58   0.37  4.8 1.2
## A3 2774 0.724 0.742  0.71   0.48  4.6 1.3
## A4 2781 0.686 0.661  0.50   0.37  4.7 1.5
## A5 2784 0.700 0.719  0.64   0.45  4.6 1.3
## 
## Non missing response frequency for each item
##       1    2    3    4    5    6 miss
## A1 0.33 0.29 0.14 0.12 0.08 0.03 0.01
## A2 0.02 0.05 0.05 0.20 0.37 0.31 0.01
## A3 0.03 0.06 0.07 0.20 0.36 0.27 0.01
## A4 0.05 0.08 0.07 0.16 0.24 0.41 0.01
## A5 0.02 0.07 0.09 0.22 0.35 0.25 0.01
alpha(bfi[1:5], check.keys = TRUE)  # a much better result
## Warning in alpha(bfi[1:5], check.keys = TRUE): Some items were negatively correlated with the first principal component and were automatically reversed.
##  This is indicated by a negative sign for the variable name.
## 
## Reliability analysis   
## Call: alpha(x = bfi[1:5], check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean  sd median_r
##        0.7      0.46    0.53      0.14 0.85 0.009  4.7 0.9     0.32
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt     0.69   0.7  0.72
## Duhachek  0.69   0.7  0.72
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
## A1-      0.72      0.72    0.67     0.397 2.63   0.0087 0.0065 0.375
## A2       0.62      0.30    0.39     0.096 0.43   0.0119 0.1095 0.081
## A3       0.60      0.21    0.31     0.061 0.26   0.0124 0.1014 0.081
## A4       0.69      0.30    0.44     0.099 0.44   0.0098 0.1604 0.104
## A5       0.64      0.24    0.36     0.071 0.31   0.0111 0.1309 0.094
## 
##  Item statistics 
##        n raw.r std.r r.cor r.drop mean  sd
## A1- 2784  0.58 0.024 -0.39   0.31  4.6 1.4
## A2  2773  0.73 0.665  0.58   0.56  4.8 1.2
## A3  2774  0.76 0.742  0.71   0.59  4.6 1.3
## A4  2781  0.65 0.661  0.50   0.39  4.7 1.5
## A5  2784  0.69 0.719  0.64   0.49  4.6 1.3
## 
## Non missing response frequency for each item
##       1    2    3    4    5    6 miss
## A1 0.33 0.29 0.14 0.12 0.08 0.03 0.01
## A2 0.02 0.05 0.05 0.20 0.37 0.31 0.01
## A3 0.03 0.06 0.07 0.20 0.36 0.27 0.01
## A4 0.05 0.08 0.07 0.16 0.24 0.41 0.01
## A5 0.02 0.07 0.09 0.22 0.35 0.25 0.01

## Scoring multiple sets of scales at the same time

Much more typical is the investigator with multiple scales to score from multiple items. With a scoring key, you can tell R which items are scored in which direction on which scale. This is detailed in the help pages for the scoreItems function.

Given a data.frame or matrix of n items and N observations and a list of the directions to score them (a keys.list with k keys), find the sum scores or average scores for each person and each scale. In addition, report Cronbach’s alpha, Guttman’s Lambda 6, the average r, the scale intercorrelations, and the item by scale correlations (raw and corrected for item overlap). Replace missing values with the item median or mean if desired. Items may be keyed 1 (score it), -1 (reverse score it), or 0 (do not score it). Negatively keyed items will be reverse scored. Although prior versions used a keys matrix, it is now recommended to just use a list of scoring keys. See make.keys for a convenient way to make the keys file. If the input is a square matrix, it is assumed to be a covariance or correlation matrix, and scores are not found, but the item statistics are reported (similar functionality to cluster.cor). response.frequencies reports the frequency of endorsement of each response category for polytomous or multiple choice items.

The parameters of the scoreItems function are laid out below; the example that follows uses the bfi data set.

scoreItems(keys, items, totals = FALSE, ilabels = NULL, missing = TRUE, impute = "median",
    delete = TRUE, min = NULL, max = NULL, digits = 2, n.obs = NULL, select = TRUE)

First we need to see what the items in this data set are, then we need to create a set of keys that tell us which way to score the items.

data(bfi)
names(bfi)
##  [1] "A1"        "A2"        "A3"        "A4"        "A5"        "C1"        "C2"        "C3"       
##  [9] "C4"        "C5"        "E1"        "E2"        "E3"        "E4"        "E5"        "N1"       
## [17] "N2"        "N3"        "N4"        "N5"        "O1"        "O2"        "O3"        "O4"       
## [25] "O5"        "gender"    "education" "age"
keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
  conscientious=c("C1","C2","C3","-C4","-C5"),extraversion=c("-E1","-E2","E3","E4","E5"),
  neuroticism=c("N1","N2","N3","N4","N5"), openness = c("O1","-O2","O3","O4","-O5")) 
keys.list   #show this as a list
## $agree
## [1] "-A1" "A2"  "A3"  "A4"  "A5" 
## 
## $conscientious
## [1] "C1"  "C2"  "C3"  "-C4" "-C5"
## 
## $extraversion
## [1] "-E1" "-E2" "E3"  "E4"  "E5" 
## 
## $neuroticism
## [1] "N1" "N2" "N3" "N4" "N5"
## 
## $openness
## [1] "O1"  "-O2" "O3"  "O4"  "-O5"
lookupFromKeys(keys.list,dictionary=bfi.dictionary[1:5],5)   #show the items and their content
## $agree
##     ItemLabel                                      Item   Giant3          Big6   Little12
## A1-     q_146 Am indifferent to the feelings of others. Cohesion Agreeableness Compassion
## A2     q_1162         Inquire about others' well-being. Cohesion Agreeableness Compassion
## A3     q_1206               Know how to comfort others. Cohesion Agreeableness Compassion
## A4     q_1364                            Love children. Cohesion Agreeableness Compassion
## A5     q_1419                 Make people feel at ease. Cohesion Agreeableness Compassion
## 
## $conscientious
##     ItemLabel                                  Item    Giant3              Big6        Little12
## C1      q_124               Am exacting in my work. Stability Conscientiousness     Orderliness
## C2      q_530 Continue until everything is perfect. Stability Conscientiousness     Orderliness
## C3      q_619        Do things according to a plan. Stability Conscientiousness     Orderliness
## C4-     q_626       Do things in a half-way manner. Stability Conscientiousness Industriousness
## C5-    q_1949                        Waste my time. Stability Conscientiousness Industriousness
## 
## $extraversion
##     ItemLabel                                  Item     Giant3         Big6      Little12
## E1-     q_712                     Don't talk a lot. Plasticity Extraversion   Sociability
## E2-     q_901 Find it difficult to approach others. Plasticity Extraversion   Sociability
## E3     q_1205         Know how to captivate people. Plasticity Extraversion Assertiveness
## E4     q_1410                  Make friends easily. Plasticity Extraversion   Sociability
## E5     q_1768                          Take charge. Plasticity Extraversion Assertiveness
## 
## $neuroticism
##    ItemLabel                       Item    Giant3                Big6 Little12
## N1     q_952          Get angry easily. Stability Emotional Stability  Balance
## N2     q_974      Get irritated easily. Stability Emotional Stability  Balance
## N3    q_1099 Have frequent mood swings. Stability Emotional Stability  Balance
## N4    q_1479           Often feel blue. Stability Emotional Stability  Balance
## N5    q_1505              Panic easily. Stability Emotional Stability  Balance
## 
## $openness
##     ItemLabel                                      Item     Giant3     Big6  Little12
## O1      q_128                         Am full of ideas. Plasticity Openness Intellect
## O2-     q_316         Avoid difficult reading material. Plasticity Openness Intellect
## O3      q_492 Carry the conversation to a higher level. Plasticity Openness Intellect
## O4     q_1738          Spend time reflecting on things. Plasticity Openness  Openness
## O5-    q_1964     Will not probe deeply into a subject. Plasticity Openness  Openness

Using this keys.list we can then call scoreItems:

bfi.scores <- scoreItems(keys=keys.list,items=bfi,min=1,max=6)
bfi.scores  #show the summary statistics
## Call: scoreItems(keys = keys.list, items = bfi, min = 1, max = 6)
## 
## (Unstandardized) Alpha:
##       agree conscientious extraversion neuroticism openness
## alpha   0.7          0.72         0.76        0.81      0.6
## 
## Standard errors of unstandardized Alpha:
##       agree conscientious extraversion neuroticism openness
## ASE   0.014         0.014        0.013       0.011    0.017
## 
## Average item correlation:
##           agree conscientious extraversion neuroticism openness
## average.r  0.32          0.34         0.39        0.46     0.23
## 
## Median item correlation:
##         agree conscientious  extraversion   neuroticism      openness 
##          0.34          0.34          0.38          0.41          0.22 
## 
##  Guttman 6* reliability: 
##          agree conscientious extraversion neuroticism openness
## Lambda.6   0.7          0.72         0.76        0.81      0.6
## 
## Signal/Noise based upon av.r : 
##              agree conscientious extraversion neuroticism openness
## Signal/Noise   2.3           2.6          3.2         4.3      1.5
## 
## Scale intercorrelations corrected for attenuation 
##  raw correlations below the diagonal, alpha on the diagonal 
##  corrected correlations above the diagonal:
##               agree conscientious extraversion neuroticism openness
## agree          0.70          0.36         0.63      -0.245     0.23
## conscientious  0.26          0.72         0.35      -0.305     0.30
## extraversion   0.46          0.26         0.76      -0.284     0.32
## neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
## openness       0.15          0.19         0.22      -0.086     0.60
## 
##  Average adjusted correlations within and between scales (MIMS)
##               agree cnscn extrv nrtcs opnns
## agree          0.32                        
## conscientious  0.22  0.34                  
## extraversion   0.44  0.26  0.39            
## neuroticism   -0.20 -0.26 -0.28  0.46      
## openness       0.11  0.15  0.18 -0.08  0.23
## 
##  Average adjusted item x scale correlations within and between scales (MIMT)
##               agree cnscn extrv nrtcs opnns
## agree          0.68                        
## conscientious  0.18  0.69                  
## extraversion   0.33  0.19  0.71            
## neuroticism   -0.14 -0.18 -0.17  0.76      
## openness       0.10  0.12  0.14 -0.05  0.62
## 
##  In order to see the item by scale loadings and frequency counts of the data
##  print with the short option = FALSE
names(bfi.scores)  #what other objects are in bfi.scores?
##  [1] "scores"         "missing"        "alpha"          "av.r"           "sn"            
##  [6] "n.items"        "item.cor"       "cor"            "corrected"      "G6"            
## [11] "item.corrected" "response.freq"  "raw"            "ase"            "med.r"         
## [16] "keys"           "MIMS"           "MIMT"           "Call"

What does all this output mean? Most of it is obvious, except perhaps for the scale intercorrelations. Note how the matrix is not symmetric. We have ‘corrected’ the correlations above the diagonal for scale reliability. This is the correction for attenuation originally developed by Spearman in 1904 and is just the observed correlation divided by the square root of the product of the reliabilities.
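In symbols, the disattenuated correlation is

\[r_{x'y'} = \frac{r_{xy}}{\sqrt{r_{xx} r_{yy}}}\]

For example, for agree and extraversion the raw correlation is .46 and the alphas are .70 and .76, so the corrected value is \(.46/\sqrt{.70 \times .76} \approx .63\), the value shown above the diagonal.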

To use these scores, we can define a new object, bfi.scales, and then describe it and do a pairs.panels of the results. Note that in the pairs.panels function we specify the plotting character (pch) to be '.'.

bfi.scales <- bfi.scores$scores
describe(bfi.scales)
##               vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## agree            1 2800 4.65 0.89    4.8    4.73 0.89 1.0   6   5.0 -0.77     0.44 0.02
## conscientious    2 2800 4.27 0.95    4.4    4.31 0.89 1.0   6   5.0 -0.41    -0.16 0.02
## extraversion     3 2800 4.15 1.05    4.2    4.20 1.19 1.0   6   5.0 -0.48    -0.19 0.02
## neuroticism      4 2800 3.16 1.19    3.0    3.13 1.19 1.0   6   5.0  0.22    -0.65 0.02
## openness         5 2800 4.59 0.80    4.6    4.62 0.89 1.2   6   4.8 -0.34    -0.28 0.02
pairs.panels(bfi.scales,pch='.')

With this many data points, it is hard to see the density of the plot, so we do it again with the smoother option.

pairs.panels(bfi.scales,smoother=TRUE)

## Reliability based upon just the correlation matrix

Some of the more important statistics about reliability are actually summaries of the relationships between the variables. That is, we can find them from the correlations rather than from the raw data. These estimates will be "standardized", because they treat the data as if they were standardized. These procedures are particularly relevant for those of us who have a great deal of missing data.

First find the correlations of the items, and then apply alpha or scoreItems to the resulting correlation matrix.

R.ab <- lowerCor(ability)   #this shows the correlations
##           rsn.4 rs.16 rs.17 rs.19 ltt.7 lt.33 lt.34 lt.58 mt.45 mt.46 mt.47 mt.55 rtt.3 rtt.4
## reason.4  1.00                                                                               
## reason.16 0.28  1.00                                                                         
## reason.17 0.40  0.32  1.00                                                                   
## reason.19 0.30  0.25  0.34  1.00                                                             
## letter.7  0.28  0.27  0.29  0.25  1.00                                                       
## letter.33 0.23  0.20  0.26  0.25  0.34  1.00                                                 
## letter.34 0.29  0.26  0.29  0.27  0.40  0.37  1.00                                           
## letter.58 0.29  0.21  0.29  0.25  0.33  0.28  0.32  1.00                                     
## matrix.45 0.25  0.18  0.20  0.22  0.20  0.20  0.21  0.19  1.00                               
## matrix.46 0.25  0.18  0.24  0.18  0.24  0.23  0.27  0.21  0.33  1.00                         
## matrix.47 0.24  0.24  0.27  0.23  0.27  0.23  0.30  0.23  0.24  0.23  1.00                   
## matrix.55 0.16  0.15  0.16  0.15  0.14  0.17  0.14  0.23  0.21  0.14  0.21  1.00             
## rotate.3  0.23  0.16  0.17  0.18  0.18  0.17  0.19  0.24  0.16  0.15  0.20  0.18  1.00       
## rotate.4  0.25  0.20  0.20  0.21  0.23  0.21  0.21  0.27  0.17  0.17  0.20  0.18  0.53  1.00 
## rotate.6  0.25  0.20  0.27  0.19  0.20  0.21  0.19  0.26  0.15  0.20  0.18  0.17  0.43  0.45 
## rotate.8  0.21  0.16  0.18  0.16  0.13  0.14  0.15  0.22  0.16  0.15  0.17  0.19  0.43  0.44 
##          rtt.6 rtt.8
## rotate.6 1.00       
## rotate.8 0.42  1.00
R.bfi <- cor(bfi,use="pairwise")  #this does not show the bfi correlations

Then use alpha on the ability correlation matrix. Note that the standardized alpha and the raw alpha are now identical, because everything is being done on standardized scores.

ab.alpha <- alpha(R.ab)
ab.alpha  #show the results
## 
## Reliability analysis   
## Call: alpha(x = R.ab)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N median_r
##       0.83      0.83    0.84      0.23 4.9     0.21
## 
##     95% confidence boundaries 
##       lower alpha upper
## Feldt  0.68  0.83  0.93
## 
##  Reliability if an item is dropped:
##           raw_alpha std.alpha G6(smc) average_r S/N  var.r med.r
## reason.4       0.82      0.82    0.82      0.23 4.5 0.0060  0.21
## reason.16      0.82      0.82    0.83      0.24 4.7 0.0061  0.22
## reason.17      0.82      0.82    0.82      0.23 4.5 0.0058  0.21
## reason.19      0.82      0.82    0.83      0.24 4.6 0.0061  0.21
## letter.7       0.82      0.82    0.82      0.23 4.5 0.0057  0.21
## letter.33      0.82      0.82    0.83      0.24 4.6 0.0060  0.21
## letter.34      0.82      0.82    0.82      0.23 4.5 0.0056  0.21
## letter.58      0.82      0.82    0.82      0.23 4.5 0.0062  0.21
## matrix.45      0.83      0.83    0.83      0.24 4.7 0.0060  0.23
## matrix.46      0.82      0.82    0.83      0.24 4.7 0.0060  0.22
## matrix.47      0.82      0.82    0.83      0.24 4.6 0.0063  0.21
## matrix.55      0.83      0.83    0.83      0.24 4.8 0.0058  0.23
## rotate.3       0.82      0.82    0.82      0.23 4.6 0.0045  0.23
## rotate.4       0.82      0.82    0.82      0.23 4.5 0.0046  0.22
## rotate.6       0.82      0.82    0.82      0.23 4.5 0.0051  0.22
## rotate.8       0.82      0.82    0.83      0.24 4.6 0.0048  0.23
## 
##  Item statistics 
##              r r.cor r.drop
## reason.4  0.58  0.54   0.49
## reason.16 0.50  0.44   0.40
## reason.17 0.57  0.54   0.49
## reason.19 0.52  0.47   0.43
## letter.7  0.56  0.52   0.47
## letter.33 0.53  0.48   0.44
## letter.34 0.57  0.54   0.48
## letter.58 0.57  0.52   0.48
## matrix.45 0.48  0.42   0.38
## matrix.46 0.49  0.43   0.39
## matrix.47 0.52  0.47   0.43
## matrix.55 0.42  0.35   0.32
## rotate.3  0.54  0.51   0.45
## rotate.4  0.58  0.56   0.49
## rotate.6  0.56  0.53   0.47
## rotate.8  0.51  0.47   0.41

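Standardized alpha is a simple function of the number of items and their average intercorrelation, so we can reproduce the value reported above by hand (k = 16 items, average r = .23):

```r
# Standardized alpha from the average inter-item correlation:
#   alpha = k * r.bar / (1 + (k - 1) * r.bar)
k <- 16        # number of ability items
r.bar <- 0.23  # average_r reported by alpha()
alpha.std <- k * r.bar / (1 + (k - 1) * r.bar)
round(alpha.std, 2)   # 0.83, matching std.alpha in the output
```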
Now, do the same thing for the bfi correlations, but just show the summary.

bfi.cor.scores <- scoreItems(keys.list,items=R.bfi)
summary(bfi.cor.scores)  #compare to the prior results
## Call: NULL
## 
## Scale intercorrelations corrected for attenuation 
##  raw correlations below the diagonal, (standardized) alpha on the diagonal 
##  corrected correlations above the diagonal:
##               agree conscientious extraversion neuroticism openness
## agree          0.71          0.35         0.64      -0.242     0.25
## conscientious  0.25          0.73         0.36      -0.287     0.30
## extraversion   0.47          0.27         0.76      -0.278     0.35
## neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
## openness       0.16          0.20         0.24      -0.074     0.61
round(bfi.cor.scores$corrected - bfi.scores$corrected,2)
##               agree conscientious extraversion neuroticism openness
## agree          0.01         -0.01         0.01        0.00     0.02
## conscientious  0.00          0.01         0.00        0.02     0.00
## extraversion   0.01          0.00         0.00        0.01     0.03
## neuroticism    0.00          0.01         0.00        0.00     0.02
## openness       0.02          0.00         0.02        0.01     0.01

Perhaps obviously, doing the analysis on the correlation matrix does not allow us to find scores, nor to compute descriptive statistics of the scores.

## Further adventures with multiple scales

We can see from the correlation matrix that extraversion, neuroticism, and openness are barely correlated. What if we made up a scale of those 15 items and allowed the items to be flipped to be most consistent with the total score? (This is straightforward, but dangerous.)

The resulting \(\alpha\) of this false composite appears to be fairly good, but when we examine all possible splits, we discover that we probably should not have done this.

alpha(bfi[11:25],check.keys=TRUE)  #this reverses some items
## Warning in alpha(bfi[11:25], check.keys = TRUE): Some items were negatively correlated with the first principal component and were automatically reversed.
##  This is indicated by a negative sign for the variable name.
## 
## Reliability analysis   
## Call: alpha(x = bfi[11:25], check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N    ase mean  sd median_r
##       0.77      0.49    0.64     0.059 0.94 0.0064    3 0.7    0.051
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt     0.75  0.77  0.78
## Duhachek  0.75  0.77  0.78
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r med.r
## E1       0.75      0.52    0.66     0.072 1.08   0.0067 0.055 0.054
## E2       0.73      0.49    0.64     0.065 0.97   0.0073 0.051 0.044
## E3-      0.75      0.49    0.64     0.065 0.97   0.0068 0.054 0.054
## E4-      0.75      0.53    0.66     0.075 1.13   0.0069 0.052 0.051
## E5-      0.75      0.50    0.65     0.068 1.01   0.0067 0.055 0.054
## N1       0.75      0.39    0.56     0.043 0.63   0.0070 0.053 0.045
## N2       0.75      0.39    0.56     0.044 0.64   0.0069 0.053 0.051
## N3       0.74      0.38    0.56     0.042 0.61   0.0070 0.053 0.044
## N4       0.74      0.41    0.59     0.048 0.70   0.0073 0.053 0.045
## N5       0.75      0.42    0.60     0.048 0.71   0.0070 0.057 0.044
## O1-      0.76      0.49    0.64     0.065 0.97   0.0066 0.057 0.052
## O2       0.77      0.49    0.64     0.063 0.95   0.0064 0.060 0.044
## O3-      0.76      0.50    0.64     0.066 0.99   0.0067 0.055 0.052
## O4       0.77      0.46    0.63     0.057 0.85   0.0063 0.062 0.040
## O5       0.77      0.51    0.65     0.068 1.03   0.0064 0.059 0.051
## 
##  Item statistics 
##        n raw.r std.r    r.cor r.drop mean  sd
## E1  2777  0.48 0.132 -0.00098  0.350  3.0 1.6
## E2  2784  0.65 0.248  0.17609  0.548  3.1 1.6
## E3- 2775  0.49 0.254  0.16228  0.388  3.0 1.4
## E4- 2791  0.52 0.077 -0.04781  0.410  2.6 1.5
## E5- 2779  0.47 0.206  0.08779  0.358  2.6 1.3
## N1  2778  0.55 0.628  0.67944  0.438  2.9 1.6
## N2  2779  0.55 0.621  0.66725  0.435  3.5 1.5
## N3  2789  0.57 0.649  0.68340  0.453  3.2 1.6
## N4  2764  0.63 0.546  0.53908  0.529  3.2 1.6
## N5  2771  0.55 0.537  0.50165  0.430  3.0 1.6
## O1- 2778  0.38 0.253  0.13710  0.279  2.2 1.1
## O2  2800  0.36 0.279  0.15245  0.223  2.7 1.6
## O3- 2772  0.44 0.231  0.12721  0.335  2.6 1.2
## O4  2786  0.19 0.389  0.28391  0.079  4.9 1.2
## O5  2780  0.32 0.189  0.04838  0.203  2.5 1.3
## 
## Non missing response frequency for each item
##       1    2    3    4    5    6 miss
## E1 0.24 0.23 0.15 0.16 0.13 0.09 0.01
## E2 0.19 0.24 0.12 0.22 0.14 0.09 0.01
## E3 0.05 0.11 0.15 0.30 0.27 0.13 0.01
## E4 0.05 0.09 0.10 0.16 0.34 0.26 0.00
## E5 0.03 0.08 0.10 0.22 0.34 0.22 0.01
## N1 0.24 0.24 0.15 0.19 0.12 0.07 0.01
## N2 0.12 0.19 0.15 0.26 0.18 0.10 0.01
## N3 0.18 0.23 0.13 0.21 0.16 0.09 0.00
## N4 0.17 0.24 0.15 0.22 0.14 0.09 0.01
## N5 0.24 0.24 0.14 0.18 0.12 0.09 0.01
## O1 0.01 0.04 0.08 0.22 0.33 0.33 0.01
## O2 0.29 0.26 0.14 0.16 0.10 0.06 0.00
## O3 0.03 0.05 0.11 0.28 0.34 0.20 0.01
## O4 0.02 0.04 0.06 0.17 0.32 0.39 0.01
## O5 0.27 0.32 0.19 0.13 0.07 0.03 0.01
sp.bfi <- splitHalf(bfi[11:25],raw=TRUE)
## Warning in splitHalf(bfi[11:25], raw = TRUE): Some items were negatively correlated with total
## scale and were automatically reversed.
sp.bfi   #show the results
## Split half reliabilities  
## Call: splitHalf(r = bfi[11:25], raw = TRUE)
## 
## Maximum split half reliability (lambda 4) =  0.85
## Guttman lambda 6                          =  0.81
## Average split half reliability            =  0.76
## Guttman lambda 3 (alpha)                  =  0.76
## Guttman lambda 2                          =  0.78
## Minimum split half reliability  (beta)    =  0.36
## Average interitem r =  0.17  with median =  0.13
##                                              2.5% 50% 97.5%
##  Quantiles of split half reliability      =  0.59 0.77 0.83
hist(sp.bfi$raw,breaks=101)   #by specifying the breaks we can increase the resolution

By examining the split halves, we can see that some are as low as .36, which suggests that the scale is not measuring one thing. (It is not.) We will explore this problem of multidimensionality next.
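Each split-half coefficient reported above is the Spearman-Brown stepped-up value for the correlation between two half-tests. As a sketch (the half-test correlation of .61 here is hypothetical, chosen for illustration):

```r
# Spearman-Brown prophecy formula for a test of doubled length:
#   r.sb = 2 * r / (1 + r)
# where r is the correlation between the two half-tests.
r.half <- 0.61   # hypothetical correlation between two halves
r.sb <- 2 * r.half / (1 + r.half)
round(r.sb, 2)   # 0.76
```

A poor split (low r between halves) thus steps up to a low split-half reliability, which is why the minimum of .36 is so diagnostic of multidimensionality.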