
## What does a score mean?

After taking a personality survey or an ability test, one is naturally interested in the score one receives. What does the score mean? This is actually a very difficult question and has a number of related answers. These have to do with a) the score itself, b) the reliability of the measuring instrument, and c) the validity of the instrument. As an analogy, consider an archer shooting arrows at a target. The score is what is hit with each arrow, reliability is whether the same location is repeatedly hit by multiple arrows, and validity is how close to the target the arrows are hitting. Having high reliability does not ensure validity, but if there is no reliability, then the chance of repeatedly hitting the target with multiple arrows is essentially zero.

## Scores

At their simplest, scores are merely the average response to the items on a particular scale. (Some people prefer to report the total score, which reflects both the average response and the number of items in the survey. I find the average response a more useful score to report.) But to be told that one has an average of 4.0 on a set of extraversion items is not particularly informative. More useful is expressing that score with respect to the average of (presumably) similar people taking the same survey. An example of the distribution of scores on five personality scales is shown here.

An observed score minus the average score is called a deviation score (deviation = observed - mean). The average deviation score is of course 0. But knowing that the average is, say, 3.0, although helpful, does not allow one to know how far above the average a 4.0 is. That is, is a 1 point deviation score a little or a lot? That depends upon how much variation there is in the scores. A conventional index of variation is the standard deviation (equal to the square root of the variance, which is the average squared deviation from the mean). Alternative measures of dispersion include the average absolute deviation from the median. Dividing the observed deviation score by the standard deviation produces what is called a standard score, also known as a z-score. Because the mean deviation score is zero, the mean z-score will also be zero. If the distribution of scores is symmetric around the mean, then half of the deviation scores (and thus half of the z-scores) will be negative.
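As a rough illustration of the definitions above, the following Python sketch computes deviation scores and z-scores. The data are made up for the example, and the function name `standardize` is hypothetical. It uses the population standard deviation (`pstdev`), which matches the "average squared deviation" definition given here; some texts divide by one less than the number of scores instead.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Return (deviation scores, z-scores) for a list of observed scores."""
    m = mean(scores)
    # Population SD: square root of the average squared deviation from the mean.
    sd = pstdev(scores, mu=m)
    deviations = [x - m for x in scores]
    z = [d / sd for d in deviations]
    return deviations, z


# Hypothetical average item responses for eight people (made-up data).
scores = [2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0]
deviations, z = standardize(scores)

print("mean deviation:", round(mean(deviations), 6))
print("z-score for the last person (observed 4.0):", round(z[-1], 3))
```

In this made-up sample the mean is 3.0 and the standard deviation is about 0.71, so a 4.0 lies roughly 1.41 standard deviations above the mean. With a more spread-out group, the same 1 point deviation would yield a smaller z-score.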

If the distribution is roughly normal, z-scores typically will range from about -3 to 3. Because they represent deviation scores in ratio to the standard deviation, they are unit free. That is, a z-score is not expressed in units reflecting the items, but as a ratio of (item deviation scores)/(standard deviation). Without changing the meaning of scores, but in the hope of making them more understandable to many users, z-scores are sometimes transformed into seemingly more understandable numbers. Typical transformations are multiplying the z-score by 10 and adding 50 (so-called T scores, which are typical in discussing scores from the MMPI), multiplying the z-score by 15 and adding 100 (converting ability test scores into the "IQ" metric), or multiplying the z-score by 100 and adding 500 (the metric used by Educational Testing Service for the SAT and the GRE). For all of these transformations, equal differences between people will result in equal differences between the scores.

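| Name | Symbol | Transformation | Mean | Standard deviation |
|---|---|---|---|---|
| Raw score | X | -- | mean | sd |
| Deviation score | x | x = X - mean | 0 | sd |
| Standard (z) score | z | z = x/sd = (X - mean)/sd | 0 | 1 |
| T score | T | T = z * 10 + 50 | 50 | 10 |
| IQ score | IQ | IQ = z * 15 + 100 | 100 | 15 |
| SAT score | V | V = z * 100 + 500 | 500 | 100 |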
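The transformations tabulated above all have the same form: multiply the z-score by the new standard deviation and add the new mean. A minimal Python sketch follows; the function names are hypothetical and chosen only for this illustration.

```python
def rescale(z, new_mean, new_sd):
    """Linearly transform a z-score to a metric with the given mean and SD."""
    return z * new_sd + new_mean


def t_score(z):
    return rescale(z, 50, 10)


def iq_score(z):
    return rescale(z, 100, 15)


def sat_score(z):
    return rescale(z, 500, 100)


# One standard deviation above the mean, expressed in each metric.
z = 1.0
print(t_score(z), iq_score(z), sat_score(z))
```

Because each transformation is linear, a difference of one z-score unit between two people is always 10 T points, 15 IQ points, or 100 SAT points, wherever on the scale the two people fall.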

A transformation of the raw score that many users (but few psychologists) find easy to understand is the percentile score. This is merely an estimate of the number of people (out of 100) who have a score lower than the observed score. Unfortunately, the percentile score is a non-linear function of the observed or raw score. In other words, equal changes in observed score do not lead to equal changes in percentile scores. Similarly, equal changes in percentiles do not reflect equal changes in raw scores.

Percentiles can be estimated empirically by rank ordering all of the subject scores and then seeing what fraction have a score less than the observed score. However, if the scores of people taking the survey have a Gaussian or Normal distribution, then it is possible to estimate percentile scores from z-scores without actually sorting the observed scores. The conversion from z-scores to percentiles, assuming a Normal distribution, can be found in tables of the Normal distribution. Alternatively, a quick approximation of percentiles that agrees closely with the Normal distribution is percentile = 100/(1+exp(-1.7*z-score)).
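The three routes to a percentile described above can be compared in a short Python sketch. The function names and the sample data are hypothetical; the Normal percentile is computed from the standard library's error function rather than looked up in a table.

```python
import math


def empirical_percentile(observed, scores):
    """Percentage of scores strictly below the observed score."""
    below = sum(1 for s in scores if s < observed)
    return 100.0 * below / len(scores)


def normal_percentile(z):
    """Percentile implied by a z-score under a Normal distribution."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic_percentile(z):
    """Quick logistic approximation: 100 / (1 + exp(-1.7 * z))."""
    return 100.0 / (1.0 + math.exp(-1.7 * z))


# Compare the exact Normal percentile with the approximation.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(z, round(normal_percentile(z), 1), round(logistic_percentile(z), 1))

# Empirical percentile of a 4.0 in a small made-up sample.
sample = [2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0]
print(empirical_percentile(4.0, sample))
```

Over the usual range of z-scores the two curves stay within about one percentile point of each other. Equal steps in z do not give equal steps in percentile: the step from z = 0 to z = 1 covers far more percentile points than the step from z = 1 to z = 2.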

## Reliability

The reliability of a score is an indication of the extent to which an observed score can be expected to be the same if observed again. This "observed again" concept may mean the same test is given again (test-retest), a similar test is given again (alternate form), or different items from the same domain are given again (internal consistency). Generalizations of test reliability can be applied to scores assigned by different raters (inter-rater reliability).

You would not expect all scales to have all types of reliability. Measures of personality traits are expected to show stability (high test-retest correlations) across long periods of time. However, measures of mood or emotional state are not expected to show much stability across time. Both traits and states should show consistency across forms and across different items representing the same construct.

Individual items are typically thought of as samples from large or infinite domains of similar or related items. For example, when considering items as measures of vocabulary, or as ways of expressing extraversion, the question of validity may be thought of in terms of the average score on the set of all of the words in a language (the language domain) or the total set or domain of ways of expressing extraversion. That is, we can define your vocabulary by the total number of words that you can identify in the dictionary, or the probability of recognizing a word randomly chosen from the dictionary. Similarly, we can define a person's extraversion as the average probability of exhibiting any possible extraverted act. However, because the first domain (vocabulary) is very large and the second one is infinite, it is not possible to completely assess the domain score and we become more interested in a representative sample of the domain. Classic test theory emphasizes the correlation between scores on a set of items (of size N) randomly sampled from the much larger domain of items (of size K). In the case of vocabulary, the domain might be all possible words in a dictionary (K > 500,000); in the case of extraversion, the domain might be thought of as infinitely big. We can define validity as the correlation of a test with the domain, and reliability as the correlation of a test with another test formed of an equal number of items sampled from the same domain.

It is not difficult to show that, if we are indeed randomly sampling items from the domain, the correlation of a test of n items with another test of the same length is a function of the number of items in the test (n) and the average correlation (r) of each item with every other item:
alpha = nr/(1+(n-1)r)
This index, alpha, is also the squared correlation of the test with the domain.
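To see how the formula above behaves, here is a small Python sketch that evaluates alpha for several test lengths. The function name `alpha` and the value r = 0.2 are assumptions chosen for illustration, not figures from any particular scale.

```python
import math


def alpha(n_items, avg_r):
    """alpha = n*r / (1 + (n - 1)*r) for n items with average inter-item correlation r."""
    return n_items * avg_r / (1.0 + (n_items - 1) * avg_r)


# Hypothetical average inter-item correlation.
r = 0.2
for n in (1, 5, 10, 20):
    a = alpha(n, r)
    # The square root of alpha is the implied correlation of the test with the domain.
    print(n, round(a, 3), round(math.sqrt(a), 3))
```

With a single item alpha is just r, and it rises as items are added: with r = 0.2, ten items give an alpha of about 0.71 and twenty items about 0.83. Even modestly intercorrelated items can therefore form a reasonably reliable scale if there are enough of them.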

## Validity

(Coming soon! I hope.)