---
title: "350.week9 Test theory"
author: "William Revelle"
date: "5/15/2023"
output:
  html_document: default
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=100)
```
```{r}
library(psych)
library(psychTools)
```
# Some basic test theory
The problem of how to construct scales and evaluate their quality is known as test theory. Although sometimes expressed in terms of many equations, the basic concepts are fairly straightforward. Developed by Charles Spearman (1904) and extended by many *psychometricians* over the next 120 years, one of its fundamental questions concerns the reliability of measurement.
Some useful readings include the discussion of reliability by Revelle and Condon (2019) and the discussion of scale construction by Revelle and Garner.
## Classic test theory: Observed scores = True scores + error
### Classical model of reliability
Reliability is the correlation of a test with a test just like it. It is also a way of decomposing the variance of a test.
• Observed = True + Error
• Reliability = $1- \frac{\sigma^{2}_{error}}{\sigma^{2}_{observed}}$
Reliability is the squared correlation with the domain: $r_{xx} = r_{x_{domain}}^{2}$.
But how do we find a test 'just like it'?
2. Reliability requires variance in the observed scores: as $\sigma_{observed}^{2}$ decreases, so does
$1- \frac{\sigma^{2}_{error}}{\sigma^{2}_{observed}}$
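The variance decomposition above can be illustrated with a small simulation (a base R sketch; the variable names are mine): generate true scores and independent error, and the variance-ratio definition of reliability agrees with the squared correlation of observed with true.

```{r reliability sketch}
set.seed(42)
n <- 10000
true <- rnorm(n)            #latent true scores
error <- rnorm(n)           #measurement error, independent of true
observed <- true + error    #classical model: Observed = True + Error
#reliability as 1 - error variance / observed variance
rel.var <- 1 - var(error)/var(observed)
#reliability as the squared correlation of observed with true
rel.cor <- cor(observed, true)^2
round(c(rel.var, rel.cor), 2)  #both approach .5 with these variances
```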
### Alternate estimates of reliability all share this need for variance
3.1 Internal consistency ($\alpha$, $\omega_h$): use `alpha` or `omega`.
These procedures examine the correlations within the test to predict what the correlation with another test *just like it* will be.
3.2 Alternate form: the correlation between two forms thought to measure the same construct.
3.3 Test-retest: takes the passage of time into account. Use `testRetest`.
3.4 Between raters (generalizability theory): use `ICC`.
4. Item difficulty is ignored; items are assumed to be sampled at random.
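The internal-consistency idea can be shown by computing coefficient $\alpha$ by hand, using the standard formula $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum \sigma^{2}_{i}}{\sigma^{2}_{total}}\right)$. This is only a base R sketch on simulated parallel items; the `alpha` function in `psych` does this and much more.

```{r alpha by hand}
set.seed(17)
n <- 500
k <- 6
true <- rnorm(n)
#k parallel items: each is the latent score plus independent noise
items <- sapply(1:k, function(i) true + rnorm(n))
total <- rowSums(items)
#Cronbach's alpha by hand: k/(k-1) * (1 - sum of item variances / variance of total)
alpha.by.hand <- (k/(k - 1)) * (1 - sum(apply(items, 2, var))/var(total))
alpha.by.hand
```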
## The "New" psychometrics
1. Model the person as well as the item
• People differ in some latent score ($\theta$)
• Items differ in difficulty ($\delta$)
2. Although the original model is a model of ability tests and item difficulty, this may be applied to the study of attitudes and temperament. We still call the latent trait $\theta$ which reflects the level of the unseen (latent) attribute.
• $p(correct | ability, difficulty, ...) = f(ability - difficulty)$
$p(endorsement | trait\,level, difficulty, ...) = f(trait\,level - difficulty)$
• What is the appropriate function?
3. Extensions to polytomous items, particularly rating scale model
# Classic Test Theory as 0 parameter IRT
## Psychological attributes are non-linear functions of probability
The basic assumption for any psychological attribute is that there is always someone higher than you have observed, and there is always someone lower. Most tests are bounded from 0 to 1 (0 to 100%), and thus the observed score has to be a non-linear function of the underlying attribute.
Classic Test Theory considers all items to be random replicates of each other and the total (or average) score to be the appropriate measure of the underlying attribute. Items are thought to be endorsed (passed) with increasing probability as a function of the underlying trait. But if the trait is unbounded (just as there is always the possibility of someone being higher than the highest observed score, there is a chance of someone being lower than the lowest), while the score is bounded (from p=0 to p=1), then the relationship between the latent score and the observed score must be non-linear. This leads to the simplest of all models, one that has no parameters to estimate but is just a non-linear mapping of latent to observed:
\begin{equation}
\tag{1}
p(correct | \theta) = \frac{1}{1+ e^{ - \theta }} .
\end{equation}
## This function is a `logistic` function
The logistic, when scaled by 1.702, is very close to the cumulative normal function. We show this using the `curve` function.
```{r logistic}
curve(logistic(x*1.702),-3,3,main="logistic (solid) versus cumulative normal (dotted)")
curve(pnorm(x),lty="dotted",add=TRUE)
```
Why do we compare this to the cumulative normal? If some attribute is normally distributed (remember the central limit theorem), then the probability of having an attribute value less than x is given by the cumulative normal. We show this using the `curve` function.
```{r}
curve(dnorm(x),-3,3,ylim=c(0,1),main="Normal density and its cumulative sum")
curve(pnorm(x), add=TRUE,lty="dotted")
```
# Item Response Theory
The Rasch model is a one parameter model (modeling item difficulty).
Given a person with attribute value, $\theta$, and a set of items with difficulty, $\delta$, we model the probability of endorsing an item as
\begin{equation}
\tag{2}
p(correct | \delta, \theta) = \frac{1}{1+ e^{\delta - \theta }} .
\end{equation}
The problem becomes one of estimating the person's attribute value given a pattern of responses.
The approach suggested by Rasch was to solve equation 2 by examining the pattern of successes (1) and failures (0) for a participant.
This treats all items as differing in difficulty, but having equal discrimination.
We do this with the `rasch` function in the `ltm` package.
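Before turning to `rasch`, equation 2 can be sketched directly in base R (the function names here are mine, for illustration): given known item difficulties, the ability that maximizes the likelihood of a response pattern can be found with `optimize`.

```{r rasch sketch}
#Rasch probability of a correct response (equation 2)
p.rasch <- function(theta, delta) 1/(1 + exp(delta - theta))

#log likelihood of a response pattern (1 = correct, 0 = incorrect)
#for a person with ability theta, given known item difficulties
loglik <- function(theta, responses, delta) {
  p <- p.rasch(theta, delta)
  sum(responses * log(p) + (1 - responses) * log(1 - p))
}

delta <- c(-2, -1, 0, 1, 2)    #five items, easy to hard
responses <- c(1, 1, 1, 0, 0)  #passed the easy items, failed the hard ones
#find the theta that maximizes the likelihood
theta.hat <- optimize(loglik, c(-4, 4), responses = responses,
                      delta = delta, maximum = TRUE)$maximum
theta.hat
```

Note that when $\theta = \delta$, the probability of passing is exactly .5; the `rasch` function in `ltm` estimates the difficulties themselves from the full data matrix.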
## The two parameter model
The problem with the Rasch model is that it assumes all items are equally discriminating. Adding a *discrimination* parameter ($\alpha$) gives a two parameter model:
\begin{equation}
\tag{3}
p(correct | \delta, \theta, \alpha) = \frac{1}{1+ e^{\alpha(\delta - \theta) }} .
\end{equation}
Using the `mirt` package, we can find these parameters.
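A quick sketch of equation 3 (base R; the function name is mine) shows what the discrimination parameter does: items with the same difficulty but larger $\alpha$ have steeper item characteristic curves.

```{r 2pl sketch}
#two parameter model (equation 3): alpha is the discrimination
p.2pl <- function(theta, delta, alpha) 1/(1 + exp(alpha * (delta - theta)))

theta <- seq(-3, 3, .1)
#three items with the same difficulty (0) but different discriminations
plot(theta, p.2pl(theta, 0, 2), type = "l", ylab = "p(correct)",
     main = "2PL items differing in discrimination")
lines(theta, p.2pl(theta, 0, 1), lty = "dashed")
lines(theta, p.2pl(theta, 0, .5), lty = "dotted")
```

All three curves cross at p = .5 when $\theta = \delta$; the more discriminating item separates people just above and just below its difficulty more sharply.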
## Factor analysis and 2 PL IRT
By finding *tetrachoric* or *polychoric* correlations of the items, there is a one-to-one relationship between the parameters of factor analysis and the results of an IRT analysis. This is shown using the `irt.fa` function in `psych`.
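One common statement of that mapping (for a single factor, normal-ogive metric) is that a loading $\lambda$ and item threshold $\tau$ give discrimination $\lambda/\sqrt{1-\lambda^2}$ and difficulty $\tau/\sqrt{1-\lambda^2}$. A minimal sketch (the function name is mine, not part of `psych`):

```{r fa to irt sketch}
#map a factor loading (lambda) and item threshold (tau) from a
#tetrachoric-based factor analysis to normal-ogive IRT parameters
fa2irt.sketch <- function(lambda, tau) {
  u <- sqrt(1 - lambda^2)  #square root of the item uniqueness
  c(discrimination = lambda/u, difficulty = tau/u)
}
fa2irt.sketch(lambda = .7, tau = .5)
```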
### Simulate some IRT type data
```{r}
sim.data <- sim.irt(nvar=9, n=1000)
names(sim.data)
sim.data$discrimination #by default, these are equal
sim.data$difficulty #by default, equally spaced
describe(sim.data$items)
irt.sim <- irt.fa(sim.data$items) #plot the information curves
plot.irt(irt.sim,type="ICC") #plot the Item Characteristic Curves
plot.irt(irt.sim, type ="test") #show test information
```
## Generalizing to polytomous items
Use the MSQ item set from last week
```{r}
keys <- list(onefactor = cs(active,alert, aroused, -sleepy,-tired, -drowsy,anxious, jittery,nervous, -calm,-relaxed, -at.ease),
energy=cs(active,alert, aroused, -sleepy,-tired, -drowsy),
tension =cs(anxious, jittery,nervous, -calm,-relaxed, -at.ease))
msq.items <- selectFromKeys(keys[1])
msq.items
```
Use those items from the full set of msq cases. First do it for all 12 items and one factor, and then do a 2 factor solution.
```{r irt}
msq.irt <- irt.fa(msq[msq.items])
#now show the two factor solution
msq.irt2 <- irt.fa(msq[msq.items] ,2)
```
Show the test information curves. First for the one factor solution, and then for the two factor solution.
```{r plot irt}
plot.irt(msq.irt, type="test")
plot.irt(msq.irt2, type="test")
```