Simulating multivariate structures using R
The following examples shows how to simulate a multivariate structure with a particular measurement model and a particular structural model. This example produces data suitable for demonstrations of regression, correlation, factor analysis, or structural equation modeling. See the mvtnorm package for more elegant ways to simulate covariance matrices. The set of procedures shown here are meant to help the user think about structural models.
The basic logic is in terms of a measurement (factor) model relating observed variables to a set of unobserved factors. Then we have an effects model that describes how the latent variables are interrelated. (This is the basic logic of structural equation modeling, but of course, here we are doing it in reverse.)
First we create a function (mes) that does the work. Parameters to be passed to this function are a factor model (also known as a measurement model) relating how each item relates to a number of latent factors. Then we create an effects model, which is a set of path coefficients between the latent variables.
In this page, as well as most of my examples, the blue text can be copied directly into R.
mes <- function(fmodel,effect,numberofcases=1000) { # define a general function in terms of a factor model and an effects matrix numberofvariables <- dim(fmodel)[1] #problem size determined by input to the function numberoflatent <- dim(fmodel)[2] tmodel <- t(fmodel) #transpose of model # fmodel %*% tmodel #show the resulting measurement structure communality=diag(fmodel%*%tmodel) #find how much to weight true scores and errors given the measurement model uniqueness=1-communality errorweight=sqrt(uniqueness) errorweight=diag(errorweight) #how much to weight the errors latentscores <- matrix(rnorm(numberofcases*(numberoflatent)),numberofcases) #create true scores for the latent variables #round(cor(latentscores),2) #if uncommented, this shows the sample true score correlation matrix of the factors latentscores <- latentscores%*%effect #create true scores to reflect structural relations between the factors # round(cor(latentscores),2) #note that the factors are now correlated truescores <- latentscores %*% tmodel #round(cor(truescores),2) #show the true score correlation matrix (without error) error<- matrix(rnorm(numberofcases*(numberofvariables)),numberofcases) #create normal error scores error=error%*%errorweight observedscore=truescores+error allscores<- data.frame(observedscore,truescores) return(allscores) } #end of function mes
The first example is the classic multiple regression of two predictors and one criterion. We are assuming no errors of measurement and uncorrelated (in the population) predictors. The values of beta for X1 and X2 may be changed for alternative demonstrations. Note that the unique variance of the Y variable is specified in terms of what is not predicted by X1 and X2.
#example 1 2 predictors, 1 criterion variable no errors of measurement beta1 <- .5 beta2 <- .6 uniquey <- sqrt(1-(beta1^2 + beta2^2)) fmodel <- matrix(c (1,0,0, 0,1,0, 0,0,1), nrow=3,ncol=3,byrow=TRUE) effect <- matrix(c ( 1,0,beta1, 0,1,beta2, 0,0,uniquey),nrow=3,ncol=3,byrow=TRUE) observed.df <- mes(fmodel,effect) names(observed.df) <- c("X1","X2", "Y","T1","T2","Ty") round(var(observed.df),2) summary(lm(Y ~ X1+X2,data=observed.df))
This next example is merely two correlated predictors, still with no measurement error.
#example 2 2 predictors, 1 criterion variable no errors of measurement correlated predictors beta1 <- .5 beta2 <- .6 rx1x2 <- .5 uniquex2 <- sqrt(1-rx1x2) uniquey <- sqrt(1-(beta1^2 + beta2^2)) fmodel <- matrix(c (1,0,0, 0,1,0, 0,0,1), nrow=3,ncol=3,byrow=TRUE) effect <- matrix(c ( 1,rx1x2,beta1, 0,beta2,0, 0,0,uniquey),nrow=3,ncol=3,byrow=TRUE) observed.df <- mes(fmodel,effect) names(observed.df) <- c("X1","X2", "Y","T1","T2","Ty") round(var(observed.df),2) round(cor(observed.df),2) summary(lm(Y ~ X1+X2,data=observed.df))
Here we modify example 1 by introducing errors of measurement into the criterion variable. This does not affect the relationships between the true scores, but does between the observed scores.
#example 3 2 predictors, 1 criterion variable errors in measurement of criterion beta1 <- .5 beta2 <- .6 uniquey <- sqrt(1-(beta1^2 + beta2^2)) fmodel <- matrix(c (1,0,0, 0,1,0, 0,0,.5), nrow=3,ncol=3,byrow=TRUE) effect <- matrix(c ( 1,0,beta1, 0,1,beta2, 0,0,uniquey),nrow=3,ncol=3,byrow=TRUE) observed.df <- mes(fmodel,effect) names(observed.df) <- c("X1","X2", "Y","T1","T2","Ty") round(var(observed.df),2) round(cor(observed.df),2) summary(lm(Y ~ X1+X2,data=observed.df))
Now we introduce errors in measurement on the predictors. An interesting exercise is to vary the amount of error for the X1 and X2 variables. Also, compare the effect of changing the beta weights.
#example 4 2 predictors, 1 criterion variable errors in measurement of predictors beta1 <- .5 beta2 <- .6 uniquey <- sqrt(1-(beta1^2 + beta2^2)) fmodel <- matrix(c (.8,0,0, 0,.6,0, 0,0,1), nrow=3,ncol=3,byrow=TRUE) effect <- matrix(c ( 1,0,beta1, 0,1,beta2, 0,0,uniquey),nrow=3,ncol=3,byrow=TRUE) observed.df <- mes(fmodel,effect) names(observed.df) <- c("X1","X2", "Y","T1","T2","Ty") round(var(observed.df),2) round(cor(observed.df),2) summary(lm(Y ~ X1+X2,data=observed.df))
Now introduce errors in measurement into both predictors as well as the criterion variable.
#example 5 2 predictors, 1 criterion variable errors in measurement beta1 <- .5 beta2 <- .6 uniquey <- sqrt(1-(beta1^2 + beta2^2)) fmodel <- matrix(c (.8,0,0, 0,.6,0, 0,0,.5), nrow=3,ncol=3,byrow=TRUE) effect <- matrix(c ( 1,0,beta1, 0,1,beta2, 0,0,uniquey),nrow=3,ncol=3,byrow=TRUE) observed.df <- mes(fmodel,effect) names(observed.df) <- c("X1","X2", "Y","T1","T2","Ty") round(var(observed.df),2) round(cor(observed.df),2) summary(lm(Y ~ X1+X2,data=observed.df))
Now, finally, we can consider what happens in the case where we have errors of measurement in the predictors as well as the criterion, but more importantly, we have multiple indicators of all of the latent variables. This is, of course, an example of the general problem encountered in structural equation modeling. By having multiple indicators of the latent variables, we are able to estimate the errors of measurement that affected all of the previous models but that was not possible to estimate. For this case, notice that the factor model (fmodel) has more than one variable loading on each factor. The effects matrix is the same as before.
This particular example then rescales the observed score matrix to put the variables into more "realistic" terms. The model may be seen as having three estimates of ability, two of achievement, and three of performance. The particular example assumes that there are 3 measures of ability (GREV, GREQ, GREA), two measures of motivation (achievment motivation and anxiety), and three measures of performance (Prelims, GPA, MA). These titles are, of course, arbitrary and can be changed easily.
#example 6 9 variables on 3 factors, with the first two predicting the 3rd title<- "data set for Psychology 405: Psychometric Theory" #<---title goes here #measurement (factor) model for 3 factors and 9 variables fmodel <- matrix(c (.9, 0, 0, .8, 0, 0, .7, .7, 0, 0, .6, 0, 0, -.8, 0, 0, 0, .7, 0, 0, .6, 0, 0, .5), nrow=numberofvariables,ncol=numberoflatent,byrow=TRUE) #structural model for 3 factors (the diagonals reflect unique variance, the off diagonals the structure coefficients) beta1 <- .7 beta2 <- .6 uniquey <- sqrt(1-(beta1^2 + beta2^2)) effect <- matrix(c ( 1,0,beta1, 0,1,beta2, 0,0,uniquey),nrow=3,ncol=3,byrow=TRUE) observedscore <- mes(fmodel,effect) #1000 subjects is the default round(cor(observedscore),2) #show the correlation matrix #give the data "realistic" properties GREV=round(observedscore[,1]*100+500,0) GREQ=round(observedscore[,2]*100+500,0) GREA=round(observedscore[,3]*100+500,0) Ach=round(observedscore[,4]*10+50,0) Anx=round(-observedscore[,5]*10+50,0) Prelim=round(observedscore[,6]+10,0) GPA=round(observedscore[,7]*.5+4,2) MA=round(observedscore[,8]*.5+3,1) data=data.frame(GREV,GREQ,GREA,Ach,Anx,Prelim,GPA,MA) summary(data) #basic summary statistics round(cor(data),2) #show the resulting correlations #it is, of course, identical to the previous one #example 6 9 variables on 3 factors, with the first two predicting the 3rd # The particular example assumes that there are 3 measures of ability (GREV, GREQ, GREA), two measures of motivation (achievment #motivation and anxiety), and three measures of performance (Prelims, GPA, MA). These titles are, of course, arbitrary and can #be changed easily. title<- "data set for Psychology 405: Psychometric Theory" #<---title goes here #measurement (factor) model for 3 factors and 9 variables fmodel <- matrix(c (.9, 0, 0, .8, 0, 0, .7, 0, 0, 0, .6, 0, 0, .8, 0, 0, 0, .7, 0, 0, .6, 0, 0, .5), nrow=numberofvariables,ncol=numberoflatent,byrow=TRUE) #structural model for 3 factors (the diagonals reflect unique variance, the off diagonals the structure coefficients) effect <- matrix(c(1,0,.7, 0,1,.6, 0,.0,.39),nrow=numberoflatent,byrow=TRUE) observedscore <- mes(fmodel,effect) #1000 subjects is the default round(cor(observedscore),2) #show the correlation matrix #give the data "realistic" properties sds <-c(100,100,100,10,10,1,.5,.5,rep(1,8)) means <- c(500,500,500,50,50,10,4,3,rep(0,8)) t <-observedscore *sds+means GREV=round(observedscore[,1]*100+500,0) GREQ=round(observedscore[,2]*100+500,0) GREA=round(observedscore[,3]*100+500,0) Ach=round(observedscore[,4]*10+50,0) Anx=round(-observedscore[,5]*10+50,0) Prelim=round(observedscore[,6]+10,0) GPA=round(observedscore[,7]*.5+4,2) MA=round(observedscore[,8]*.5+3,1) data=data.frame(GREV,GREQ,GREA,Ach,Anx,Prelim,GPA,MA) summary(data) #basic summary statistics round(cor(data),2) #show the resulting correlations #it is, of course, identical to the previous one #this data set has been saved and may be used for another analyses datafilename="http://personality-project.org/R/datasets/psychometrics.prob2.txt" dataset =read.table(datafilename,header=TRUE) #read the data file part of a short guide to R
Version of May 4, 2004 - revised October 22, 2005 to be somewhat more readable
William Revelle
Department of Psychology
Northwestern University