library(psych)
library(psychTools)

Exploratory and Confirmatory Factor Analysis

Exploratory factor analysis (EFA) is a technique for better understanding the complexity of your data. Confirmatory factor analysis (CFA) allows you to test hypotheses about the structure of your data. Both are useful procedures. We have previously discussed EFA; today we talk about CFA.

The lavaan package is both user friendly and very powerful

lavaan has been described as the gateway drug to R. It is a powerful system for doing latent variable analysis. It has a very helpful web page associated with it and an active user community.

You must first install lavaan, which lives on CRAN.
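The install can be done from within R with the standard install.packages() call (shown here only as a reminder; it needs to be run just once):

install.packages("lavaan")   #one-time download from CRAN

Once installed, make it active: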

library(lavaan)  #we will use this one today
## This is lavaan 0.6-15
## lavaan is FREE software! Please report any bugs.
## 
## Attaching package: 'lavaan'
## The following object is masked from 'package:psych':
## 
##     cor2cov

Like most packages, lavaan has built-in example data sets. We will use the one from the help menu for lavaan. The data come from Holzinger and Swineford, who examined the performance of elementary school children on a number of cognitive tests. The example uses 9 of the variables. The complete data set and a description of it are in the holzinger.swineford dataset in the psychTools package.
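Before fitting any model it can help to glance at the nine variables themselves. A quick sketch, using describe() from psych and the fact (used again below) that columns 7 through 15 of HolzingerSwineford1939 hold x1 through x9:

describe(HolzingerSwineford1939[7:15])   #descriptive statistics for x1-x9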

See the much longer discussion in the class notes.

lavaan syntax is very straightforward:

Regressions
  y  ~ f1 + f2 + x1 + x2
  f1 ~ f2 + f3
  f2 ~ f3 + x1 + x2

Latent variables
  f1 =~ y1 + y2 + y3
  f2 =~ y4 + y5 + y6
  f3 =~ y7 + y8 + y9 + y10

Variances and covariances
  y1 ~~ y1
  y1 ~~ y2
  f1 ~~ f2

Intercepts
  y1 ~ 1
  f1 ~ 1

We enter the code into lavaan by quoting it as a string:

  model <- 'here is some lavaan code'
  

Let's do this for the 9 variables in the HolzingerSwineford1939 data set.

# The Holzinger and Swineford (1939) example
HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

fit <- lavaan(HS.model, data=HolzingerSwineford1939,
              auto.var=TRUE, auto.fix.first=TRUE,
              auto.cov.lv.x=TRUE)
summary(fit, fit.measures=TRUE)
## lavaan 0.6.15 ended normally after 35 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        21
## 
##   Number of observations                           301
## 
## Model Test User Model:
##                                                       
##   Test statistic                                85.306
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               918.852
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.931
##   Tucker-Lewis Index (TLI)                       0.896
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3737.745
##   Loglikelihood unrestricted model (H1)      -3695.092
##                                                       
##   Akaike (AIC)                                7517.490
##   Bayesian (BIC)                              7595.339
##   Sample-size adjusted Bayesian (SABIC)       7528.739
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.092
##   90 Percent confidence interval - lower         0.071
##   90 Percent confidence interval - upper         0.114
##   P-value H_0: RMSEA <= 0.050                    0.001
##   P-value H_0: RMSEA >= 0.080                    0.840
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.065
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.554    0.100    5.554    0.000
##     x3                0.729    0.109    6.685    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.113    0.065   17.014    0.000
##     x6                0.926    0.055   16.703    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.180    0.165    7.152    0.000
##     x9                1.082    0.151    7.155    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.408    0.074    5.552    0.000
##     speed             0.262    0.056    4.660    0.000
##   textual ~~                                          
##     speed             0.173    0.049    3.518    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.549    0.114    4.833    0.000
##    .x2                1.134    0.102   11.146    0.000
##    .x3                0.844    0.091    9.317    0.000
##    .x4                0.371    0.048    7.779    0.000
##    .x5                0.446    0.058    7.642    0.000
##    .x6                0.356    0.043    8.277    0.000
##    .x7                0.799    0.081    9.823    0.000
##    .x8                0.488    0.074    6.573    0.000
##    .x9                0.566    0.071    8.003    0.000
##     visual            0.809    0.145    5.564    0.000
##     textual           0.979    0.112    8.737    0.000
##     speed             0.384    0.086    4.451    0.000
lavaan.diagram(fit) #show the results

This solution fixes the loading of the first variable on each factor to 1 (the marker-variable approach). Alternatively, we can fix the variance of each factor to 1.

# The Holzinger and Swineford (1939) example
HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

fit <- lavaan(HS.model, data=HolzingerSwineford1939,
              auto.var=TRUE, auto.fix.first=TRUE,
              auto.cov.lv.x=TRUE,std.lv=TRUE)    #note this specification
summary(fit, fit.measures=TRUE)
## lavaan 0.6.15 ended normally after 20 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        21
## 
##   Number of observations                           301
## 
## Model Test User Model:
##                                                       
##   Test statistic                                85.306
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               918.852
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.931
##   Tucker-Lewis Index (TLI)                       0.896
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3737.745
##   Loglikelihood unrestricted model (H1)      -3695.092
##                                                       
##   Akaike (AIC)                                7517.490
##   Bayesian (BIC)                              7595.339
##   Sample-size adjusted Bayesian (SABIC)       7528.739
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.092
##   90 Percent confidence interval - lower         0.071
##   90 Percent confidence interval - upper         0.114
##   P-value H_0: RMSEA <= 0.050                    0.001
##   P-value H_0: RMSEA >= 0.080                    0.840
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.065
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                0.900    0.081   11.128    0.000
##     x2                0.498    0.077    6.429    0.000
##     x3                0.656    0.074    8.817    0.000
##   textual =~                                          
##     x4                0.990    0.057   17.474    0.000
##     x5                1.102    0.063   17.576    0.000
##     x6                0.917    0.054   17.082    0.000
##   speed =~                                            
##     x7                0.619    0.070    8.903    0.000
##     x8                0.731    0.066   11.090    0.000
##     x9                0.670    0.065   10.305    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.459    0.064    7.189    0.000
##     speed             0.471    0.073    6.461    0.000
##   textual ~~                                          
##     speed             0.283    0.069    4.117    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.549    0.114    4.833    0.000
##    .x2                1.134    0.102   11.146    0.000
##    .x3                0.844    0.091    9.317    0.000
##    .x4                0.371    0.048    7.779    0.000
##    .x5                0.446    0.058    7.642    0.000
##    .x6                0.356    0.043    8.277    0.000
##    .x7                0.799    0.081    9.823    0.000
##    .x8                0.488    0.074    6.573    0.000
##    .x9                0.566    0.071    8.003    0.000
##     visual            1.000                           
##     textual           1.000                           
##     speed             1.000
lavaan.diagram(fit) #show the results

Compare the CFA solution to the EFA solution from psych

f3 <- fa(HolzingerSwineford1939[7:15], 3) #specify we want three factors
## Loading required namespace: GPArotation
f3  #show the results
## Factor Analysis using method =  minres
## Call: fa(r = HolzingerSwineford1939[7:15], nfactors = 3)
## Standardized loadings (pattern matrix) based upon correlation matrix
##      MR1   MR3   MR2   h2   u2 com
## x1  0.20  0.59  0.03 0.48 0.52 1.2
## x2  0.04  0.51 -0.12 0.26 0.74 1.1
## x3 -0.06  0.69  0.02 0.45 0.55 1.0
## x4  0.85  0.02  0.01 0.73 0.27 1.0
## x5  0.89 -0.06  0.01 0.75 0.25 1.0
## x6  0.81  0.08 -0.01 0.69 0.31 1.0
## x7  0.04 -0.15  0.74 0.52 0.48 1.1
## x8 -0.03  0.13  0.69 0.52 0.48 1.1
## x9  0.03  0.38  0.46 0.46 0.54 2.0
## 
##                        MR1  MR3  MR2
## SS loadings           2.24 1.34 1.27
## Proportion Var        0.25 0.15 0.14
## Cumulative Var        0.25 0.40 0.54
## Proportion Explained  0.46 0.28 0.26
## Cumulative Proportion 0.46 0.74 1.00
## 
##  With factor correlations of 
##      MR1  MR3  MR2
## MR1 1.00 0.32 0.21
## MR3 0.32 1.00 0.26
## MR2 0.21 0.26 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 3 factors are sufficient.
## 
## df null model =  36  with the objective function =  3.05 with Chi Square =  904.1
## df of  the model are 12  and the objective function was  0.08 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic n.obs is  301 with the empirical chi square  7.87  with prob <  0.8 
## The total n.obs was  301  with Likelihood Chi Square =  22.56  with prob <  0.032 
## 
## Tucker Lewis Index of factoring reliability =  0.963
## RMSEA index =  0.054  and the 90 % confidence intervals are  0.016 0.088
## BIC =  -45.93
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR1  MR3  MR2
## Correlation of (regression) scores with factors   0.94 0.84 0.85
## Multiple R square of scores with factors          0.89 0.71 0.72
## Minimum correlation of possible factor scores     0.78 0.42 0.45
fa.diagram(f3) #show the results, suppress small loadings

fa.diagram(f3,simple=FALSE)  #don't suppress cross loadings

fa.diagram(f3,cut=.1) #suppress cross loadings
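One way to compare the CFA and EFA solutions numerically is a coefficient of factor congruence between the two loading matrices. This is a sketch, assuming the fit object from the std.lv=TRUE run above; lavInspect() is lavaan's model-matrix extractor and factor.congruence() comes from psych:

cfa.load <- lavInspect(fit, what="std")$lambda   #standardized CFA loadings
factor.congruence(f3$loadings, cfa.load)         #how similar are the EFA and CFA factors?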

Fixing parameters

Unlike EFA, CFA allows us to fix certain parameters to particular values.
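In lavaan syntax a parameter is fixed by pre-multiplying it with a numeric value. A minimal sketch (the particular values chosen here are only illustrations):

fixed.model <- ' visual  =~ 1*x1 + x2 + x3   #fix the loading of x1 to 1
                 textual =~ x4 + x5 + x6
                 visual  ~~ 0*textual '      #fix the visual-textual covariance to 0

Such a model is fit with the same cfa() call as before. The orthogonal=TRUE argument used next is a shortcut that fixes all of the covariances among the latent variables to 0 at once.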

Try the HS problem with orthogonal factors

 fit.HS.ortho <- cfa(HS.model, data=HolzingerSwineford1939, orthogonal=TRUE)
summary(fit.HS.ortho)
## lavaan 0.6.15 ended normally after 32 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        18
## 
##   Number of observations                           301
## 
## Model Test User Model:
##                                                       
##   Test statistic                               153.527
##   Degrees of freedom                                27
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.778    0.141    5.532    0.000
##     x3                1.107    0.214    5.173    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.133    0.067   16.906    0.000
##     x6                0.924    0.056   16.391    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.225    0.190    6.460    0.000
##     x9                0.854    0.121    7.046    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.000                           
##     speed             0.000                           
##   textual ~~                                          
##     speed             0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.835    0.118    7.064    0.000
##    .x2                1.065    0.105   10.177    0.000
##    .x3                0.633    0.129    4.899    0.000
##    .x4                0.382    0.049    7.805    0.000
##    .x5                0.416    0.059    7.038    0.000
##    .x6                0.369    0.044    8.367    0.000
##    .x7                0.746    0.086    8.650    0.000
##    .x8                0.366    0.097    3.794    0.000
##    .x9                0.696    0.072    9.640    0.000
##     visual            0.524    0.130    4.021    0.000
##     textual           0.969    0.112    8.640    0.000
##     speed             0.437    0.097    4.520    0.000

Because these are nested models, we can test whether they differ (they do).

anova(fit,fit.HS.ortho)
## 
## Chi-Squared Difference Test
## 
##              Df    AIC    BIC   Chisq Chisq diff   RMSEA Df diff Pr(>Chisq)    
## fit          24 7517.5 7595.3  85.305                                          
## fit.HS.ortho 27 7579.7 7646.4 153.527     68.222 0.26875       3  1.026e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The benefit of having fewer parameters to estimate was outweighed by the lack of fit.
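To see which indices drive that conclusion, we can pull a few fit measures from each model and put them side by side (a sketch using lavaan's fitMeasures(); the particular indices shown are just an illustration):

round(cbind(correlated = fitMeasures(fit, c("chisq","df","cfi","rmsea","bic")),
            orthogonal = fitMeasures(fit.HS.ortho, c("chisq","df","cfi","rmsea","bic"))), 3)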