---
title: '350:Week 6: More on Regression/Mediation/Moderation'
author: "William Revelle"
date: "04/29/2024"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=100)
```

# Check the version numbers

```{r}
library(psych)   #make it active
library(psychTools)
sessionInfo()    #psych should be 2.4.4
```

# Linear Regression from Raw data

A more complete demonstration of regression using the `lm` and `lmCor` functions. We also use the `partial.r`, `lowerCor`, `corPlot`, and `lowerUpper` functions.

First, get a simulated data set, describe it, and show the correlations as a heat map. We use a slightly more pleasing color gradient. (Taken from the help page for `corPlot`.)

```{r}
datafilename="http://personality-project.org/r/datasets/psychometrics.prob2.txt"
grad <- read.table(datafilename, header=TRUE)   #read the data file
describe(grad)   #describing the data first is important
gr <- colorRampPalette(c("#B52127", "white", "#2171B5"))
corPlot(grad, numbers=TRUE, gr=gr, main="The (simulated) grad data set")
```

The correlation reflects the *linear* relationship between two variables. For a pair of *standardized* variables, for any value of x we expect
$$\hat{y} = r_{xy} x.$$
The squared correlation is an index of how much of the *variance* in one variable is associated with variability in the other variable:
$$\sigma_{\hat{y}}^2 = r_{xy}^2 \sigma_y^2$$
and the unexplained *residual* variance will be
$$\sigma_y^2 - \sigma_{\hat{y}}^2 = (1 - r_{xy}^2) \sigma_y^2.$$

### Partial Correlation

The total relationship between the variables reflects some overlap of effects. We can find the *partial correlation* between variables by using the `partial.r` function. Just as the correlation is the ratio of the covariance to the square root of the product of the variances,
$$r_{xy} = \frac{cov_{xy}}{\sqrt{\sigma_x^2 \sigma_y^2}},$$
so the partial correlation is the reduced covariance divided by the square root of the product of the partial (residual) variances:
$$r_{xy.z} = \frac{r_{xy} - r_{xz}r_{yz}}{\sqrt{(1-r_{xz}^2)(1-r_{yz}^2)}}.$$
This is logically the same as correlating x with z removed and y with z removed; that is, the correlation of the residual x with the residual y.

We can find *partial correlations* using the `partial.r` function, which does this for the entire matrix at once.

```{r}
rp <- partial.r(grad)   #find the partial correlations
lowerMat(rp)            #show them in a table
corPlot(rp, numbers=TRUE, main="Partial Correlations")   #show them graphically
```

We can compare two correlation matrices using the `lowerUpper` function in combination with `corPlot`. We rotate the x-axis labels to be perpendicular to the axis using the `xlas` option.

```{r}
r <- lowerCor(grad)   #find the correlations, save the results
upper <- rp           #this is the matrix of partial correlations
both <- lowerUpper(lower=r, upper=upper)   #put them together
corPlot(both, numbers=TRUE, main="Correlations and partial correlations")
both <- lowerUpper(lower=r, upper=upper, diff=TRUE)   #now plot the differences
corPlot(both, numbers=TRUE, main="Differences between correlations and partial r", xlas=3)
```

## Using the linear model

What predicts performance on graduate GPA? From the correlation matrix, we see that all three GRE scores do a good job. But they are highly correlated, which is why the partial correlations were noticeably smaller. But still, what is the unique effect of each predictor?
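One way to see this is to work directly from the correlation matrix: the standardized regression weights are $\beta = R_{xx}^{-1} r_{xy}$. Here is a minimal sketch of that calculation (it assumes the `grad` data frame read in above, with columns named GREV, GREQ, GREA, and GPA); the weights it produces should match the standardized coefficients that `lmCor` reports later in this handout.

```{r}
R <- cor(grad[c("GREV", "GREQ", "GREA", "GPA")], use="pairwise")   #correlations among predictors and criterion
Rxx <- R[1:3, 1:3]        #predictor intercorrelations
rxy <- R[1:3, "GPA"]      #correlations of each predictor with the criterion
beta <- solve(Rxx, rxy)   #standardized regression weights: solve Rxx %*% beta = rxy
round(beta, 2)
```

The `lm` function finds the corresponding weights, in raw-score units, directly from the data: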
```{r}
mod1 <- lm(GPA ~ GREV + GREQ + GREA, data=grad)
summary(mod1)
```

## Using the various statistics reported by `lm`

As is true of most R functions, much more is returned than is displayed. We can see this using either the `str` or the `names` command.

```{r}
names(mod1)   #show the names of the objects.  Some of these have subobjects
```

We can examine the multiple correlation by looking at the original variables, the *fitted* (predicted) values, as well as the *residuals* (what is left over after we fit the data).

```{r}
modelfits <- data.frame(c(grad[2:4], grad[8]), fitted=mod1$fitted.values, residuals=mod1$residuals)
pairs.panels(modelfits, main="Predictors, criterion, fits and residuals")
```

Note how the residuals are uncorrelated with the predictors and the fitted values. This is as it should be, for if there were any linear information left over, we would have included it in our prediction.

### Regression diagnostics

The basic `lm` object has a number of diagnostic graphs that can be made.

```{r}
par(mfrow=c(2,2))
plot(mod1)
par(mfrow=c(1,1))
```

## Regression as a path model

We can also show relationships in terms of path models. Directed arrows are read as "causes", curved arrows as "correlated". The commands for the `lmCor` function parallel those of the `lm` function.

```{r}
mod2 <- lmCor(GPA ~ GREV + GREQ + GREA, data=grad)
summary(mod2)
```

These coefficients are very different from the `lm` coefficients. This is because `lmCor`, by default, reports the standardized coefficients. Do it again, with the std=FALSE option.

```{r}
mod3 <- lmCor(GPA ~ GREV + GREQ + GREA, data=grad, std=FALSE)
summary(mod3, digits=8)   #these are the unstandardized coefficients.  In the figure, they appear to be 0
summary(mod1)   #compare to the lm solution
```

Because the predictors (GRE scores) have so much more variance than the criterion (GPA), we could standardize the data first and then fit the linear model. Compare that solution to the original `lmCor` output.

```{r}
z.grad <- data.frame(scale(grad))   #scale converts to z scores
mod4 <- lm(GPA ~ GREV + GREQ + GREA, data=z.grad)
summary(mod4)   #compare p value to mod1 and the coefficients to mod2
```

## Mediation

We sometimes think that the effect of one variable is *mediated* by another variable. The classic discussion of mediation is in terms of the total effect of the predictor on the criterion (the c path), the path from the predictor to the mediator (the a path), the path from the mediator to the criterion (the b path), and the indirect effect of the predictor through the mediator (the a * b product). The direct effect c' is thus the total effect minus the indirect effect (c' = c - ab); a single-predictor sketch of this decomposition is shown at the end of this section. Mediation models may be fit using the `mediate` function. For our example data set, we can develop a simple model in which the effects of GREV and GREQ are *mediated* by GREA. We test this model, specifying the mediating variable in parentheses.

```{r}
mod5 <- mediate(GPA ~ GREV + GREQ +(GREA), data=grad, std=TRUE)
summary(mod5)
```

Since mediation is just a regression with some rearranged terms, it is useful to compare this figure to the one from `lmCor`:

```{r}
par(mfrow=c(1,2))   #make a two panel graph
mod2 <- lmCor(GPA ~ GREV + GREQ + GREA, data=grad)
mod <- mediate(GPA ~ GREV + GREQ +(GREA), data=grad, std=TRUE)
par(mfrow=c(1,1))   #set it back again
```

The primary difference between mediation analysis and regular regression is that mediation lets us find confidence intervals for the direct versus the total effects. This difference (c - c' = ab), because it is a product of two regression coefficients, does not have a simple sampling distribution and is therefore estimated by the bootstrap.
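Because the two-predictor figure can make the path algebra harder to see, here is a minimal single-predictor sketch of the decomposition (GREV as the predictor, GREA as the mediator, using only `lm` on the standardized scores). It shows that the direct effect c' is exactly the total effect c minus the indirect effect ab.

```{r}
zgrad <- data.frame(scale(grad))   #standardize so the paths are in correlation units
c.total <- unname(coef(lm(GPA ~ GREV, data=zgrad))["GREV"])    #c: total effect of GREV on GPA
a.path  <- unname(coef(lm(GREA ~ GREV, data=zgrad))["GREV"])   #a: predictor -> mediator
full    <- lm(GPA ~ GREV + GREA, data=zgrad)
b.path  <- unname(coef(full)["GREA"])   #b: mediator -> criterion, controlling for the predictor
c.prime <- unname(coef(full)["GREV"])   #c': direct effect, controlling for the mediator
round(c(c=c.total, ab=a.path * b.path, c.prime=c.prime, c.minus.ab=c.total - a.path * b.path), 3)
```

The point estimate of ab is easy to compute this way; what the bootstrap (below) adds is a confidence interval for it.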
### Using the bootstrap to find confidence intervals

Normal theory allows us to predict how much variation in an estimate we should expect from one random sample to another. Sometimes, however, the sampling distribution is not known. We can approximate it by forming new samples taken from our original sample. (That is, think of the original sample as the population, and then sample from that population.) When we do the bootstrap, each subject in the original sample is sampled with replacement. This means that some people will be sampled more than once, and some people will not be sampled at all. As a consequence, the expected number of unique individuals in each bootstrap sample will be about 63.2% of the original sample size. This is because the probability that any particular person is included at least once is
$$1-(1-1/x)^x,$$
which approaches
$$1 - 1/e \approx .632$$
as x gets larger. (Think about a sample of size x. The chance of being chosen on any one draw is 1/x, so the chance of not being chosen is 1 - 1/x. Someone is left out entirely only if they are not chosen on each of the x draws.) We can show this by using the `curve` function.

```{r}
curve(1-(1-1/x)^x, 1, 20)
```

## Moderation

Moderation is a fancy term for an interaction. Z is said to *moderate* the effect of X upon Y if the X effect depends upon the level of Z. This is just the interaction term in an ANOVA or in a linear model. Remember that we need to zero center the data before using the product term in the `lm` function. We can do this directly using `lmCor` or `mediate`. In this case, the data are automatically zero centered and then the product term is found.

### Using `lmCor` or `mediate`

We use the `sat.act` data set.

```{r}
mod6 <- lmCor(ACT ~ SATV * SATQ, data= sat.act)
mod6
```

Compare this to what happens if we do the regression on the z scores (centered and standardized) from sat.act.

```{r}
z.sat.act <- data.frame(scale(sat.act))   #center and standardize
mod7 <- lm(ACT ~ SATV * SATQ, data= z.sat.act)
summary(mod7)
mod8 <- lmCor(ACT ~ SATV * SATQ, data= z.sat.act)
mod8
```

The differences in the slopes of the regressions between models 6, 7, and 8 reflect what happens when we standardize the product terms, as well as the way the missing data are treated.

### An example of a function to do regression from correlations

The `lmCor` function calls the helper function `matReg` to do the basic work. We can examine how this works by just asking for the code. The function takes a y variable, an x variable, a covariance matrix, and the number of observations. These are all found by the `lmCor` function.

```{r}
matReg   #show the function
```

We can also show the code by examining the text file that creates it. This is available as a link on the syllabus at http://personality-project.org/courses/350/matReg.r

# Examine the lmCor.diagram

Another example of code is the `lmCor.diagram` function. It calls a number of diagram functions (part of the psych package). These primitives may be called by you if you want to create your own path diagram. The annotated code is available at http://personality-project.org/courses/350/lmCor.r . This is easier to read than asking for the code directly. Compare the two ways of doing this.

```{r}
lmCor.diagram   #show the actual code
```

Both of these functions can be examined by stepping through them using the `debug` function.
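For example, here is a minimal sketch of one way to do this interactively (the chunk is not evaluated when the document is knitted, since the debugger requires an interactive session):

```{r, eval=FALSE}
debug(lmCor)      #flag lmCor for debugging
lmCor(GPA ~ GREV + GREQ + GREA, data=grad)   #now runs one line at a time (n = next line, Q = quit)
undebug(lmCor)    #turn the flag off when done
```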