---
title: '350: Correlation and Regression'
author: "William Revelle"
date: "05/01/2024"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=100)
```

# Detecting multivariate outliers

If we describe a data set (e.g., *sat.act*) we can express all scores as standard scores, z:

$$z = \frac{X - \bar{X}}{\sqrt{Var_x}}$$

If the data are normally distributed, then we would expect the distribution of z to be normal and to have few observations with $|z| > 3$.

But what if we have multiple variables? We can find the *multivariate* equivalent of z by finding the *Mahalanobis distance*, which considers the squared distances from the centroid, weighting each variable so that the variables are treated as independent. We define $\Sigma$ as the variance/covariance matrix of the X variables. Its inverse, $\Sigma^{-1}$, weights each variable by the part that is independent of the other variables; the inverse of the correlation matrix partials each variable out of all the other variables. The result is that we are considering the independent effects:

$$D^2 = (X_{ij} - \bar{X}_j)' \Sigma^{-1} (X_{ij} - \bar{X}_j)$$

That is, find how much each subject differs from the mean on each variable (this generalizes the deviation score from the mean of a single variable to the deviation from the multivariate mean), and then weight those differences by the inverse of the variance/covariance matrix of the variables.

We will use the *outlier* function in *psych* to do this and to plot the results. Then we will examine how it does it. A similar calculation is done when we find the `cohen.d` between two groups on multiple variables. $D^2$ is analogous to the $R^2$ of multiple regression.

```{r}
library(psych)
library(psychTools)
d2 <- outlier(sat.act)  #finds the outliers and draws a graph
describe(d2)
```

The *outlier* function finds the $D^2$ statistic and then plots it against the expected values of a $\chi^2$ distribution. (Remember that a $\chi^2$ with one degree of freedom is just a squared z score.) The plot is known as a *qq plot* (quantile by quantile plot): it sorts the data and then plots the observed quantiles against the theoretical quantiles. We show this first for 10,000 normal deviates and their squares, plotted against the quantiles of $\chi^2$.

```{r chisq}
set.seed(42)
n.obs <- 10000
z <- rnorm(n.obs)  #n.obs normal deviates
z2 <- z^2          #and their squares
Chi2 <- qchisq(ppoints(n.obs), df = 1)
qqplot(Chi2, z2, main = "A quantile plot of squared normal deviates")
abline(0, 1, col = "gray")  #show the equality line
```

Now, we do it for our real data.

```{r}
n.obs <- NROW(d2)
nvar <- 6
Chi2 <- qchisq(ppoints(n.obs), df = nvar)
qqplot(Chi2, d2, main = expression("Q-Q plot of Mahalanobis" * ~D^2 * " vs. quantiles of" * ~chi[nvar]^2))
abline(0, 1, col = "gray")
```

Who are these unusual people? Join *sat.act* to the *d2* object using the *cbind* function, and plot it.

```{r}
sat.d2 <- cbind(sat.act, d2)  #combine the columns to make a new data frame
#sat.d2 <- data.frame(sat.act, d2)  #this does the same
pairs.panels(sat.d2, bg = c("yellow", "blue")[(d2 > 25) + 1], pch = 21, gap = 0)
```

## Sort the data so we can see the results more clearly

The *order* function will take a vector and return the index of the order of the cells. This can be done in either increasing or decreasing order.

```{r}
ord <- order(d2, decreasing = TRUE)
sat.d2 <- sat.d2[ord, ]  #rearrange the rows to be in the order of the d2 variable
headTail(sat.d2, top = 10, bottom = 5)
```

By inspection, it seems as if the first 4 subjects are probably incorrect; the others are somewhat questionable.
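We said we would examine how `outlier` does its work. As a rough check, here is a minimal sketch using the base-R `mahalanobis` function, restricted to the complete cases (so the values need not match the `outlier` output exactly for rows with missing data):

```{r mahalanobis-check}
sat.cc <- na.omit(sat.act)                  #complete cases only
d2.check <- mahalanobis(sat.cc,
                        center = colMeans(sat.cc),
                        cov = cov(sat.cc))  #squared distances from the centroid
round(head(sort(d2.check, decreasing = TRUE)), 2)  #the most extreme cases
```

This is the same calculation as the $D^2$ formula above: deviations from the multivariate mean, weighted by the inverse of the variance/covariance matrix.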
## Sort a data frame by a number of columns

The *order* function will give the order for one column. The *dfOrder* function will do this for multiple columns at once.

```{r}
new.sat <- dfOrder(sat.act)  #sort the entire file
headTail(new.sat, top = 10, bottom = 10)
```

# Effect size as an index of the difference between groups

There are many ways of reporting how two groups differ. Cohen's d statistic is just the difference of the means expressed in terms of the pooled within-group standard deviation. This is insensitive to sample size. r is a universal measure of effect size that is a simple function of d, but is bounded between -1 and 1. The t statistic is merely $d \sqrt{n}/2$ and thus reflects sample size.

Confidence intervals for Cohen's d may be found by converting the d to a t, finding the confidence intervals for t, and then converting those back to ds. This takes advantage of the uniroot function and the non-centrality parameter of the t distribution. See the help for `cohen.d`.

We use our `sat.act` data set for an example.

```{r}
library(psych)
cohen.d(sat.act, "gender")          #test on all the subjects
cohen.d(sat.act[1:40, ], "gender")  #and then just the first 40 subjects
#vs
t.test(education ~ gender, data = sat.act)
t.test(education ~ gender, data = sat.act[1:40, ])
#the M.dist statistic is biased by sample size
```

The `cohen.d` function returns all kinds of output. We just showed the summary.

```{r}
d <- cohen.d(sat.act, "gender")
names(d)
names(d$descriptive)
d$descriptive$mean
```

## Graphically showing the `cohen.d` results

`error.dots` takes the `cohen.d` output and displays it graphically.

```{r cohen.d}
error.dots(d, main = "effect of gender on the sat.act data")
abline(v = 0)  #add a line to show no difference
```

### Estimating bias by bootstrapping

The M.dist statistic (as reported by *cohen.d*) is sensitive to the number of variables and the number of subjects. We can find the distribution expected when the groups do not differ by simulation: randomly permute the group labels and recompute the statistic many times.

```{r}
#First create a small function
cohen.d.expected <- function(x, group, n.rep = 50) {
  summary <- list()  #we will put the results here
  n.obs <- nrow(x)
  observed <- cohen.d(x = x, group = group)$M.dist
  ind <- 1:n.obs
  for (i in 1:n.rep) {
    samp <- sample(ind, n.obs, replace = FALSE)  #a random permutation of the order variable
    x[, group] <- x[samp, group]
    summary[[i]] <- cohen.d(x, group)$M.dist
  }
  result <- unlist(summary)
  mean.boot <- mean(result)
  sd.boot <- sd(result)
  result <- list(observed = observed, mean = mean.boot, sd = sd.boot, trials = result)
  return(result)
}

test <- cohen.d.expected(sat.act, group = "gender", n.rep = 1000)
test$observed
describe(test$trials, quant = c(.9, .95, .99))
hist(test$trials, breaks = 41)

##do it again, but on a smaller sample
test <- cohen.d.expected(sat.act[1:40, ], group = "gender", n.rep = 1000)
test$observed
describe(test$trials, quant = c(.9, .95, .99))
hist(test$trials, breaks = 41)
```

We see that for 700 subjects, the confidence interval around 0 for the expected $D^2$ is much smaller than the value we observe. But for 40 subjects, although the observed $D^2$ is bigger, so is the range of chance results. We cannot reject the hypothesis that the groups do not differ, even though the $D^2$ is bigger.
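A simple way to summarize the permutation results just generated (here for the 40-subject sample) is an empirical p value: the proportion of permuted values that are at least as large as the observed $D^2$.

```{r}
#empirical p value from the permutation distribution (40-subject sample)
mean(test$trials >= test$observed)
```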
# Galton, Pearson, and the correlation coefficient

```{r}
library(psych)
data(galton)
galton.tab <- table(galton)  #make a table from the data
galton.tab[order(rank(rownames(galton.tab)), decreasing = TRUE), ]  #sort it by decreasing row value
```

## Show it graphically

```{r}
#show the scatter plot and the lowess fit
pairs.panels(galton, main = "Galton's Parent child heights")
#but this makes the regression lines look the same
pairs.panels(galton, lm = TRUE, main = "Galton's Parent child heights")
#better is to scale them
pairs.panels(galton, lm = TRUE, xlim = c(62, 74), ylim = c(62, 74), main = "Galton's Parent child heights")
pairs.panels(galton, lm = TRUE, xlim = c(62, 74), ylim = c(62, 74), cor = FALSE, main = "Galton's Parent child heights")
#now add some noise
pairs.panels(galton, lm = TRUE, xlim = c(62, 74), ylim = c(62, 74), cor = FALSE, main = "Galton's Parent child heights", jiggle = TRUE)
#and even more noise
pairs.panels(galton, lm = TRUE, xlim = c(62, 74), ylim = c(62, 74), cor = FALSE, main = "Galton's Parent child heights", jiggle = TRUE, factor = 4)
```

# Let's explore multiple regression

## An example of made-up data

We will create some data to have certain properties. We use the `sim.structural` function from `psych`.

```{r}
set.seed(42)
fx <- matrix(c(.9, .8, .6, rep(0, 4), .6, .8, -.7), ncol = 2)
fy <- matrix(c(.6, .5, .4), ncol = 1)
rownames(fx) <- c("GREV", "GREQ", "GREA", "nach", "Anx")
rownames(fy) <- c("gpa", "Pre", "MA")
Phi <- matrix(c(1, 0, .7, 0, 1, .7, .7, .7, 1), ncol = 3)
gre.gpa <- sim.structural(fx, Phi, fy, n = 1000)
grad.sim <- gre.gpa$observed
```

Or we can read the file from a remote server. (Comment these lines out when working without an internet connection, e.g., while flying.)

```{r}
datafilename <- "http://personality-project.org/r/datasets/psychometrics.prob2.txt"
grad <- read.table(datafilename, header = TRUE)  #read the data file using core R
grad <- read.file(datafilename)  #using psychTools
#remember to always do dim and describe to know your data
#grad <- gre.gpa$observed
dim(grad)
describe(grad)
```

### Notice how the simulated data are z scores

We can convert them to 'typical' scores by using the `rescale` function (from `psych`).

```{r}
sds <- c(100, 100, 100, 10, 10, 1, .5, .5)
means <- c(500, 500, 500, 50, 50, 10, 4, 3)
grad.sim <- rescale(grad.sim, mean = means, sd = sds)
describe(grad.sim)
```

### How does rescale work?

Look at the help menu to see what it does, or look at the code to see how it works.

```{r rescale}
rescale
```

### Now that we have rescaled data, let's continue

```{r}
gradq <- subset(grad, grad[2] > 700)  #choose the subset with GREQ > 700
gradq <- as.data.frame(gradq)
with(gradq, lm(GREV ~ GREQ))  #do the regression
op <- par(mfrow = c(1, 2))  #two-panel graph
with(grad, {
  plot(GREV ~ GREQ, xlim = c(200, 800), main = 'Original data', pch = 16)
  abline(lm(GREV ~ GREQ))
})
text(300, 500, 'r = .46 b = .56')
with(gradq, {
  plot(GREV ~ GREQ, ylim = c(200, 800), xlim = c(200, 800), main = 'GRE Q > 700', pch = 16)
  abline(lm(GREV ~ GREQ))
})
text(300, 500, 'r = .18 b = .50')
op <- par(mfrow = c(1, 1))  #switch back to one panel
```

## Another graphical display of correlations

Although `pairs.panels` is useful for smaller sets of variables, a heat map is useful for larger matrices. We use `corPlot` on the `bfi` data set.

```{r}
corPlot(bfi)
#now, do it again, with numeric values as well
corPlot(bfi, numbers = TRUE)
#perhaps cleaner is to just show the lower off diagonal
corPlot(bfi, numbers = TRUE, upper = FALSE)
```

We had scaled the printed numbers to reflect their *significance*.
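One way to examine the underlying tests directly is `corr.test`, which reports the correlations along with their sample sizes and p values (shown here for just the first five bfi items to keep the output short):

```{r}
ct <- corr.test(bfi[1:5])   #correlations, sample sizes, and p values
print(ct, short = FALSE)    #short = FALSE also shows the confidence intervals
```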
We can turn that scaling off with scale=FALSE.

```{r}
corPlot(bfi, numbers = TRUE, upper = FALSE, scale = FALSE)  #turn off the scaling
```

## Show the x-axis labels in vertical mode

Graphic functions have all kinds of extra parameters you can pass to them (see `?plot`). We can rotate the labels on the x axis to be perpendicular to the axis by using the las=3 option.

```{r}
corPlot(bfi, numbers = TRUE, upper = FALSE, scale = FALSE, las = 3)  #no scaling, vertical labels
```

## Showing *significance* in a more conventional manner

```{r}
corPlot(bfi[1:10], stars = TRUE, numbers = TRUE, upper = FALSE, diag = FALSE, las = 3)
```

We can find the confidence intervals of our correlations, either by bootstrap or by normal theory.

```{r}
bfi.ci <- corCi(bfi[1:10])
bfi.ci  #show it
corPlotUpperLowerCi(bfi.ci)
```

# Multiple regression / multiple correlation

We use the epi.bfi data set as a convenient example.

```{r epi.bfi}
describe(epi.bfi)
lowerCor(epi.bfi)
corPlot(epi.bfi)
```

## Multiple regression predicts Y from multiple Xs

$$\hat{Y} = \mu + \beta X$$

Two functions do this. `lm` (for linear model), in core R, requires complete data and gives very useful diagnostics. `lmCor` (in `psych`) does not require complete data, will work from a correlation matrix, and draws regression path diagrams, but it does not give the diagnostics of `lm`.

```{r lm}
mod <- lm(bdi ~ epiNeur + traitanx, data = epi.bfi)
mod           #just the coefficients
summary(mod)  #more useful information
```

Compare this to `lmCor`.

```{r lmCor}
mod1 <- lmCor(bdi ~ epiNeur + traitanx, data = epi.bfi)
mod1
summary(mod1)
```

## By default `lmCor` standardizes

To work with the raw (unstandardized) units instead, set std=FALSE.

```{r lmCor1}
mod1 <- lmCor(bdi ~ epiNeur + traitanx, data = epi.bfi, std = FALSE)
mod1
summary(mod1)
```

# Mediation and Moderation

If some DVs are predicted by some IVs, but the effect is thought to go through some intervening variable, we can think of that variable as *mediating* the IV-DV relationship. For instance, caffeine leads to an increase in cognitive performance, caffeine also increases arousal, and arousal is related to increased performance. This relationship may be seen in the *toy* example:

```{r}
toy <- structure(c(1, 0.6, 0.36, 0.6, 1, 0.6, 0.36, 0.6, 1), .Dim = c(3L, 3L),
                 .Dimnames = list(c("caffeine", "arousal", "performance"),
                                  c("caffeine", "arousal", "performance")))
lowerMat(toy)
```

We can test the mediation effect using the `mediate` function, which treats the model as a regression model but reports the *indirect* effect through the *mediator*. Note that mediation models logically require the possibility of mediation, but this is not part of the statistical model. We compare two mediation models: one that makes sense (arousal mediates the effect of caffeine) and one that does not (caffeine mediates the effect of arousal).

```{r}
#First show it as a regression model
lmCor(performance ~ caffeine, data = toy, n.obs = 100)            #the simple model
lmCor(performance ~ caffeine + arousal, data = toy, n.obs = 100)  #perhaps a better model

mod1 <- mediate(performance ~ caffeine + (arousal), data = toy)
mod1  #show the results

#But what if we change the order of the variables?
mod2 <- mediate(performance ~ arousal + (caffeine), data = toy, plot = FALSE)
mod2  #and now show the incorrect model
mediate.diagram(mod2, main = "An incorrect model")
```
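Because `toy` is just a correlation matrix, there are no raw observations from which `mediate` can bootstrap a confidence interval for the indirect effect. As a minimal sketch (assuming the MASS package is available; the sample size of 200 and the seed are arbitrary choices for illustration), we can simulate raw data that exactly reproduce the toy correlations and rerun the sensible model:

```{r}
set.seed(17)
raw.toy <- data.frame(MASS::mvrnorm(n = 200, mu = rep(0, 3), Sigma = toy,
                                    empirical = TRUE))  #sample correlations exactly match toy
colnames(raw.toy) <- colnames(toy)                      #keep the variable names
mod3 <- mediate(performance ~ caffeine + (arousal), data = raw.toy,
                n.iter = 500, plot = FALSE)             #500 bootstrap iterations
mod3  #the indirect (a*b) effect now comes with a bootstrapped confidence interval
```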