---
title: '350: Correlation and Regression'
author: "William Revelle"
date: "05/01/2024"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=100)
```

# Detecting multivariate outliers

If we describe a data set (e.g., *sat.act*) we can express all scores as standard scores, z:

$$z = \frac{X - \bar{X}}{\sqrt{Var_x}}$$

If the data are normally distributed, then we would expect the distribution of z to be normal and to have few observations with $|z| > 3$.

But what if we have multiple variables? We can find the *multivariate* equivalent of z by finding the *Mahalanobis distance*, which considers the squared distances from the centroid, weighting each variable so that the variables are treated as independent. We define $\Sigma$ as the variance/covariance matrix of the X variables. Its inverse, $\Sigma^{-1}$, weights each variable by the part that is independent of the other variables; the inverse of the correlation matrix partials each variable out of all the other variables. The result is that we are considering the independent effects:

$$D^2 = (X_{ij} - \bar{X}_j)' \Sigma^{-1} (X_{ij} - \bar{X}_j)$$

That is, find how much each subject differs from the mean on each variable (this generalizes the deviation score from the mean of a single variable to the deviation from the multivariate mean), and then weight those differences by the inverse of the variance/covariance matrix of the variables.

We will use the *outlier* function in *psych* to do this and to plot the results. Then we will examine how it does it. A similar calculation is done when we find the `cohen.d` between two groups on multiple variables. $D^2$ is analogous to the $R^2$ of multiple regression.

```{r}
library(psych)
library(psychTools)
d2 <- outlier(sat.act)  #finds the outliers and draws a graph
describe(d2)
```

The *outlier* function finds the $D^2$ statistic and then plots it against the expected values of a $\chi^2$ distribution. (Remember that a $\chi^2$ with one degree of freedom is just a squared z score.) The plot is known as a *qq plot* (quantile by quantile plot): it sorts the data and then plots the observed quantiles against the theoretical quantiles. We show this first for 10,000 normal deviates and their squares, plotted against the quantiles of $\chi^2$.

```{r chisq}
set.seed(42)
n.obs <- 10000
z <- rnorm(n.obs)  #n.obs normal deviates
z2 <- z^2          #and their squares
Chi2 <- qchisq(ppoints(n.obs), df = 1)
qqplot(Chi2, z2, main = "A quantile plot of squared normal deviates")
abline(0, 1, col = "gray")  #show the equality line
```

Now, we do it for our real data.

```{r}
n.obs <- NROW(d2)
nvar <- 6
Chi2 <- qchisq(ppoints(n.obs), df = nvar)
qqplot(Chi2, d2, main = expression("Q-Q plot of Mahalanobis" * ~D^2 * " vs. quantiles of" * ~chi[nvar]^2))
abline(0, 1, col = "gray")
```

Who are these unusual people? Join *sat.act* to the *d2* object using the *cbind* function, and plot it.

```{r}
sat.d2 <- cbind(sat.act, d2)  #combine the columns to make a new data frame
#sat.d2 <- data.frame(sat.act, d2)  #this does the same
pairs.panels(sat.d2, bg = c("yellow", "blue")[(d2 > 25) + 1], pch = 21, gap = 0)
```

## Sort the data so we can see the results more clearly

The *order* function will take a vector and return the index of the order of the cells. This can be done in either increasing or decreasing order.

```{r}
ord <- order(d2, decreasing = TRUE)
sat.d2 <- sat.d2[ord, ]  #rearrange the rows to be in the order of the d2 variable
headTail(sat.d2, top = 10, bottom = 5)
```

By inspection, it seems as if the first 4 subjects are probably incorrect; the others are somewhat questionable.
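We said we would examine how `outlier` does its work. As a rough check, here is a minimal sketch using the base-R `mahalanobis` function, restricted to the complete cases (so the values need not match the `outlier` output exactly for rows with missing data):

```{r mahalanobis-check}
sat.cc <- na.omit(sat.act)                  #complete cases only
d2.check <- mahalanobis(sat.cc,
                        center = colMeans(sat.cc),
                        cov = cov(sat.cc))  #squared distances from the centroid
round(head(sort(d2.check, decreasing = TRUE)), 2)  #the most extreme cases
```

This is the same calculation as the $D^2$ formula above: deviations from the multivariate mean, weighted by the inverse of the variance/covariance matrix.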
## Sort a data frame by a number of columns

The *order* function will give the order for one column. The *dfOrder* function will do this for multiple columns at once.

```{r}
new.sat <- dfOrder(sat.act)  #sort the entire file
headTail(new.sat, top = 10, bottom = 10)
```

# Effect size as an index of the difference between groups

There are many ways of reporting how two groups differ. Cohen's d statistic is just the difference of the means expressed in terms of the pooled within-group standard deviation. This is insensitive to sample size. r is a universal measure of effect size that is a simple function of d, but is bounded between -1 and 1. The t statistic is merely $d \sqrt{n}/2$ and thus reflects sample size.

Confidence intervals for Cohen's d may be found by converting the d to a t, finding the confidence intervals for t, and then converting those back to ds. This takes advantage of the uniroot function and the non-centrality parameter of the t distribution. See the help for `cohen.d`.

We use our `sat.act` data set for an example.

```{r}
library(psych)
cohen.d(sat.act, "gender")          #test on all the subjects
cohen.d(sat.act[1:40, ], "gender")  #and then just the first 40 subjects
#vs
t.test(education ~ gender, data = sat.act)
t.test(education ~ gender, data = sat.act[1:40, ])
#the M.dist statistic is biased by sample size
```

The `cohen.d` function returns all kinds of output. We just showed the summary.

```{r}
d <- cohen.d(sat.act, "gender")
names(d)
names(d$descriptive)
d$descriptive$mean
```

## Graphically showing the `cohen.d` results

`error.dots` takes the `cohen.d` output and displays it graphically.

```{r cohen.d}
error.dots(d, main = "effect of gender on the sat.act data")
abline(v = 0)  #add a line to show no difference
```

### Estimating bias by bootstrapping

The M.dist statistic (as reported by *cohen.d*) is sensitive to the number of variables and the number of subjects. We can find the distribution expected when the groups do not differ by simulation: randomly permute the group labels and recompute the statistic many times.

```{r}
#First create a small function
cohen.d.expected <- function(x, group, n.rep = 50) {
  summary <- list()  #we will put the results here
  n.obs <- nrow(x)
  observed <- cohen.d(x = x, group = group)$M.dist
  ind <- 1:n.obs
  for (i in 1:n.rep) {
    samp <- sample(ind, n.obs, replace = FALSE)  #a random permutation of the order variable
    x[, group] <- x[samp, group]
    summary[[i]] <- cohen.d(x, group)$M.dist
  }
  result <- unlist(summary)
  mean.boot <- mean(result)
  sd.boot <- sd(result)
  result <- list(observed = observed, mean = mean.boot, sd = sd.boot, trials = result)
  return(result)
}

test <- cohen.d.expected(sat.act, group = "gender", n.rep = 1000)
test$observed
describe(test$trials, quant = c(.9, .95, .99))
hist(test$trials, breaks = 41)

##do it again, but on a smaller sample
test <- cohen.d.expected(sat.act[1:40, ], group = "gender", n.rep = 1000)
test$observed
describe(test$trials, quant = c(.9, .95, .99))
hist(test$trials, breaks = 41)
```

We see that for 700 subjects, the confidence interval around 0 for the expected $D^2$ is much smaller than the value we observe. But for 40 subjects, although the observed $D^2$ is bigger, so is the range of chance results. We cannot reject the hypothesis that the groups do not differ, even though the $D^2$ is bigger.
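A simple way to summarize the permutation results just generated (here for the 40-subject sample) is an empirical p value: the proportion of permuted values that are at least as large as the observed $D^2$.

```{r}
#empirical p value from the permutation distribution (40-subject sample)
mean(test$trials >= test$observed)
```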
# Galton, Pearson, and the correlation coefficient

```{r}
library(psych)
data(galton)
galton.tab <- table(galton)  #make a table from the data
galton.tab[order(rank(rownames(galton.tab)), decreasing = TRUE), ]  #sort it by decreasing row value
```

## Show it graphically

```{r}
#show the scatter plot and the lowess fit
pairs.panels(galton, main = "Galton's Parent child heights")
#but this makes the regression lines look the same
pairs.panels(galton, lm = TRUE, main = "Galton's Parent child heights")
#better is to scale them
pairs.panels(galton, lm = TRUE, xlim = c(62, 74), ylim = c(62, 74), main = "Galton's Parent child heights")
pairs.panels(galton, lm = TRUE, xlim = c(62, 74), ylim = c(62, 74), cor = FALSE, main = "Galton's Parent child heights")
#now add some noise
pairs.panels(galton, lm = TRUE, xlim = c(62, 74), ylim = c(62, 74), cor = FALSE, main = "Galton's Parent child heights", jiggle = TRUE)
#and even more noise
pairs.panels(galton, lm = TRUE, xlim = c(62, 74), ylim = c(62, 74), cor = FALSE, main = "Galton's Parent child heights", jiggle = TRUE, factor = 4)
```

# Let's explore multiple regression

## An example of made-up data

We will create some data to have certain properties. We use the `sim.structural` function from `psych`.

```{r}
set.seed(42)
fx <- matrix(c(.9, .8, .6, rep(0, 4), .6, .8, -.7), ncol = 2)
fy <- matrix(c(.6, .5, .4), ncol = 1)
rownames(fx) <- c("GREV", "GREQ", "GREA", "nach", "Anx")
rownames(fy) <- c("gpa", "Pre", "MA")
Phi <- matrix(c(1, 0, .7, 0, 1, .7, .7, .7, 1), ncol = 3)
gre.gpa <- sim.structural(fx, Phi, fy, n = 1000)
grad.sim <- gre.gpa$observed
```

Or we can read the file from a remote server. (Comment these lines out when working without an internet connection, e.g., while flying.)

```{r}
datafilename <- "http://personality-project.org/r/datasets/psychometrics.prob2.txt"
grad <- read.table(datafilename, header = TRUE)  #read the data file using core R
grad <- read.file(datafilename)  #using psychTools
#remember to always do dim and describe to know your data
#grad <- gre.gpa$observed
dim(grad)
describe(grad)
```

### Notice how the simulated data are z scores

We can convert them to 'typical' scores by using the `rescale` function (from `psych`).

```{r}
sds <- c(100, 100, 100, 10, 10, 1, .5, .5)
means <- c(500, 500, 500, 50, 50, 10, 4, 3)
grad.sim <- rescale(grad.sim, mean = means, sd = sds)
describe(grad.sim)
```

### How does rescale work?

Look at the help menu to see what it does, or look at the code to see how it works.

```{r rescale}
rescale
```

### Now that we have rescaled data, let's continue

```{r}
gradq <- subset(grad, grad[2] > 700)  #choose the subset with GREQ > 700
gradq <- as.data.frame(gradq)
with(gradq, lm(GREV ~ GREQ))  #do the regression
op <- par(mfrow = c(1, 2))  #two-panel graph
with(grad, {
  plot(GREV ~ GREQ, xlim = c(200, 800), main = 'Original data', pch = 16)
  abline(lm(GREV ~ GREQ))
})
text(300, 500, 'r = .46 b = .56')
with(gradq, {
  plot(GREV ~ GREQ, ylim = c(200, 800), xlim = c(200, 800), main = 'GRE Q > 700', pch = 16)
  abline(lm(GREV ~ GREQ))
})
text(300, 500, 'r = .18 b = .50')
op <- par(mfrow = c(1, 1))  #switch back to one panel
```

## Another graphical display of correlations

Although `pairs.panels` is useful for smaller sets of variables, a heat map is useful for larger matrices. We use `corPlot` on the `bfi` data set.

```{r}
corPlot(bfi)
#now, do it again, with numeric values as well
corPlot(bfi, numbers = TRUE)
#perhaps cleaner is to just show the lower off diagonal
corPlot(bfi, numbers = TRUE, upper = FALSE)
```

We had scaled the printed numbers to reflect their *significance*.
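One way to examine the underlying tests directly is `corr.test`, which reports the correlations along with their sample sizes and p values (shown here for just the first five bfi items to keep the output short):

```{r}
ct <- corr.test(bfi[1:5])   #correlations, sample sizes, and p values
print(ct, short = FALSE)    #short = FALSE also shows the confidence intervals
```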
We can turn that scaling off with scale=FALSE.

```{r}
corPlot(bfi, numbers = TRUE, upper = FALSE, scale = FALSE)  #turn off the scaling
```

## Show the x-axis labels in vertical mode

Graphic functions have all kinds of extra parameters you can pass to them (see `?plot`). We can rotate the labels on the x axis to be perpendicular to the axis by using the las=3 option.

```{r}
corPlot(bfi, numbers = TRUE, upper = FALSE, scale = FALSE, las = 3)  #no scaling, vertical labels
```

## Showing *significance* in a more conventional manner

```{r}
corPlot(bfi[1:10], stars = TRUE, numbers = TRUE, upper = FALSE, diag = FALSE, las = 3)
```

We can find the confidence intervals of our correlations, either by bootstrap or by normal theory.

```{r}
bfi.ci <- corCi(bfi[1:10])
bfi.ci  #show it
corPlotUpperLowerCi(bfi.ci)
```

# Multiple regression / multiple correlation

We use the epi.bfi data set as a convenient example.

```{r epi.bfi}
describe(epi.bfi)
lowerCor(epi.bfi)
corPlot(epi.bfi)
```

## Multiple regression predicts Y from multiple Xs

$$\hat{Y} = \mu + \beta X$$

Two functions do this. `lm` (for linear model), in core R, requires complete data and gives very useful diagnostics. `lmCor` (in `psych`) does not require complete data, will work from a correlation matrix, and draws regression path diagrams, but it does not give the diagnostics of `lm`.

```{r lm}
mod <- lm(bdi ~ epiNeur + traitanx, data = epi.bfi)
mod           #just the coefficients
summary(mod)  #more useful information
```

Compare this to `lmCor`.

```{r lmCor}
mod1 <- lmCor(bdi ~ epiNeur + traitanx, data = epi.bfi)
mod1
summary(mod1)
```

## By default `lmCor` standardizes

To work with the raw (unstandardized) units instead, set std=FALSE.

```{r lmCor1}
mod1 <- lmCor(bdi ~ epiNeur + traitanx, data = epi.bfi, std = FALSE)
mod1
summary(mod1)
```

# Mediation and Moderation

If some DVs are predicted by some IVs, but the effect is thought to go through some intervening variable, we can think of that variable as *mediating* the IV-DV relationship. For instance, caffeine leads to an increase in cognitive performance, caffeine also increases arousal, and arousal is related to increased performance. This relationship may be seen in the *toy* example:

```{r}
toy <- structure(c(1, 0.6, 0.36, 0.6, 1, 0.6, 0.36, 0.6, 1), .Dim = c(3L, 3L),
                 .Dimnames = list(c("caffeine", "arousal", "performance"),
                                  c("caffeine", "arousal", "performance")))
lowerMat(toy)
```

We can test the mediation effect using the `mediate` function, which treats the model as a regression model but reports the *indirect* effect through the *mediator*. Note that mediation models logically require the possibility of mediation, but this is not part of the statistical model. We compare two mediation models: one that makes sense (arousal mediates the effect of caffeine) and one that does not (caffeine mediates the effect of arousal).

```{r}
#First show it as a regression model
lmCor(performance ~ caffeine, data = toy, n.obs = 100)            #the simple model
lmCor(performance ~ caffeine + arousal, data = toy, n.obs = 100)  #perhaps a better model

mod1 <- mediate(performance ~ caffeine + (arousal), data = toy)
mod1  #show the results

#But what if we change the order of the variables?
mod2 <- mediate(performance ~ arousal + (caffeine), data = toy, plot = FALSE)
mod2  #and now show the incorrect model
mediate.diagram(mod2, main = "An incorrect model")
```
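Because `toy` is just a correlation matrix, there are no raw observations from which `mediate` can bootstrap a confidence interval for the indirect effect. As a minimal sketch (assuming the MASS package is available; the sample size of 200 and the seed are arbitrary choices for illustration), we can simulate raw data that exactly reproduce the toy correlations and rerun the sensible model:

```{r}
set.seed(17)
raw.toy <- data.frame(MASS::mvrnorm(n = 200, mu = rep(0, 3), Sigma = toy,
                                    empirical = TRUE))  #sample correlations exactly match toy
colnames(raw.toy) <- colnames(toy)                      #keep the variable names
mod3 <- mediate(performance ~ caffeine + (arousal), data = raw.toy,
                n.iter = 500, plot = FALSE)             #500 bootstrap iterations
mod3  #the indirect (a*b) effect now comes with a bootstrapped confidence interval
```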