Detecting multivariate outliers

If we describe a data set (e.g., sat.act) we can express all scores as standard scores: \[z = \frac{X - \bar{X}}{\sqrt{Var_X}}\]

If the data are normally distributed, then we would expect the distribution of z to be normal and to have few observations with \(|z| > 3\).
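A quick way to see this in R is with the base scale function, which converts each column to z scores (a minimal sketch; describe is from psych):

library(psych)
library(psychTools)   #sat.act lives here in recent versions of psych
z <- scale(sat.act)   #convert each column to standard scores
describe(z)           #means of 0, sds of 1; look for values with |z| > 3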

But what if we have multiple variables? We can find the multivariate equivalent of z by finding the Mahalanobis distance, which considers the squared distances from the centroid, weighting each variable so that they are independent. We define \(\Sigma\) as the variance/covariance matrix of the X variables. The inverse of \(\Sigma\), written \(\Sigma^{-1}\), weights each variable by the independent parts of the variable.

The inverse of the correlation matrix partials out each variable from all the other variables. The result is that we are considering the independent effects:

\[D^2 = (X_{ij} - \bar{X}_j)' \Sigma^{-1} (X_{ij} - \bar{X}_j)\] That is, find how much each subject differs from the mean on each variable (this generalizes the deviation score from the univariate mean to the deviation from the multivariate mean), and then weight each difference by the inverse of the variance/covariance matrix of the variables.
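A minimal sketch of this computation, using the base R mahalanobis function on the complete cases (the outlier function used below also handles the missing data):

X  <- na.omit(sat.act)                 #complete cases only, for the hand calculation
S  <- cov(X)                           #Sigma, the variance/covariance matrix
D2 <- mahalanobis(X, colMeans(X), S)   #(X - Xbar)' Sigma^-1 (X - Xbar) for each row
head(sort(D2, decreasing = TRUE))      #the most extreme observations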

We will use the outlier function in psych to do this, and to plot the results. Then we will examine how it does it.

A similar calculation is done when we find the cohen.d between two groups on multiple variables.

\(D^2\) is analogous to the \(R^2\) of multiple regression.

library(psych)
library(psychTools)
d2 <- outlier(sat.act)  #finds the outliers and draws a graph

describe(d2)
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 700 5.99 4.59   4.86    5.25 2.86 0.71 42.3  41.6 2.75    12.19 0.17

The outlier function finds the \(D^2\) statistic and then plots it against the expected values of a \(\chi^2\) distribution. (Remember that a \(\chi^2\) with 1 df is just a squared z score.) The plot function used is known as a qq plot, for quantile-quantile plot: it sorts both sets of values and plots the quantiles of one against the quantiles of the other.

We show this first for 10,000 normal deviates and their squares, plotted against the quantiles of \(\chi^2_1\).

set.seed(42)
n.obs <- 10000
z <- rnorm(n.obs) #n.obs  normal deviates
z2 <- z^2  #and their squares
Chi2 <- qchisq(ppoints(n.obs), df =1)
qqplot(Chi2,z2,main="A quantile plot of squared normal deviates")
abline(0, 1, col = "gray") #show the equality line

Now, we do it for our real data.

n.obs <- NROW(d2)
nvar <- 6
Chi2 <- qchisq(ppoints(n.obs), df=nvar)

qqplot(Chi2, d2, main = expression("Q-Q plot of Mahalanobis" * 
            ~D^2 * " vs. quantiles of" * ~chi[nvar]^2))
abline(0, 1, col = "gray")

Who are these unusual people? Join sat.act to the d2 object using the cbind function, and plot it.

sat.d2 <- cbind(sat.act,d2) #combine the columns to make a new variable 
#sat.d2 <- data.frame(sat.act,d2)  #this does the same
pairs.panels(sat.d2,bg=c("yellow","blue")[(d2 > 25)+1],pch=21,gap=0)

Sort the data so we can see the results more clearly

The order function will take a vector and return the index of the order of the cells. This can be done either in an increasing or decreasing order.

ord <- order(d2,decreasing=TRUE)
sat.d2 <- sat.d2[ord,]   #rearrange them to be in the order of the d2 variable
headTail(sat.d2,top=10,bottom=5)
##       gender education age ACT SATV SATQ    d2
## 33337      2         3  23  35  200  200  42.3
## 36093      1         2  22   3  200  400 35.12
## 35186      2         1  17  35  350  230 34.65
## 30980      2         5  65  23  700  650 28.04
## 38502      2         5  61  30  300  350 26.29
## 35088      2         2  57  35  650  750 25.93
## 31299      1         1  45  35  450  500 23.99
## 30899      2         3  20  30  240  600 23.53
## 33547      1         2  55  25  550  700 23.51
## 29728      2         2  55  28  620  450 22.82
## ...      ...       ... ... ...  ...  ...   ...
## 39242      2         3  21  27  610 <NA>  0.94
## 35949      2         3  22  30  650  650   0.9
## 30455      2         4  29  29  600  600  0.86
## 33130      2         3  22  27  570  570  0.82
## 38899      2         3  22  28  600  600  0.71

By inspection, it seems as if the first 4 subjects are probably incorrect; the others are somewhat questionable.

Sort a data frame by a number of columns

The order function will give the order for one column. The dfOrder function will do this for multiple columns at once.

new.sat <-dfOrder(sat.act) #sort the entire file
headTail(new.sat,top=10,bottom=10)
##       gender education age ACT SATV SATQ
## 34895      1         0  14  30  600  600
## 37071      1         0  15  32  650  450
## 37067      1         0  16  20  450  450
## 34731      1         0  16  24  540  540
## 31730      1         0  16  24  800  800
## 31741      1         0  16  26  800  800
## 34284      1         0  16  35  700  750
## 37357      1         0  16  36  765  677
## 30444      1         0  17  22  490  530
## 36703      1         0  17  28  420  530
## ...      ...       ... ... ...  ...  ...
## 30030      2         5  52  24  630  570
## 37317      2         5  56  34  730  700
## 31367      2         5  57  23  650  400
## 31982      2         5  57  32  603  695
## 31758      2         5  58  26  740  700
## 32406      2         5  58  27  550  575
## 38502      2         5  61  30  300  350
## 31454      2         5  61  31  620  700
## 31376      2         5  62  36  750  790
## 30980      2         5  65  23  700  650

Effect size as an index of the difference between groups

There are many ways of reporting how two groups differ. Cohen’s d statistic is just the differences of means expressed in terms of the pooled within group standard deviation. This is insensitive to sample size.
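To make this concrete, here is a hand computation of d for one variable (a sketch; cohen.d, used below, does this for every column and adds confidence intervals):

with(sat.act, {
  m <- tapply(SATQ, gender, mean, na.rm=TRUE)   #the two group means
  v <- tapply(SATQ, gender, var, na.rm=TRUE)    #the within group variances
  n <- tapply(!is.na(SATQ), gender, sum)        #group sizes (SATQ has missing values)
  s.pooled <- sqrt(((n[1]-1)*v[1] + (n[2]-1)*v[2])/(sum(n)-2))
  (m[2] - m[1])/s.pooled     #roughly -.35, as in the SATQ line of cohen.d below
})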

r is a universal measure of effect size that is a simple function of d, but is bounded by -1 and 1.

The t statistic is merely d * sqrt(n)/2 and thus reflects sample size.
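These conversions are easy to check directly (a sketch, assuming two equal-sized groups and n subjects in all):

d <- 0.35                #the size of the SATQ effect
n <- 700
r <- d/sqrt(d^2 + 4)     #about .17; bounded by -1 and 1
t <- d*sqrt(n)/2         #t grows with sqrt(n) even when d does not
round(c(d=d, r=r, t=t), 2)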

Confidence intervals for Cohen’s d may be found by converting the d to a t, finding the confidence intervals for t, and then converting those back to d. This takes advantage of the uniroot function and the non-centrality parameter of the t distribution.

See cohen.d
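A rough sketch of that logic, with hypothetical helper names (the real function also handles unequal group sizes and the multivariate case):

d2t <- function(d, n) d*sqrt(n)/2    #d to t for two equal groups of n/2
t2d <- function(t, n) 2*t/sqrt(n)    #and back again
d.ci.sketch <- function(d, n, alpha=.05) {
  t.obs <- d2t(d, n)
  bound <- 10*max(1, abs(t.obs))     #a bracket wide enough to contain the roots
  lower <- uniroot(function(ncp) pt(t.obs, df=n-2, ncp=ncp) - (1 - alpha/2),
                   interval=c(-bound, bound))$root
  upper <- uniroot(function(ncp) pt(t.obs, df=n-2, ncp=ncp) - alpha/2,
                   interval=c(-bound, bound))$root
  c(lower=t2d(lower, n), d=d, upper=t2d(upper, n))
}
d.ci.sketch(-.35, 700)   #compare with the SATQ row of the cohen.d output below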

We use our sat.act data set for an example

library(psych)
cohen.d(sat.act,"gender")  # test on all the subjects
## Call: cohen.d(x = sat.act, group = "gender")
## Cohen d statistic of difference between two means
##           lower effect upper
## education  0.03   0.18  0.34
## age       -0.20  -0.04  0.11
## ACT       -0.23  -0.08  0.08
## SATV      -0.19  -0.04  0.12
## SATQ      -0.51  -0.35 -0.19
## 
## Multivariate (Mahalanobis) distance between groups
## [1] 0.52
## r equivalent of difference between two means
## education       age       ACT      SATV      SATQ 
##      0.09     -0.02     -0.04     -0.02     -0.17
cohen.d(sat.act[1:40,],"gender")  #and then just the first 40 subjects
## Call: cohen.d(x = sat.act[1:40, ], group = "gender")
## Cohen d statistic of difference between two means
##           lower effect upper
## education -0.15   0.49  1.12
## age       -0.60   0.03  0.65
## ACT       -0.49   0.14  0.76
## SATV      -0.61   0.02  0.64
## SATQ      -1.10  -0.47  0.17
## 
## Multivariate (Mahalanobis) distance between groups
## [1] 0.82
## r equivalent of difference between two means
## education       age       ACT      SATV      SATQ 
##      0.24      0.01      0.07      0.01     -0.23
#vs
 t.test(education ~ gender,data=sat.act)
## 
##  Welch Two Sample t-test
## 
## data:  education by gender
## t = -2.2299, df = 453.96, p-value = 0.02624
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -0.48935928 -0.03087916
## sample estimates:
## mean in group 1 mean in group 2 
##        2.995951        3.256071
 t.test(education ~ gender,data=sat.act[1:40,])   #the M.dist function is biased by sample size
## 
##  Welch Two Sample t-test
## 
## data:  education by gender
## t = -1.4356, df = 28.257, p-value = 0.1621
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -1.3601898  0.2389777
## sample estimates:
## mean in group 1 mean in group 2 
##        3.166667        3.727273

The cohen.d function returns all kinds of output. We just showed the summary.

d <- cohen.d(sat.act,"gender")
names(d)
##  [1] "cohen.d"     "hedges.g"    "M.dist"      "r"           "t"           "n"          
##  [7] "p"           "wt.d"        "descriptive" "se"          "dict"        "order"      
## [13] "Call"
names(d$descriptive)
##  [1] "mean"  "sd"    "n"     "F"     "ICC1"  "ICC2"  "ci1"   "ci2"   "raw"   "rbg"   "pbg"   "rwg"  
## [13] "nw"    "ci.wg" "pwg"   "etabg" "etawg" "nwg"   "nG"    "Call"
d$descriptive$mean
##   gender education      age      ACT     SATV     SATQ
## 1      1  2.995951 25.86235 28.78543 615.1134 635.8735
## 2      2  3.256071 25.44812 28.41722 610.6645 595.9955

Graphically showing the cohen.d results

error.dots takes the cohen.d output and displays it graphically.

error.dots(d,main="effect of gender on the sat.act data")
abline(v=0) #add a line to show no difference

Estimating bias by bootstrapping

The M.dist statistic (as reported by cohen.d) is sensitive to the number of variables and the number of subjects. We can find the range of values expected around a true effect size of 0 by simulation: randomly permute the group labels and recompute M.dist each time.

#First create a small function 
cohen.d.expected <- function(x, group, n.rep=50) {
  summary <- list()   #we will put the results here
  n.obs <- nrow(x)
  observed <- cohen.d(x=x, group=group)$M.dist   #the M.dist for the real grouping
  ind <- 1:n.obs
  for(i in 1:n.rep) {
    samp <- sample(ind, n.obs, replace=FALSE)    #a random permutation of the rows
    x[,group] <- x[samp, group]                  #shuffle the group labels
    summary[[i]] <- cohen.d(x, group)$M.dist     #the M.dist expected with no true effect
  }
  result <- unlist(summary)
  mean.boot <- mean(result)
  sd.boot <- sd(result)
  result <- list(observed=observed, mean=mean.boot, sd=sd.boot, trials=result)
  return(result)
}

test <- cohen.d.expected(sat.act,group="gender", n.rep=1000)
test$observed
## [1] 0.5158626
describe(test$trials,quant= c(.9,.95,.99))
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se Q0.9 Q0.95 Q0.99
## X1    1 1000 0.17 0.06   0.17    0.17 0.06 0.04 0.38  0.34 0.42     0.05  0 0.25  0.27  0.31
hist(test$trials,breaks=41)

##do it again, but on a smaller sample
test <- cohen.d.expected(sat.act[1:40,],group="gender", n.rep=1000)
test$observed
## [1] 0.8241425
describe(test$trials,quant= c(.9,.95,.99))
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis   se Q0.9 Q0.95 Q0.99
## X1    1 1000 0.76 0.26   0.73    0.75 0.26 0.16 1.97  1.81 0.56      0.3 0.01 1.12  1.24  1.43
hist(test$trials,breaks=41)

We see that for 700 subjects, the confidence interval around 0 for the expected \(D^2\) is much smaller than the value we observe. But for 40 subjects, although the observed \(D^2\) is bigger, so is the range of chance results: we cannot reject the hypothesis that the groups do not differ.
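A natural summary of this randomization test is the proportion of shuffled replications at least as large as the observed value:

mean(test$trials >= test$observed)  #a permutation test p value; large for the 40 subject case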

Galton, Pearson, and the correlation coefficient

library(psych)
data(galton)
galton.tab <- table(galton) #make a table from the data
galton.tab[order(rank(rownames(galton.tab)),decreasing=TRUE),] #sort it by decreasing row values
##       child
## parent 61.7 62.2 63.2 64.2 65.2 66.2 67.2 68.2 69.2 70.2 71.2 72.2 73.2 73.7
##   73      0    0    0    0    0    0    0    0    0    0    0    1    3    0
##   72.5    0    0    0    0    0    0    0    1    2    1    2    7    2    4
##   71.5    0    0    0    0    1    3    4    3    5   10    4    9    2    2
##   70.5    1    0    1    0    1    1    3   12   18   14    7    4    3    3
##   69.5    0    0    1   16    4   17   27   20   33   25   20   11    4    5
##   68.5    1    0    7   11   16   25   31   34   48   21   18    4    3    0
##   67.5    0    3    5   14   15   36   38   28   38   19   11    4    0    0
##   66.5    0    3    3    5    2   17   17   14   13    4    0    0    0    0
##   65.5    1    0    9    5    7   11   11    7    7    5    2    1    0    0
##   64.5    1    1    4    4    1    5    5    0    2    0    0    0    0    0
##   64      1    0    2    4    1    2    2    1    1    0    0    0    0    0

Show it graphically

#show the scatter plot and the lowess fit 
pairs.panels(galton,main="Galton's Parent child heights")  

#but this makes the regression lines look the same
pairs.panels(galton,lm=TRUE,main="Galton's Parent child heights") 

 #better is to scale them 
pairs.panels(galton,lm=TRUE,xlim=c(62,74),ylim=c(62,74),main="Galton's Parent child heights") 

pairs.panels(galton,lm=TRUE,xlim=c(62,74),ylim=c(62,74),cor=FALSE,main="Galton's Parent child heights")

#now add some noise
 pairs.panels(galton,lm=TRUE,xlim=c(62,74),ylim=c(62,74),cor=FALSE,main="Galton's Parent child heights",jiggle=TRUE)

#and even more noise
 pairs.panels(galton,lm=TRUE,xlim=c(62,74),ylim=c(62,74),cor=FALSE,main="Galton's Parent child heights",jiggle=TRUE,factor=4)

Let's explore multiple regression

An example of made-up data

We will create some data to have certain properties. We use the sim.structural function from psych.

set.seed(42)
fx <-matrix(c( .9,.8,.6,rep(0,4),.6,.8,-.7),ncol=2)  
 fy <- matrix(c(.6,.5,.4),ncol=1)
 rownames(fx) <- c("GREV","GREQ","GREA","nach","Anx")
 rownames(fy)<- c("gpa","Pre","MA")
 Phi <-matrix( c(1,0,.7,.0,1,.7,.7,.7,1),ncol=3)
 gre.gpa <- sim.structural(fx,Phi,fy,n=1000)
 grad.sim <- gre.gpa$observed

Or we can read the file from a remote server.

 datafilename="http://personality-project.org/r/datasets/psychometrics.prob2.txt" 
grad <- read.table(datafilename,header=TRUE)  #read the data file using core R
grad <- read.file(datafilename)  #using psychTools
## Data from the .txt file http://personality-project.org/r/datasets/psychometrics.prob2.txt has been loaded.
# remember to always do dim and describe to know your data
#grad <- gre.gpa$observed
dim(grad)
## [1] 1000    9
describe(grad)
##        vars    n   mean     sd median trimmed    mad   min     max  range  skew kurtosis   se
## ID        1 1000 500.50 288.82 500.50  500.50 370.65   1.0 1000.00 999.00  0.00    -1.20 9.13
## GREV      2 1000 499.77 106.11 497.50  498.74 106.01 138.0  873.00 735.00  0.09    -0.07 3.36
## GREQ      3 1000 500.53 103.85 498.00  498.51 105.26 191.0  914.00 723.00  0.22     0.08 3.28
## GREA      4 1000 498.13 100.45 495.00  498.67  99.33 207.0  848.00 641.00 -0.02    -0.06 3.18
## Ach       5 1000  49.93   9.84  50.00   49.88  10.38  16.0   79.00  63.00  0.00     0.02 0.31
## Anx       6 1000  50.32   9.91  50.00   50.43  10.38  14.0   78.00  64.00 -0.14     0.14 0.31
## Prelim    7 1000  10.03   1.06  10.00   10.02   1.48   7.0   13.00   6.00 -0.02    -0.01 0.03
## GPA       8 1000   4.00   0.50   4.02    4.01   0.53   2.5    5.38   2.88 -0.07    -0.29 0.02
## MA        9 1000   3.00   0.49   3.00    3.00   0.44   1.4    4.50   3.10 -0.07    -0.09 0.02

Notice that the simulated data (grad.sim) are z scores.

We can convert them to ‘typical’ scores by using the rescale function (from psych).

sds <- c(100,100,100,10,10,1,.5,.5)
means <- c(500,500,500,50,50,10,4,3)
grad.sim <- rescale(grad.sim,mean=means,sd=sds)
describe(grad.sim)
##      vars    n mean    sd median trimmed    mad    min    max  range  skew kurtosis   se
## GREV    1 1000  500 100.0 498.53  499.31 100.39 186.68 844.79 658.11  0.08    -0.08 3.16
## GREQ    2 1000  500 100.0 493.67  499.42  99.44 230.65 836.79 606.14  0.07    -0.11 3.16
## GREA    3 1000  500 100.0 498.14  500.11  98.79 124.46 801.40 676.95 -0.01     0.07 3.16
## nach    4 1000   50  10.0  49.89   49.97   9.84  21.35  79.39  58.03  0.01    -0.05 0.32
## Anx     5 1000   50  10.0  49.92   49.96   9.63  16.04  82.15  66.11  0.01     0.32 0.32
## gpa     6 1000   10   1.0  10.02   10.00   1.04   6.87  12.98   6.12  0.01    -0.20 0.03
## Pre     7 1000    4   0.5   4.00    4.00   0.51   2.12   5.68   3.57  0.00     0.00 0.02
## MA      8 1000    3   0.5   3.00    3.00   0.50   1.17   4.64   3.47 -0.09     0.09 0.02

How does rescale work?

Look at the help menu to see what it does, or look at the code to see how it works.

rescale
## function (x, mean = 100, sd = 15, df = TRUE) 
## {
##     if (df) {
##         x <- data.frame(t(t(scale(x)) * sd + mean))
##     }
##     else {
##         x <- t(t(scale(x)) * sd + mean)
##     }
##     return(x)
## }
## <bytecode: 0x11bac9e08>
## <environment: namespace:psych>
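The t(t(...)) idiom deserves a note: R recycles vectors along the rows of a matrix, so transposing lines the columns up with the sd and mean vectors. A small demonstration:

x <- matrix(rnorm(12), ncol=3)
t(t(scale(x))*c(10, 20, 30) + c(100, 200, 300))  #column sds of 10, 20, 30 and means of 100, 200, 300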

Now that we have rescaled data, let's continue

gradq <- subset(grad,grad[2]>700)#choose the subset
gradq <- as.data.frame(gradq)
with(gradq,lm(GREV ~ GREQ)) #do the regression
## 
## Call:
## lm(formula = GREV ~ GREQ)
## 
## Coefficients:
## (Intercept)         GREQ  
##    583.3783       0.2307
op <- par(mfrow=c(1,2))  #two panel graph
 with(grad,{
 plot(GREV ~ GREQ,xlim=c(200,800),main='Original data', pch=16)
 abline(lm(GREV ~ GREQ))
 })
 text(300,500,'r = .46   b = .56')
 with(gradq,{
 plot(GREV ~ GREQ,ylim=c(200,800),xlim=c(200,800),main='GRE Q > 700',pch=16)
 abline(lm(GREV ~ GREQ))
 })
  text(300,500,'r = .18  b = .50')

 op <- par(mfrow=c(1,1)) #switch back to one panel

Another graphical display of correlations

Although pairs.panels is useful for smaller sets of variables, a heat map is useful for larger matrices.

We use corPlot on the bfi data set.

corPlot(bfi)
#now, do it again, with numeric values as well
corPlot(bfi,numbers=TRUE)

#perhaps cleaner is to just show the lower off diagonal
corPlot(bfi,numbers=TRUE,upper=FALSE)

We had scaled the numbers to reflect significance. Turn this option off.

corPlot(bfi,numbers=TRUE,upper=FALSE,scale=FALSE)  #Turn off the scaling

Show the x-axis labels in vertical mode

Graphics functions have all kinds of extra parameters you can pass to them (see ?plot). We can rotate the labels on the x axis to be perpendicular to it by using the las=3 option.

corPlot(bfi,numbers=TRUE,upper=FALSE,scale=FALSE,las=3)  #no scaling, vertical labels

Showing significance in a more conventional manner

corPlot(bfi[1:10],stars=TRUE,numbers=TRUE,upper=FALSE,diag=FALSE,las=3)

We can find the confidence intervals of our correlations, either by bootstrap or by normal theory.
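The normal theory approach rests on the Fisher r to z transformation, under which the sampling distribution is approximately normal with standard error \(1/\sqrt{n-3}\). A minimal sketch for a single correlation (the values are illustrative):

r <- .49                   #e.g., the A2-A3 correlation
n <- 2800                  #bfi has 2800 subjects; the pairwise n is slightly smaller
z <- atanh(r)              #Fisher r to z:  .5*log((1+r)/(1-r))
se <- 1/sqrt(n - 3)        #the standard error of z
round(tanh(z + c(-1.96, 1.96)*se), 2)  #back transform the limits to the r metric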

bfi.ci <- corCi(bfi[1:10])

bfi.ci  #show it
## Call:corCi(x = bfi[1:10])
## 
##  Coefficients and bootstrapped confidence intervals 
##    A1    A2    A3    A4    A5    C1    C2    C3    C4    C5   
## A1  1.00                                                      
## A2 -0.34  1.00                                                
## A3 -0.27  0.49  1.00                                          
## A4 -0.15  0.34  0.36  1.00                                    
## A5 -0.18  0.39  0.50  0.31  1.00                              
## C1  0.03  0.09  0.10  0.09  0.12  1.00                        
## C2  0.02  0.14  0.14  0.23  0.11  0.43  1.00                  
## C3 -0.02  0.19  0.13  0.13  0.13  0.31  0.36  1.00            
## C4  0.13 -0.15 -0.12 -0.15 -0.13 -0.34 -0.38 -0.34  1.00      
## C5  0.05 -0.12 -0.16 -0.24 -0.17 -0.25 -0.30 -0.34  0.48  1.00
## 
##  scale correlations and bootstrapped confidence intervals 
##       lower.emp lower.norm estimate upper.norm upper.emp    p
## A1-A2     -0.37      -0.37    -0.34      -0.30     -0.30 0.00
## A1-A3     -0.30      -0.30    -0.27      -0.22     -0.22 0.00
## A1-A4     -0.19      -0.19    -0.15      -0.11     -0.11 0.00
## A1-A5     -0.21      -0.22    -0.18      -0.14     -0.13 0.00
## A1-C1     -0.01      -0.01     0.03       0.06      0.06 0.18
## A1-C2     -0.02      -0.02     0.02       0.05      0.05 0.44
## A1-C3     -0.06      -0.06    -0.02       0.02      0.03 0.37
## A1-C4      0.09       0.09     0.13       0.17      0.17 0.00
## A1-C5      0.01       0.01     0.05       0.09      0.10 0.02
## A2-A3      0.45       0.45     0.49       0.52      0.52 0.00
## A2-A4      0.30       0.29     0.34       0.37      0.37 0.00
## A2-A5      0.34       0.35     0.39       0.42      0.42 0.00
## A2-C1      0.06       0.05     0.09       0.14      0.14 0.00
## A2-C2      0.08       0.09     0.14       0.18      0.18 0.00
## A2-C3      0.14       0.15     0.19       0.24      0.24 0.00
## A2-C4     -0.18      -0.18    -0.15      -0.11     -0.11 0.00
## A2-C5     -0.16      -0.16    -0.12      -0.08     -0.08 0.00
## A3-A4      0.33       0.33     0.36       0.39      0.40 0.00
## A3-A5      0.47       0.47     0.50       0.54      0.54 0.00
## A3-C1      0.06       0.05     0.10       0.14      0.15 0.00
## A3-C2      0.11       0.10     0.14       0.18      0.18 0.00
## A3-C3      0.09       0.09     0.13       0.17      0.17 0.00
## A3-C4     -0.16      -0.16    -0.12      -0.08     -0.09 0.00
## A3-C5     -0.19      -0.20    -0.16      -0.12     -0.11 0.00
## A4-A5      0.27       0.27     0.31       0.34      0.34 0.00
## A4-C1      0.05       0.05     0.09       0.13      0.12 0.00
## A4-C2      0.19       0.19     0.23       0.27      0.27 0.00
## A4-C3      0.09       0.09     0.13       0.17      0.17 0.00
## A4-C4     -0.18      -0.19    -0.15      -0.12     -0.12 0.00
## A4-C5     -0.28      -0.28    -0.24      -0.20     -0.21 0.00
## A5-C1      0.08       0.08     0.12       0.16      0.16 0.00
## A5-C2      0.08       0.07     0.11       0.15      0.15 0.00
## A5-C3      0.09       0.09     0.13       0.18      0.17 0.00
## A5-C4     -0.16      -0.16    -0.13      -0.09     -0.09 0.00
## A5-C5     -0.22      -0.21    -0.17      -0.12     -0.12 0.00
## C1-C2      0.38       0.38     0.43       0.47      0.47 0.00
## C1-C3      0.26       0.26     0.31       0.35      0.35 0.00
## C1-C4     -0.38      -0.38    -0.34      -0.29     -0.30 0.00
## C1-C5     -0.29      -0.29    -0.25      -0.20     -0.20 0.00
## C2-C3      0.32       0.32     0.36       0.40      0.40 0.00
## C2-C4     -0.42      -0.42    -0.38      -0.34     -0.34 0.00
## C2-C5     -0.33      -0.34    -0.30      -0.26     -0.26 0.00
## C3-C4     -0.38      -0.38    -0.34      -0.29     -0.28 0.00
## C3-C5     -0.38      -0.38    -0.34      -0.31     -0.31 0.00
## C4-C5      0.44       0.44     0.48       0.51      0.51 0.00
corPlotUpperLowerCi(bfi.ci)

Multiple Regression/multiple correlation

We use the epi.bfi data set as a convenient example.

describe(epi.bfi)
##          vars   n   mean    sd median trimmed   mad min max range  skew kurtosis   se
## epiE        1 231  13.33  4.14     14   13.49  4.45   1  22    21 -0.33    -0.06 0.27
## epiS        2 231   7.58  2.69      8    7.77  2.97   0  13    13 -0.57    -0.02 0.18
## epiImp      3 231   4.37  1.88      4    4.36  1.48   0   9     9  0.06    -0.62 0.12
## epilie      4 231   2.38  1.50      2    2.27  1.48   0   7     7  0.66     0.24 0.10
## epiNeur     5 231  10.41  4.90     10   10.39  4.45   0  23    23  0.06    -0.50 0.32
## bfagree     6 231 125.00 18.14    126  125.26 17.79  74 167    93 -0.21    -0.27 1.19
## bfcon       7 231 113.25 21.88    114  113.42 22.24  53 178   125 -0.02     0.23 1.44
## bfext       8 231 102.18 26.45    104  102.99 22.24   8 168   160 -0.41     0.51 1.74
## bfneur      9 231  87.97 23.34     90   87.70 23.72  34 152   118  0.07    -0.55 1.54
## bfopen     10 231 123.43 20.51    125  123.78 20.76  73 173   100 -0.16    -0.16 1.35
## bdi        11 231   6.78  5.78      6    5.97  4.45   0  27    27  1.29     1.50 0.38
## traitanx   12 231  39.01  9.52     38   38.36  8.90  22  71    49  0.67     0.47 0.63
## stateanx   13 231  39.85 11.48     38   38.92 10.38  21  79    58  0.72    -0.01 0.76
lowerCor(epi.bfi)
##          epiE  epiS  epImp epili epiNr bfagr bfcon bfext bfner bfopn bdi   trtnx sttnx
## epiE      1.00                                                                        
## epiS      0.85  1.00                                                                  
## epiImp    0.80  0.43  1.00                                                            
## epilie   -0.22 -0.05 -0.24  1.00                                                      
## epiNeur  -0.18 -0.22 -0.07 -0.25  1.00                                                
## bfagree   0.18  0.20  0.08  0.17 -0.08  1.00                                          
## bfcon    -0.11  0.05 -0.24  0.23 -0.13  0.45  1.00                                    
## bfext     0.54  0.58  0.35 -0.04 -0.17  0.48  0.27  1.00                              
## bfneur   -0.09 -0.07 -0.09 -0.22  0.63 -0.04  0.04  0.04  1.00                        
## bfopen    0.14  0.15  0.07 -0.03  0.09  0.39  0.31  0.46  0.29  1.00                  
## bdi      -0.16 -0.13 -0.11 -0.20  0.58 -0.14 -0.18 -0.14  0.47 -0.08  1.00            
## traitanx -0.23 -0.26 -0.12 -0.23  0.73 -0.31 -0.29 -0.39  0.59 -0.11  0.65  1.00      
## stateanx -0.13 -0.12 -0.09 -0.15  0.49 -0.19 -0.14 -0.15  0.49 -0.04  0.61  0.57  1.00
corPlot(epi.bfi)

Multiple regression predicts Y from multiple Xs

\(\hat{Y}= \mu + \beta X\)

Two functions to do this:

lm (for linear model) in core R requires complete data and gives very useful diagnostics.

lmCor (in psych) does not require complete data, will work from a correlation matrix, and draws regression path diagrams, but it does not give the diagnostics of lm.
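For standardized variables the regression weights are just \(\beta = R_{xx}^{-1} r_{xy}\), which is essentially what lmCor computes when given a correlation matrix. A sketch for the model we fit below:

R <- cor(epi.bfi[c("epiNeur","traitanx","bdi")], use="pairwise")
beta <- solve(R[1:2, 1:2], R[1:2, 3])  #R_xx^{-1} r_xy
round(beta, 2)                         #about .22 and .50, matching the lmCor slopes below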

mod <- lm(bdi ~ epiNeur + traitanx,data= epi.bfi)
mod  #just the coefficients
## 
## Call:
## lm(formula = bdi ~ epiNeur + traitanx, data = epi.bfi)
## 
## Coefficients:
## (Intercept)      epiNeur     traitanx  
##     -7.6378       0.2551       0.3015
summary(mod)  #more useful information
## 
## Call:
## lm(formula = bdi ~ epiNeur + traitanx, data = epi.bfi)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0288  -2.6848  -0.5121   1.9823  13.3081 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.63779    1.24729  -6.124 3.97e-09 ***
## epiNeur      0.25506    0.08448   3.019  0.00282 ** 
## traitanx     0.30151    0.04347   6.935 4.13e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.299 on 228 degrees of freedom
## Multiple R-squared:  0.4507, Adjusted R-squared:  0.4459 
## F-statistic: 93.53 on 2 and 228 DF,  p-value: < 2.2e-16

Compare this to lmCor

mod1 <- lmCor(bdi ~ epiNeur + traitanx,data= epi.bfi)

mod1
## Call: lmCor(y = bdi ~ epiNeur + traitanx, data = epi.bfi)
## 
## Multiple Regression from raw data 
## 
##  DV =  bdi 
##             slope   se    t       p lower.ci upper.ci  VIF Vy.x
## (Intercept)  0.00 0.05 0.00 1.0e+00    -0.10     0.10 1.00 0.00
## epiNeur      0.22 0.07 3.02 2.8e-03     0.08     0.36 2.13 0.13
## traitanx     0.50 0.07 6.94 4.1e-11     0.36     0.64 2.13 0.33
## 
## Residual Standard Error =  0.74  with  228  degrees of freedom
## 
##  Multiple Regression
##        R   R2  Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2        p
## bdi 0.67 0.45 0.66 0.44        0.45     0.05     93.53   2 228 2.19e-30
summary(mod1)
## 
## Multiple Regression from raw data 
## lmCor(y = bdi ~ epiNeur + traitanx, data = epi.bfi)
## 
## Multiple Regression from matrix input 
## 
## Beta weights 
##              bdi
## (Intercept) 0.00
## epiNeur     0.22
## traitanx    0.50
## 
## Multiple R 
##  bdi 
## 0.67 
## 
## Multiple R2 
##  bdi 
## 0.45 
## 
## Cohen's set correlation R2 
## [1] 0.45
## 
## Squared Canonical Correlations
## NULL

By default, lmCor standardizes the variables

mod1 <- lmCor(bdi ~ epiNeur + traitanx,data= epi.bfi,std=FALSE)

mod1
## Call: lmCor(y = bdi ~ epiNeur + traitanx, data = epi.bfi, std = FALSE)
## 
## Multiple Regression from raw data 
## 
##  DV =  bdi 
##             slope   se     t       p lower.ci upper.ci  VIF Vy.x
## (Intercept) -7.64 1.25 -6.12 4.0e-09   -10.10    -5.18 1.00 0.00
## epiNeur      0.26 0.08  3.02 2.8e-03     0.09     0.42 2.13 0.13
## traitanx     0.30 0.04  6.94 4.1e-11     0.22     0.39 2.13 0.33
## 
## Residual Standard Error =  4.3  with  228  degrees of freedom
## 
##  Multiple Regression
##        R   R2  Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2        p
## bdi 0.67 0.45 0.58 0.34        0.45     0.05     93.53   2 228 2.19e-30
summary(mod1)
## 
## Multiple Regression from raw data 
## lmCor(y = bdi ~ epiNeur + traitanx, data = epi.bfi, std = FALSE)
## 
## Multiple Regression from matrix input 
## 
## Beta weights 
##               bdi
## (Intercept) -7.64
## epiNeur      0.26
## traitanx     0.30
## 
## Multiple R 
##  bdi 
## 0.67 
## 
## Multiple R2 
##  bdi 
## 0.45 
## 
## Cohen's set correlation R2 
## [1] -50068
## 
## Squared Canonical Correlations
## NULL

Mediation and Moderation

If some DVs are predicted by some IVs, but the effect is thought to go through some intervening variable, we can think of that variable as mediating the IV-DV relationship. For instance, caffeine leads to an increase in cognitive performance; caffeine also increases arousal, and arousal is related to increased performance. These relationships may be seen in the toy example:

toy <- structure(c(1, 0.6, 0.36, 0.6, 1, 0.6, 0.36, 0.6, 1), .Dim = c(3L, 
3L), .Dimnames = list(c("caffeine", "arousal", "performance"), 
    c("caffeine", "arousal", "performance")))
lowerMat(toy)
##             caffn arosl prfrm
## caffeine    1.00             
## arousal     0.60  1.00       
## performance 0.36  0.60  1.00

We can test the mediation effect using the mediate function which treats the model as a regression model but reports on the indirect effects of the mediator. Note that mediation models logically require the possibility of mediation, but this is not part of the statistical model.
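The arithmetic is easy to verify by hand from the toy matrix: path a is the simple correlation of the IV with the mediator, path b is the partial regression weight of the mediator predicting the DV, and the indirect effect ab is their product (a sketch):

a <- toy["caffeine", "arousal"]                      #path a = .6
b <- solve(toy[1:2, 1:2], toy[1:2, 3])[["arousal"]]  #path b = .6, holding caffeine constant
total <- toy["caffeine", "performance"]              #the total effect c = .36
ab <- a*b                                            #the indirect effect
c(total=total, direct=total - ab, indirect=ab)       #direct effect c' = c - ab = 0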

We compare two mediation models. One that makes sense (arousal mediates the effect of caffeine) and one that does not (caffeine mediates the effect of arousal).

#First show it as a regression model
lmCor(performance ~ caffeine, data= toy, n.obs=100)  #The simple model

## Call: lmCor(y = performance ~ caffeine, data = toy, n.obs = 100)
## 
## Multiple Regression from matrix input 
## 
##  DV =  performance 
##          slope   se    t       p lower.ci upper.ci VIF Vy.x
## caffeine  0.36 0.09 3.82 0.00023     0.17     0.55   1 0.13
## 
## Residual Standard Error =  0.94  with  98  degrees of freedom
## 
##  Multiple Regression
##                R   R2  Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2        p
## performance 0.36 0.13 0.36 0.13        0.12     0.06     14.59   1  98 0.000234
lmCor(performance ~ caffeine + arousal, data= toy, n.obs=100) #perhaps a better model

## Call: lmCor(y = performance ~ caffeine + arousal, data = toy, n.obs = 100)
## 
## Multiple Regression from matrix input 
## 
##  DV =  performance 
##          slope  se    t       p lower.ci upper.ci  VIF Vy.x
## caffeine   0.0 0.1 0.00 1.0e+00     -0.2      0.2 1.56 0.00
## arousal    0.6 0.1 5.91 5.1e-08      0.4      0.8 1.56 0.36
## 
## Residual Standard Error =  0.81  with  97  degrees of freedom
## 
##  Multiple Regression
##               R   R2  Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2        p
## performance 0.6 0.36 0.54 0.29        0.35     0.07     27.28   2  97 3.98e-10
mod1 <- mediate(performance ~ caffeine + (arousal), data= toy)
## The data matrix was a correlation matrix and the number of subjects was not specified. 
##  n.obs arbitrarily set to 1000
## The replication data matrices were simulated based upon the specified number of subjects and the observed correlation matrix.

mod1  #show the results
## 
## Mediation/Moderation Analysis 
## Call: mediate(y = performance ~ caffeine + (arousal), data = toy)
## 
## The DV (Y) was  performance . The IV (X) was  caffeine . The mediating variable(s) =  arousal .
## 
## Total effect(c) of  caffeine  on  performance  =  0.36   S.E. =  0.03  t  =  12.19  df=  998   with p =  5.8e-32
## Direct effect (c') of  caffeine  on  performance  removing  arousal  =  0   S.E. =  0.03  t  =  0  df=  997   with p =  1
## Indirect effect (ab) of  caffeine  on  performance  through  arousal   =  0.36 
## Mean bootstrapped indirect effect =  0.36  with standard error =  0.02  Lower CI =  0.32    Upper CI =  0.4
## R = 0.6 R2 = 0.36   F = 280.41 on 2 and 997 DF   p-value:  6.02e-132 
## 
##  To see the longer output, specify short = FALSE in the print statement or ask for the summary
#But what if we change the order of the variables?
mod2 <- mediate(performance ~ arousal + (caffeine), data= toy,plot=FALSE)
## The data matrix was a correlation matrix and the number of subjects was not specified. 
##  n.obs arbitrarily set to 1000
## The replication data matrices were simulated based upon the specified number of subjects and the observed correlation matrix.
mod2  #and now show the incorrect model
## 
## Mediation/Moderation Analysis 
## Call: mediate(y = performance ~ arousal + (caffeine), data = toy, plot = FALSE)
## 
## The DV (Y) was  performance . The IV (X) was  arousal . The mediating variable(s) =  caffeine .
## 
## Total effect(c) of  arousal  on  performance  =  0.6   S.E. =  0.03  t  =  23.69  df=  998   with p =  8.1e-99
## Direct effect (c') of  arousal  on  performance  removing  caffeine  =  0.6   S.E. =  0.03  t  =  18.95  df=  997   with p =  1.3e-68
## Indirect effect (ab) of  arousal  on  performance  through  caffeine   =  0 
## Mean bootstrapped indirect effect =  -0.01  with standard error =  0.02  Lower CI =  -0.05    Upper CI =  0.03
## R = 0.6 R2 = 0.36   F = 280.41 on 2 and 997 DF   p-value:  6.02e-132 
## 
##  To see the longer output, specify short = FALSE in the print statement or ask for the summary
mediate.diagram(mod2,main="An incorrect model")