If we describe a data set (e.g., sat.act) we can express all scores as standard scores, z: \[z = \frac{X - \bar{X}}{\sqrt{Var_x}}\]
If the data are normally distributed, then we would expect the distribution of z to be normal, with few observations where \(|z| > 3\).
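A minimal sketch of this in R, using the base scale function on the sat.act data from psych:
library(psych)
z <- scale(sat.act) #column-wise standard scores
colSums(abs(z) > 3, na.rm=TRUE) #how many |z| > 3 for each variable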
But what if we have multiple variables? We can find the multivariate equivalent of z by finding the Mahalanobis distance, which considers the squared distances from the centroid, weighting each variable so that the variables are treated as independent. We define \(\Sigma\) as the variance/covariance matrix of the X variables. The inverse of \(\Sigma\), written \(\Sigma^{-1}\), weights each variable by the part of it that is independent of the other variables.
The inverse of the correlation matrix partials out each variable from all the other variables. The result is that we are considering the independent effects:
\[D^2 = ({X}_{ij} - \bar{X_j})' \Sigma^{-1} ({X}_{ij} - \bar{X_j})\] That is, find how much each subject differs from the mean on each variable (a generalization of a deviation score from the mean to a deviation from the multivariate mean), and then weight each difference by the inverse of the variance/covariance matrix of the variables.
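For intuition, the same \(D^2\) may be computed directly with base R's mahalanobis function; a minimal sketch using only the complete cases (SATQ has missing values, which the outlier function below handles more gracefully):
sat <- na.omit(sat.act) #complete cases only
D2.hand <- mahalanobis(sat, center=colMeans(sat), cov=cov(sat))
head(sort(D2.hand, decreasing=TRUE)) #the most extreme cases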
We will use the outlier function in psych to do this, and to plot the results. Then we will examine how it does it.
A similar calculation is done when we find the cohen.d between two groups on multiple variables.
\(D^2\) is analogous to the \(R^2\) of multiple regression.
library(psych)
library(psychTools)
d2 <- outlier(sat.act) #finds the outliers and draws a graph
describe(d2)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 700 5.99 4.59 4.86 5.25 2.86 0.71 42.3 41.6 2.75 12.19 0.17
The outlier function finds the \(D^2\) statistic and then plots it against the expected values of a \(\chi^2\) distribution. (Remember that a \(\chi^2\) with one degree of freedom is just a squared z score.) The plot used is known as a qq (quantile-quantile) plot: it sorts both sets of values and then plots the quantiles of one against the quantiles of the other.
We show this first for 10,000 normal deviates and their squares, plotted against the quantiles of \(\chi^2\).
set.seed(42)
n.obs <- 10000
z <- rnorm(n.obs) #n.obs normal deviates
z2 <- z^2 #and their squares
Chi2 <- qchisq(ppoints(n.obs), df =1)
qqplot(Chi2,z2,main="A quantile plot of squared normal deviates")
abline(0, 1, col = "gray") #show the equality line
Now, we do it for our real data.
n.obs <- NROW(d2)
nvar <- 6
Chi2 <- qchisq(ppoints(n.obs), df =nvar)
qqplot(Chi2, d2, main = expression("Q-Q plot of Mahalanobis" *
~D^2 * " vs. quantiles of" * ~chi[nvar]^2))
abline(0, 1, col = "gray")
Who are these unusual people? Join sat.act to the d2 object using the cbind function, and plot it.
sat.d2 <- cbind(sat.act,d2) #combine the columns to make a new variable
#sat.d2 <- data.frame(sat.act,d2) #this does the same
pairs.panels(sat.d2,bg=c("yellow","blue")[(d2 > 25)+1],pch=21,gap=0)
The order function takes a vector and returns the indices that would sort it, in either increasing or decreasing order.
ord <- order(d2,decreasing=TRUE)
sat.d2 <- sat.d2[ord,] #rearrange them to be in the order of the d2 variable
headTail(sat.d2,top=10,bottom=5)
## gender education age ACT SATV SATQ d2
## 33337 2 3 23 35 200 200 42.3
## 36093 1 2 22 3 200 400 35.12
## 35186 2 1 17 35 350 230 34.65
## 30980 2 5 65 23 700 650 28.04
## 38502 2 5 61 30 300 350 26.29
## 35088 2 2 57 35 650 750 25.93
## 31299 1 1 45 35 450 500 23.99
## 30899 2 3 20 30 240 600 23.53
## 33547 1 2 55 25 550 700 23.51
## 29728 2 2 55 28 620 450 22.82
## ... ... ... ... ... ... ... ...
## 39242 2 3 21 27 610 <NA> 0.94
## 35949 2 3 22 30 650 650 0.9
## 30455 2 4 29 29 600 600 0.86
## 33130 2 3 22 27 570 570 0.82
## 38899 2 3 22 28 600 600 0.71
By inspection, it seems as if the data for the first four subjects are probably incorrect; the others are somewhat questionable.
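If we wanted to set such cases aside before further analysis, one approach is to flag everything beyond a conventional \(\chi^2\) cutoff (a hedged sketch, using p < .001 with 6 df):
cutoff <- qchisq(1 - .001, df=6) #about 22.5
suspect <- which(d2 > cutoff) #flag the improbably distant cases
sat.clean <- sat.act[-suspect,] #and drop them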
The order function will give the order for one column. The dfOrder function will do this for multiple columns at once.
new.sat <- dfOrder(sat.act) #sort the entire file
headTail(new.sat,top=10,bottom=10)
## gender education age ACT SATV SATQ
## 34895 1 0 14 30 600 600
## 37071 1 0 15 32 650 450
## 37067 1 0 16 20 450 450
## 34731 1 0 16 24 540 540
## 31730 1 0 16 24 800 800
## 31741 1 0 16 26 800 800
## 34284 1 0 16 35 700 750
## 37357 1 0 16 36 765 677
## 30444 1 0 17 22 490 530
## 36703 1 0 17 28 420 530
## ... ... ... ... ... ... ...
## 30030 2 5 52 24 630 570
## 37317 2 5 56 34 730 700
## 31367 2 5 57 23 650 400
## 31982 2 5 57 32 603 695
## 31758 2 5 58 26 740 700
## 32406 2 5 58 27 550 575
## 38502 2 5 61 30 300 350
## 31454 2 5 61 31 620 700
## 31376 2 5 62 36 750 790
## 30980 2 5 65 23 700 650
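dfOrder can also sort on chosen columns; a sketch, assuming (as its help page describes) that negative column numbers sort in decreasing order:
new.sat2 <- dfOrder(sat.act, c(1,-3)) #gender ascending, then age descending
headTail(new.sat2)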
There are many ways of reporting how two groups differ. Cohen’s d statistic is just the difference of the means expressed in terms of the pooled within-group standard deviation. It is insensitive to sample size.
r is a universal measure of effect size that is a simple function of d, but is bounded between -1 and 1.
The t statistic is merely d * sqrt(n)/2 and thus reflects sample size.
Confidence intervals for Cohen’s d may be found by converting the d to a t, finding the confidence intervals for t, and then converting those back to ds. This takes advantage of the uniroot function and the non-centrality parameter of the t distribution.
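These conversions are simple enough to sketch directly (assuming two groups of equal size):
d2t <- function(d,n) d*sqrt(n)/2 #t grows with the total sample size n
d2r <- function(d) d/sqrt(d^2 + 4) #r is bounded between -1 and 1
d2t(.5, n=100)
d2t(.5, n=400) #the same d, a larger t
d2r(.5)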
See cohen.d. We use our sat.act data set for an example.
library(psych)
cohen.d(sat.act,"gender") # test on all the subjects
## Call: cohen.d(x = sat.act, group = "gender")
## Cohen d statistic of difference between two means
## lower effect upper
## education 0.03 0.18 0.34
## age -0.20 -0.04 0.11
## ACT -0.23 -0.08 0.08
## SATV -0.19 -0.04 0.12
## SATQ -0.51 -0.35 -0.19
##
## Multivariate (Mahalanobis) distance between groups
## [1] 0.52
## r equivalent of difference between two means
## education age ACT SATV SATQ
## 0.09 -0.02 -0.04 -0.02 -0.17
cohen.d(sat.act[1:40,],"gender") #and then just the first 40 subjects
## Call: cohen.d(x = sat.act[1:40, ], group = "gender")
## Cohen d statistic of difference between two means
## lower effect upper
## education -0.15 0.49 1.12
## age -0.60 0.03 0.65
## ACT -0.49 0.14 0.76
## SATV -0.61 0.02 0.64
## SATQ -1.10 -0.47 0.17
##
## Multivariate (Mahalanobis) distance between groups
## [1] 0.82
## r equivalent of difference between two means
## education age ACT SATV SATQ
## 0.24 0.01 0.07 0.01 -0.23
#vs
t.test(education ~ gender,data=sat.act)
##
## Welch Two Sample t-test
##
## data: education by gender
## t = -2.2299, df = 453.96, p-value = 0.02624
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -0.48935928 -0.03087916
## sample estimates:
## mean in group 1 mean in group 2
## 2.995951 3.256071
t.test(education ~ gender,data=sat.act[1:40,]) #the M.dist function is biased by sample size
##
## Welch Two Sample t-test
##
## data: education by gender
## t = -1.4356, df = 28.257, p-value = 0.1621
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -1.3601898 0.2389777
## sample estimates:
## mean in group 1 mean in group 2
## 3.166667 3.727273
The cohen.d function returns all kinds of output. We just showed the summary.
d <- cohen.d(sat.act,"gender")
names(d)
## [1] "cohen.d" "hedges.g" "M.dist" "r" "t" "n"
## [7] "p" "wt.d" "descriptive" "se" "dict" "order"
## [13] "Call"
names(d$descriptive)
## [1] "mean" "sd" "n" "F" "ICC1" "ICC2" "ci1" "ci2" "raw" "rbg" "pbg" "rwg"
## [13] "nw" "ci.wg" "pwg" "etabg" "etawg" "nwg" "nG" "Call"
d$descriptive$mean
## gender education age ACT SATV SATQ
## 1 1 2.995951 25.86235 28.78543 615.1134 635.8735
## 2 2 3.256071 25.44812 28.41722 610.6645 595.9955
The error.dots function takes the cohen.d results and displays them graphically.
error.dots(d,main="effect of gender on the sat.act data")
abline(v=0) #add a line to show no difference
The M.dist function (as used in cohen.d) is sensitive to the number of variables and the number of subjects. We can get a confidence interval around an effect size of 0 by simulation.
#First create a small function
cohen.d.expected <- function(x, group, n.rep=50) {
  summary <- list() #we will put the results here
  n.obs <- nrow(x)
  observed <- cohen.d(x=x, group=group)$M.dist
  ind <- 1:n.obs
  for(i in 1:n.rep) {
    samp <- sample(ind, n.obs, replace=FALSE) #a random permutation of the subjects
    x[,group] <- x[samp,group] #shuffle the group labels
    summary[[i]] <- cohen.d(x, group)$M.dist
  }
  result <- unlist(summary)
  mean.boot <- mean(result)
  sd.boot <- sd(result)
  result <- list(observed=observed, mean=mean.boot, sd=sd.boot, trials=result)
  return(result)
}
test <- cohen.d.expected(sat.act,group="gender", n.rep=1000)
test$observed
## [1] 0.5158626
describe(test$trials,quant= c(.9,.95,.99))
## vars n mean sd median trimmed mad min max range skew kurtosis se Q0.9 Q0.95 Q0.99
## X1 1 1000 0.17 0.06 0.17 0.17 0.06 0.04 0.38 0.34 0.42 0.05 0 0.25 0.27 0.31
hist(test$trials,breaks=41)
##do it again, but on a smaller sample
test <- cohen.d.expected(sat.act[1:40,],group="gender", n.rep=1000)
test$observed
## [1] 0.8241425
describe(test$trials,quant= c(.9,.95,.99))
## vars n mean sd median trimmed mad min max range skew kurtosis se Q0.9 Q0.95 Q0.99
## X1 1 1000 0.76 0.26 0.73 0.75 0.26 0.16 1.97 1.81 0.56 0.3 0.01 1.12 1.24 1.43
hist(test$trials,breaks=41)
We see that for 700 subjects, the confidence interval around 0 for the expected \(D^2\) is much less than what we observe. But for 40 subjects, although the observed \(D^2\) is bigger, so is the range of chance results. We cannot reject the hypothesis that the groups do not differ, even though the \(D^2\) is bigger.
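An empirical p value makes the same point: the fraction of permuted values at least as large as the observed one.
mean(test$trials >= test$observed) #empirical p for the observed M.dist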
library(psych)
data(galton)
galton.tab <- table(galton) #make a table from the data
galton.tab[order(rank(rownames(galton.tab)),decreasing=TRUE),] #sort it by decreasing row values
## child
## parent 61.7 62.2 63.2 64.2 65.2 66.2 67.2 68.2 69.2 70.2 71.2 72.2 73.2 73.7
## 73 0 0 0 0 0 0 0 0 0 0 0 1 3 0
## 72.5 0 0 0 0 0 0 0 1 2 1 2 7 2 4
## 71.5 0 0 0 0 1 3 4 3 5 10 4 9 2 2
## 70.5 1 0 1 0 1 1 3 12 18 14 7 4 3 3
## 69.5 0 0 1 16 4 17 27 20 33 25 20 11 4 5
## 68.5 1 0 7 11 16 25 31 34 48 21 18 4 3 0
## 67.5 0 3 5 14 15 36 38 28 38 19 11 4 0 0
## 66.5 0 3 3 5 2 17 17 14 13 4 0 0 0 0
## 65.5 1 0 9 5 7 11 11 7 7 5 2 1 0 0
## 64.5 1 1 4 4 1 5 5 0 2 0 0 0 0 0
## 64 1 0 2 4 1 2 2 1 1 0 0 0 0 0
#show the scatter plot and the lowess fit
pairs.panels(galton,main="Galton's Parent child heights")
#but this makes the regression lines look the same
pairs.panels(galton,lm=TRUE,main="Galton's Parent child heights")
#better is to scale them
pairs.panels(galton,lm=TRUE,xlim=c(62,74),ylim=c(62,74),main="Galton's Parent child heights")
pairs.panels(galton,lm=TRUE,xlim=c(62,74),ylim=c(62,74),cor=FALSE,main="Galton's Parent child heights")
#now add some noise
pairs.panels(galton,lm=TRUE,xlim=c(62,74),ylim=c(62,74),cor=FALSE,main="Galton's Parent child heights",jiggle=TRUE)
#and even more noise
pairs.panels(galton,lm=TRUE,xlim=c(62,74),ylim=c(62,74),cor=FALSE,main="Galton's Parent child heights",jiggle=TRUE,factor=4)
# Let's explore multiple regression
We will create some data to have certain properties. We use the sim.structural function from psych.
set.seed(42)
fx <-matrix(c( .9,.8,.6,rep(0,4),.6,.8,-.7),ncol=2)
fy <- matrix(c(.6,.5,.4),ncol=1)
rownames(fx) <- c("GREV","GREQ","GREA","nach","Anx")
rownames(fy)<- c("gpa","Pre","MA")
Phi <-matrix( c(1,0,.7,.0,1,.7,.7,.7,1),ncol=3)
gre.gpa <- sim.structural(fx,Phi,fy,n=1000)
grad.sim <- gre.gpa$observed
Or we can read the file from a remote server.
datafilename="http://personality-project.org/r/datasets/psychometrics.prob2.txt"
grad <- read.table(datafilename,header=TRUE) #read the data file using core R
grad <- read.file(datafilename) #using psychTools
## Data from the .txt file http://personality-project.org/r/datasets/psychometrics.prob2.txt has been loaded.
# remember to always do dim and describe to know your data
#grad <- gre.gpa$observed
dim(grad)
## [1] 1000 9
describe(grad)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## ID 1 1000 500.50 288.82 500.50 500.50 370.65 1.0 1000.00 999.00 0.00 -1.20 9.13
## GREV 2 1000 499.77 106.11 497.50 498.74 106.01 138.0 873.00 735.00 0.09 -0.07 3.36
## GREQ 3 1000 500.53 103.85 498.00 498.51 105.26 191.0 914.00 723.00 0.22 0.08 3.28
## GREA 4 1000 498.13 100.45 495.00 498.67 99.33 207.0 848.00 641.00 -0.02 -0.06 3.18
## Ach 5 1000 49.93 9.84 50.00 49.88 10.38 16.0 79.00 63.00 0.00 0.02 0.31
## Anx 6 1000 50.32 9.91 50.00 50.43 10.38 14.0 78.00 64.00 -0.14 0.14 0.31
## Prelim 7 1000 10.03 1.06 10.00 10.02 1.48 7.0 13.00 6.00 -0.02 -0.01 0.03
## GPA 8 1000 4.00 0.50 4.02 4.01 0.53 2.5 5.38 2.88 -0.07 -0.29 0.02
## MA 9 1000 3.00 0.49 3.00 3.00 0.44 1.4 4.50 3.10 -0.07 -0.09 0.02
We can convert them to ‘typical’ scores by using the rescale function (from psych).
sds <- c(100,100,100,10,10,1,.5,.5)
means <- c(500,500,500,50,50,10,4,3)
grad.sim <- rescale(grad.sim,mean=means,sd=sds)
describe(grad.sim)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## GREV 1 1000 500 100.0 498.53 499.31 100.39 186.68 844.79 658.11 0.08 -0.08 3.16
## GREQ 2 1000 500 100.0 493.67 499.42 99.44 230.65 836.79 606.14 0.07 -0.11 3.16
## GREA 3 1000 500 100.0 498.14 500.11 98.79 124.46 801.40 676.95 -0.01 0.07 3.16
## nach 4 1000 50 10.0 49.89 49.97 9.84 21.35 79.39 58.03 0.01 -0.05 0.32
## Anx 5 1000 50 10.0 49.92 49.96 9.63 16.04 82.15 66.11 0.01 0.32 0.32
## gpa 6 1000 10 1.0 10.02 10.00 1.04 6.87 12.98 6.12 0.01 -0.20 0.03
## Pre 7 1000 4 0.5 4.00 4.00 0.51 2.12 5.68 3.57 0.00 0.00 0.02
## MA 8 1000 3 0.5 3.00 3.00 0.50 1.17 4.64 3.47 -0.09 0.09 0.02
Look at the help menu to see what it does, or look at the code to see how it works.
rescale
## function (x, mean = 100, sd = 15, df = TRUE)
## {
## if (df) {
## x <- data.frame(t(t(scale(x)) * sd + mean))
## }
## else {
## x <- t(t(scale(x)) * sd + mean)
## }
## return(x)
## }
## <bytecode: 0x11bac9e08>
## <environment: namespace:psych>
What happens to the regression when we restrict the range by selecting just the high scorers?
gradq <- subset(grad,grad[2]>700) #choose the subset of high scorers
gradq <- as.data.frame(gradq)
with(gradq,lm(GREV ~ GREQ)) #do the regression
##
## Call:
## lm(formula = GREV ~ GREQ)
##
## Coefficients:
## (Intercept) GREQ
## 583.3783 0.2307
op <- par(mfrow=c(1,2)) #two panel graph
with(grad,{
plot(GREV ~ GREQ,xlim=c(200,800),main='Original data', pch=16)
abline(lm(GREV ~ GREQ))
})
text(300,500,'r = .46 b = .56')
with(gradq,{
plot(GREV ~ GREQ,ylim=c(200,800),xlim=c(200,800),main='GRE Q > 700',pch=16)
abline(lm(GREV ~ GREQ))
})
text(300,500,'r = .18 b = .50')
op <- par(mfrow=c(1,1)) #switch back to one panel
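To see the restriction of range numerically, we can compare the two correlations directly:
with(grad, cor(GREV, GREQ)) #full sample
with(gradq, cor(GREV, GREQ)) #restricted sample: attenuated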
Although pairs.panels is useful for smaller sets of variables, a heat map is useful for larger matrices. We use corPlot on the bfi data set.
corPlot(bfi)
#now, do it again, with numeric values as well
corPlot(bfi,numbers=TRUE)
#perhaps cleaner is to just show the lower off diagonal
corPlot(bfi,numbers=TRUE,upper=FALSE)
We had scaled the numbers to reflect significance. Turn this option off:
corPlot(bfi,numbers=TRUE,upper=FALSE,scale=FALSE) #Turn off the scaling
Graphic functions have all kinds of extra parameters you can pass to them (see ?plot). We can rotate the labels on the x axis to be perpendicular to it by using the las=3 option.
corPlot(bfi,numbers=TRUE,upper=FALSE,scale=FALSE,las=3) #no scaling, vertical labels
corPlot(bfi[1:10],stars=TRUE,numbers=TRUE,upper=FALSE,diag=FALSE,las=3)
We can find the confidence intervals of our correlations, either by bootstrap or by normal theory.
bfi.ci <- corCi(bfi[1:10])
bfi.ci #show it
## Call:corCi(x = bfi[1:10])
##
## Coefficients and bootstrapped confidence intervals
## A1 A2 A3 A4 A5 C1 C2 C3 C4 C5
## A1 1.00
## A2 -0.34 1.00
## A3 -0.27 0.49 1.00
## A4 -0.15 0.34 0.36 1.00
## A5 -0.18 0.39 0.50 0.31 1.00
## C1 0.03 0.09 0.10 0.09 0.12 1.00
## C2 0.02 0.14 0.14 0.23 0.11 0.43 1.00
## C3 -0.02 0.19 0.13 0.13 0.13 0.31 0.36 1.00
## C4 0.13 -0.15 -0.12 -0.15 -0.13 -0.34 -0.38 -0.34 1.00
## C5 0.05 -0.12 -0.16 -0.24 -0.17 -0.25 -0.30 -0.34 0.48 1.00
##
## scale correlations and bootstrapped confidence intervals
## lower.emp lower.norm estimate upper.norm upper.emp p
## A1-A2 -0.37 -0.37 -0.34 -0.30 -0.30 0.00
## A1-A3 -0.30 -0.30 -0.27 -0.22 -0.22 0.00
## A1-A4 -0.19 -0.19 -0.15 -0.11 -0.11 0.00
## A1-A5 -0.21 -0.22 -0.18 -0.14 -0.13 0.00
## A1-C1 -0.01 -0.01 0.03 0.06 0.06 0.18
## A1-C2 -0.02 -0.02 0.02 0.05 0.05 0.44
## A1-C3 -0.06 -0.06 -0.02 0.02 0.03 0.37
## A1-C4 0.09 0.09 0.13 0.17 0.17 0.00
## A1-C5 0.01 0.01 0.05 0.09 0.10 0.02
## A2-A3 0.45 0.45 0.49 0.52 0.52 0.00
## A2-A4 0.30 0.29 0.34 0.37 0.37 0.00
## A2-A5 0.34 0.35 0.39 0.42 0.42 0.00
## A2-C1 0.06 0.05 0.09 0.14 0.14 0.00
## A2-C2 0.08 0.09 0.14 0.18 0.18 0.00
## A2-C3 0.14 0.15 0.19 0.24 0.24 0.00
## A2-C4 -0.18 -0.18 -0.15 -0.11 -0.11 0.00
## A2-C5 -0.16 -0.16 -0.12 -0.08 -0.08 0.00
## A3-A4 0.33 0.33 0.36 0.39 0.40 0.00
## A3-A5 0.47 0.47 0.50 0.54 0.54 0.00
## A3-C1 0.06 0.05 0.10 0.14 0.15 0.00
## A3-C2 0.11 0.10 0.14 0.18 0.18 0.00
## A3-C3 0.09 0.09 0.13 0.17 0.17 0.00
## A3-C4 -0.16 -0.16 -0.12 -0.08 -0.09 0.00
## A3-C5 -0.19 -0.20 -0.16 -0.12 -0.11 0.00
## A4-A5 0.27 0.27 0.31 0.34 0.34 0.00
## A4-C1 0.05 0.05 0.09 0.13 0.12 0.00
## A4-C2 0.19 0.19 0.23 0.27 0.27 0.00
## A4-C3 0.09 0.09 0.13 0.17 0.17 0.00
## A4-C4 -0.18 -0.19 -0.15 -0.12 -0.12 0.00
## A4-C5 -0.28 -0.28 -0.24 -0.20 -0.21 0.00
## A5-C1 0.08 0.08 0.12 0.16 0.16 0.00
## A5-C2 0.08 0.07 0.11 0.15 0.15 0.00
## A5-C3 0.09 0.09 0.13 0.18 0.17 0.00
## A5-C4 -0.16 -0.16 -0.13 -0.09 -0.09 0.00
## A5-C5 -0.22 -0.21 -0.17 -0.12 -0.12 0.00
## C1-C2 0.38 0.38 0.43 0.47 0.47 0.00
## C1-C3 0.26 0.26 0.31 0.35 0.35 0.00
## C1-C4 -0.38 -0.38 -0.34 -0.29 -0.30 0.00
## C1-C5 -0.29 -0.29 -0.25 -0.20 -0.20 0.00
## C2-C3 0.32 0.32 0.36 0.40 0.40 0.00
## C2-C4 -0.42 -0.42 -0.38 -0.34 -0.34 0.00
## C2-C5 -0.33 -0.34 -0.30 -0.26 -0.26 0.00
## C3-C4 -0.38 -0.38 -0.34 -0.29 -0.28 0.00
## C3-C5 -0.38 -0.38 -0.34 -0.31 -0.31 0.00
## C4-C5 0.44 0.44 0.48 0.51 0.51 0.00
corPlotUpperLowerCi(bfi.ci)
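For the normal-theory alternative, the corr.test function (also in psych) reports confidence intervals without bootstrapping; a sketch:
ct <- corr.test(bfi[1:10])
print(ct, short=FALSE) #short=FALSE shows the confidence intervals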
We use the epi.bfi data set as a convenient example.
describe(epi.bfi)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## epiE 1 231 13.33 4.14 14 13.49 4.45 1 22 21 -0.33 -0.06 0.27
## epiS 2 231 7.58 2.69 8 7.77 2.97 0 13 13 -0.57 -0.02 0.18
## epiImp 3 231 4.37 1.88 4 4.36 1.48 0 9 9 0.06 -0.62 0.12
## epilie 4 231 2.38 1.50 2 2.27 1.48 0 7 7 0.66 0.24 0.10
## epiNeur 5 231 10.41 4.90 10 10.39 4.45 0 23 23 0.06 -0.50 0.32
## bfagree 6 231 125.00 18.14 126 125.26 17.79 74 167 93 -0.21 -0.27 1.19
## bfcon 7 231 113.25 21.88 114 113.42 22.24 53 178 125 -0.02 0.23 1.44
## bfext 8 231 102.18 26.45 104 102.99 22.24 8 168 160 -0.41 0.51 1.74
## bfneur 9 231 87.97 23.34 90 87.70 23.72 34 152 118 0.07 -0.55 1.54
## bfopen 10 231 123.43 20.51 125 123.78 20.76 73 173 100 -0.16 -0.16 1.35
## bdi 11 231 6.78 5.78 6 5.97 4.45 0 27 27 1.29 1.50 0.38
## traitanx 12 231 39.01 9.52 38 38.36 8.90 22 71 49 0.67 0.47 0.63
## stateanx 13 231 39.85 11.48 38 38.92 10.38 21 79 58 0.72 -0.01 0.76
lowerCor(epi.bfi)
## epiE epiS epImp epili epiNr bfagr bfcon bfext bfner bfopn bdi trtnx sttnx
## epiE 1.00
## epiS 0.85 1.00
## epiImp 0.80 0.43 1.00
## epilie -0.22 -0.05 -0.24 1.00
## epiNeur -0.18 -0.22 -0.07 -0.25 1.00
## bfagree 0.18 0.20 0.08 0.17 -0.08 1.00
## bfcon -0.11 0.05 -0.24 0.23 -0.13 0.45 1.00
## bfext 0.54 0.58 0.35 -0.04 -0.17 0.48 0.27 1.00
## bfneur -0.09 -0.07 -0.09 -0.22 0.63 -0.04 0.04 0.04 1.00
## bfopen 0.14 0.15 0.07 -0.03 0.09 0.39 0.31 0.46 0.29 1.00
## bdi -0.16 -0.13 -0.11 -0.20 0.58 -0.14 -0.18 -0.14 0.47 -0.08 1.00
## traitanx -0.23 -0.26 -0.12 -0.23 0.73 -0.31 -0.29 -0.39 0.59 -0.11 0.65 1.00
## stateanx -0.13 -0.12 -0.09 -0.15 0.49 -0.19 -0.14 -0.15 0.49 -0.04 0.61 0.57 1.00
corPlot(epi.bfi)
The basic regression model is \(\hat{Y}= \mu + \beta X\).
Two functions to do this:
lm (for linear model) in core R requires complete data and gives very useful diagnostics.
lmCor (in psych) does not require complete data, will work from the correlation matrix, and draws regression path diagrams. It does not, however, give the diagnostics of lm.
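To illustrate the correlation-matrix option, a sketch (lowerCor returns the correlation matrix that it prints):
R <- lowerCor(epi.bfi) #the correlation matrix
lmCor(bdi ~ epiNeur + traitanx, data=R, n.obs=231)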
mod <- lm(bdi ~ epiNeur + traitanx,data= epi.bfi)
mod #just the coefficients
##
## Call:
## lm(formula = bdi ~ epiNeur + traitanx, data = epi.bfi)
##
## Coefficients:
## (Intercept) epiNeur traitanx
## -7.6378 0.2551 0.3015
summary(mod) #more useful information
##
## Call:
## lm(formula = bdi ~ epiNeur + traitanx, data = epi.bfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0288 -2.6848 -0.5121 1.9823 13.3081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.63779 1.24729 -6.124 3.97e-09 ***
## epiNeur 0.25506 0.08448 3.019 0.00282 **
## traitanx 0.30151 0.04347 6.935 4.13e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.299 on 228 degrees of freedom
## Multiple R-squared: 0.4507, Adjusted R-squared: 0.4459
## F-statistic: 93.53 on 2 and 228 DF, p-value: < 2.2e-16
Compare this to lmCor.
mod1 <- lmCor(bdi ~ epiNeur + traitanx,data= epi.bfi)
mod1
## Call: lmCor(y = bdi ~ epiNeur + traitanx, data = epi.bfi)
##
## Multiple Regression from raw data
##
## DV = bdi
## slope se t p lower.ci upper.ci VIF Vy.x
## (Intercept) 0.00 0.05 0.00 1.0e+00 -0.10 0.10 1.00 0.00
## epiNeur 0.22 0.07 3.02 2.8e-03 0.08 0.36 2.13 0.13
## traitanx 0.50 0.07 6.94 4.1e-11 0.36 0.64 2.13 0.33
##
## Residual Standard Error = 0.74 with 228 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## bdi 0.67 0.45 0.66 0.44 0.45 0.05 93.53 2 228 2.19e-30
summary(mod1)
##
## Multiple Regression from raw data
## lmCor(y = bdi ~ epiNeur + traitanx, data = epi.bfi)
##
## Multiple Regression from matrix input
##
## Beta weights
## bdi
## (Intercept) 0.00
## epiNeur 0.22
## traitanx 0.50
##
## Multiple R
## bdi
## 0.67
##
## Multiple R2
## bdi
## 0.45
##
## Cohen's set correlation R2
## [1] 0.45
##
## Squared Canonical Correlations
## NULL
lmCor standardizes the variables by default. To see the raw (unstandardized) regression weights, specify std=FALSE.
mod1 <- lmCor(bdi ~ epiNeur + traitanx,data= epi.bfi,std=FALSE)
mod1
## Call: lmCor(y = bdi ~ epiNeur + traitanx, data = epi.bfi, std = FALSE)
##
## Multiple Regression from raw data
##
## DV = bdi
## slope se t p lower.ci upper.ci VIF Vy.x
## (Intercept) -7.64 1.25 -6.12 4.0e-09 -10.10 -5.18 1.00 0.00
## epiNeur 0.26 0.08 3.02 2.8e-03 0.09 0.42 2.13 0.13
## traitanx 0.30 0.04 6.94 4.1e-11 0.22 0.39 2.13 0.33
##
## Residual Standard Error = 4.3 with 228 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## bdi 0.67 0.45 0.58 0.34 0.45 0.05 93.53 2 228 2.19e-30
summary(mod1)
##
## Multiple Regression from raw data
## lmCor(y = bdi ~ epiNeur + traitanx, data = epi.bfi, std = FALSE)
##
## Multiple Regression from matrix input
##
## Beta weights
## bdi
## (Intercept) -7.64
## epiNeur 0.26
## traitanx 0.30
##
## Multiple R
## bdi
## 0.67
##
## Multiple R2
## bdi
## 0.45
##
## Cohen's set correlation R2
## [1] -50068
##
## Squared Canonical Correlations
## NULL
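The standardized and raw slopes are related by the ratio of the standard deviations; a quick check using the values reported above:
b.raw <- 0.2551 #the raw epiNeur slope from lm
b.raw * sd(epi.bfi$epiNeur)/sd(epi.bfi$bdi) #about .22, matching lmCor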
If some DVs are predicted by some IVs, but the effect is thought to go through some intervening variable, we can think of that variable as mediating the IV-DV relationship. For instance, caffeine leads to an increase in cognitive performance; caffeine also increases arousal, and arousal is related to increased performance. These relationships may be seen in the toy example:
toy <- structure(c(1, 0.6, 0.36, 0.6, 1, 0.6, 0.36, 0.6, 1), .Dim = c(3L,
3L), .Dimnames = list(c("caffeine", "arousal", "performance"),
c("caffeine", "arousal", "performance")))
lowerMat(toy)
## caffn arosl prfrm
## caffeine 1.00
## arousal 0.60 1.00
## performance 0.36 0.60 1.00
We can test the mediation effect using the mediate function, which treats the model as a regression model but reports on the indirect effects of the mediator. Note that mediation models logically require the possibility of mediation, but this is not part of the statistical model.
We compare two mediation models. One that makes sense (arousal mediates the effect of caffeine) and one that does not (caffeine mediates the effect of arousal).
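Before running it, note the arithmetic that the sensible model should recover: the indirect path is the product of its two legs.
#with r(caffeine, arousal) = .6 and a partial slope of performance on arousal of .6,
#the indirect effect ab = .6 * .6 = .36 equals the total effect, leaving a direct effect c' of 0
.6 * .6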
#First show it as a regression model
lmCor(performance ~ caffeine, data= toy, n.obs=100) #The simple model
## Call: lmCor(y = performance ~ caffeine, data = toy, n.obs = 100)
##
## Multiple Regression from matrix input
##
## DV = performance
## slope se t p lower.ci upper.ci VIF Vy.x
## caffeine 0.36 0.09 3.82 0.00023 0.17 0.55 1 0.13
##
## Residual Standard Error = 0.94 with 98 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## performance 0.36 0.13 0.36 0.13 0.12 0.06 14.59 1 98 0.000234
lmCor(performance ~ caffeine + arousal, data= toy, n.obs=100) #perhaps a better model
## Call: lmCor(y = performance ~ caffeine + arousal, data = toy, n.obs = 100)
##
## Multiple Regression from matrix input
##
## DV = performance
## slope se t p lower.ci upper.ci VIF Vy.x
## caffeine 0.0 0.1 0.00 1.0e+00 -0.2 0.2 1.56 0.00
## arousal 0.6 0.1 5.91 5.1e-08 0.4 0.8 1.56 0.36
##
## Residual Standard Error = 0.81 with 97 degrees of freedom
##
## Multiple Regression
## R R2 Ruw R2uw Shrunken R2 SE of R2 overall F df1 df2 p
## performance 0.6 0.36 0.54 0.29 0.35 0.07 27.28 2 97 3.98e-10
mod1 <- mediate(performance ~ caffeine + (arousal), data= toy)
## The data matrix was a correlation matrix and the number of subjects was not specified.
## n.obs arbitrarily set to 1000
## The replication data matrices were simulated based upon the specified number of subjects and the observed correlation matrix.
mod1 #show the results
##
## Mediation/Moderation Analysis
## Call: mediate(y = performance ~ caffeine + (arousal), data = toy)
##
## The DV (Y) was performance . The IV (X) was caffeine . The mediating variable(s) = arousal .
##
## Total effect(c) of caffeine on performance = 0.36 S.E. = 0.03 t = 12.19 df= 998 with p = 5.8e-32
## Direct effect (c') of caffeine on performance removing arousal = 0 S.E. = 0.03 t = 0 df= 997 with p = 1
## Indirect effect (ab) of caffeine on performance through arousal = 0.36
## Mean bootstrapped indirect effect = 0.36 with standard error = 0.02 Lower CI = 0.32 Upper CI = 0.4
## R = 0.6 R2 = 0.36 F = 280.41 on 2 and 997 DF p-value: 6.02e-132
##
## To see the longer output, specify short = FALSE in the print statement or ask for the summary
#But what if we change the order of the variables?
mod2 <- mediate(performance ~ arousal + (caffeine), data= toy,plot=FALSE)
## The data matrix was a correlation matrix and the number of subjects was not specified.
## n.obs arbitrarily set to 1000
## The replication data matrices were simulated based upon the specified number of subjects and the observed correlation matrix.
mod2 #and now show the incorrect model
##
## Mediation/Moderation Analysis
## Call: mediate(y = performance ~ arousal + (caffeine), data = toy, plot = FALSE)
##
## The DV (Y) was performance . The IV (X) was arousal . The mediating variable(s) = caffeine .
##
## Total effect(c) of arousal on performance = 0.6 S.E. = 0.03 t = 23.69 df= 998 with p = 8.1e-99
## Direct effect (c') of arousal on performance removing caffeine = 0.6 S.E. = 0.03 t = 18.95 df= 997 with p = 1.3e-68
## Indirect effect (ab) of arousal on performance through caffeine = 0
## Mean bootstrapped indirect effect = -0.01 with standard error = 0.02 Lower CI = -0.05 Upper CI = 0.03
## R = 0.6 R2 = 0.36 F = 280.41 on 2 and 997 DF p-value: 6.02e-132
##
## To see the longer output, specify short = FALSE in the print statement or ask for the summary
mediate.diagram(mod2,main="An incorrect model")