---
title: "350 Week 5 part 2: Regression and the Linear Model"
author: "William Revelle"
date: "04/06/20"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=100)
```
#Recent upgrade to psych

psych has been updated to 2.0.4
Get the latest release from the pmc server.  (It has been submitted to CRAN but is not there yet.)

The next line (to install the latest release) has been commented out because I have already updated my version.  However, you should do this once.

```{r}
#install.packages("psych",repos="http://personality-project.org/r",type="source")
library(psych) #make it active
library(psychTools)  #this one as well
 sessionInfo()   #make sure it is there
news(Version >  "2.0.0",package="psych") #read the news for this release
```

#The linear model

There are many forms of the linear model.

$$\hat{y} = b_1x + e$$  is the classic regression model.  If x is a dichotomous variable, this is just a t-test.  If x is categorical, then we can do this as an analysis of variance.  In this case, we think of x causing y.  The slope (b) is the ratio of the covariance of x and y divided by the variance of x.

$$b_1 = \frac{cov_{xy}}{var_x}$$   

$$= \frac{\sigma_{xy}}{\sigma_x^2} $$

If we think of y causing x, this becomes:

$$\hat{x} = b_1y + e$$  That is, y causes x.

If we are unsure of the direction of causality, we can find the geometric average of the two regressions and find

$$r_{xy} = \frac{cov_{xy}}{\sqrt{\sigma^2_x \sigma^2_y}} = cov_{z_x z_y}$$ which is the ratio of the covariance between x and y and the square root of the product of the two variances, or alternatively, the covariances of the standard scores. 

If we have more than one predictor, we have *Multiple regression*

$$\hat{y} = b_1 x_1 + b_2 x_2 + e$$  If $x_1$ and $x_2$ are categorical, this is also known as an analysis of variance.


## A path model representation of regression and multiple regression

First, just show the *npk* dataset from core R. Can you see any relationships here?

```{r}
npk   #The data come from Venables
```

As we do for any data set, we want to *describe* it and perhaps examine the pairwise relationships.  

```{r}
describe(npk) 
pairs.panels(npk)
```

When we describe the data, the variables are marked by an \* which indicates that they are not numeric (they are actuall *Factor*s with *levels*), but are treated by *describe* as numeric.   We can examine the *str* (structure) of the data to see this.

The *pairs.panels* function automatically converts *Factor* or *Character* values to *numeric*.


```{r}
str(npk)
```

Examining the correlations of the *npk* data we see that there are four experimental variables (IVs) and one dependent variable (DV). Because this is a balanced experiment, we see that the correlation of the IVs are all exactly zero and that they have different relationships with the DV.

Standardized regressions are just correlations.  We can show the relationships as a path model using the `setCor` function. By default, we show the standardized solution which expresses everything as correlations  (Left panel)  However we may also say  that we do not want standardized variables (right panel) to see the raw variance components.
```{r}
par(mfrow=c(1,2))  #set up a two panel graph
mod0 <- setCor(yield ~ N + P + K, data=npk,main="Standardized Regression")#show this as a figure
mod0 <- setCor(yield ~ N + P + K, data=npk,std=FALSE,main="Unstandardized Regression") #do not standardize
par(mfrow=c(1,1)) #set it back to a one panel graph
```

## the t-test examines one variable at a time

If the IV is categorical and has just two levels, we can examine the effect of the IV on the DV by doing a *t.test* 

```{r}
t.test(yield ~ K, data = npk)
```

The t-test is a special case of the linear model (in this case the X variable just has two values.) We can do the same analysis as the t-test but use the linear model function *lm*.

```{r}
model <- lm(yield ~ K, data = npk)
summary(model)
```

## Graphic displays
There are several ways of showing this result.  We can do an *error.bars* 

```{r}
error.bars.by(yield ~ K,data=npk)
```

Or we can use the *setCor* function to do the linear model and draw the graph.  Unfortunately, there is a bug (just discovevered) for the case of a single IV in setCor and we need to do a work around.  Note that the t statistic given for the regression slope is the same as the t statistic when doing the normal t-test. 


```{r}
tempR <- lowerCor(levels2numeric(npk))  #convert to numeric
setCor(yield ~ K, data = tempR, n.obs=24)  #do the regression from the correlations
```

#Analysis of variance and the linear model

Analysis of variance is a special case of the linear model.  We can see this when we compare two anovas with two multiple regresssions.  The first is just a single Independent Variable (IV) while the second one uses all three predictors.  The last analysis  (using *setCor*) does the same as the *lm* but draws a path model as well. 

```{r}

mod1 <- aov(yield ~ N,data=npk)
lm1 <- lm(yield ~ N, data=npk)
summary(mod1)
summary(lm1) #identical to the anova (in terms of p values, ts are sqrts of Fs)

mod2 <- aov(yield ~ N + P + K,data=npk)
lm2 <- lm(yield ~ N + P + K,data=npk)
summary(mod2)
summary(lm2) #also identical in terms of p values

#now do the analysis but draw a path model as well
setCor(yield ~ N + P + K,data=npk,main = "Regression Model using setCor")
```
Note that although the output looks very different, the t value's of the path coefficients are the same as the square roots of the F values in the aov model.  

##Regression with continuous variables

Regression is most appropriate with continuous predictors.  We do this for one predictor (SATV) and one DV (ACT).  We show the data points, fit with the best fitting linear model, and then show the model.

```{r}
plot(ACT ~ SATV, data= sat.act)  #show the points
abline(lm(ACT ~ SATV, data=sat.act)) #show the best fitting line
summary(lm(ACT ~ SATV, data=sat.act)) #summarize the results
```

But typically, we have more predictors.  Although showing the individual regressions is possible (and we do as a two panel graph) we can also summarize them graphically by a path model. (As we did earlier).  When doing regressions, the slopes are in the units of the predictor (e.g. SATV), and are sometimes hard to interpret.  Thus, we can express the slopes as standardized values to make them more understandable.

```{r}
par(mfrow=c(1,2))  #create a two panel graph
plot(ACT ~ SATV, data= sat.act)  #show the points
abline(lm(ACT ~ SATV, data=sat.act)) 
plot(ACT ~ SATQ, data= sat.act)  #show the points
abline(lm(ACT ~ SATQ, data=sat.act))
#do another 2 panel graph    We supress the actual output
mod3 <- setCor(ACT ~ SATV + SATQ, data=sat.act,main="Standardized (Correlations)")
mod4 <- setCor(ACT ~ SATV + SATQ, data=sat.act, std=FALSE,main="unstandardized")
par(mfrow=c(1,1)) #convert back to the normal 1 panel graph
summary(lm(ACT ~ SATV + SATQ, data=sat.act))
```
#Moderated multiple regression

An important question is whether one variable affects the relationship between another IV and DV.  In analysis of variance terms, this is asking if there is an interaction.  The standard anova approach just does this by including the interaction term (shown as a product in the formula).  But when we try this with regression, things get more complicated.

##The standard anova with an interation term

```{r}
mod2 <- aov(yield ~ N * P * K,data = npk)
summary(mod2)
```
But when do this as a linear model, it doesn't match! Compare the p values.  Only the 3 way interaction is correct.
```{r}
summary(lm(yield ~ N * P * K,data = npk))
```

###Show the problem in terms of product terms

The problem is that the product terms are correlated with the main effects.  We can show this by forming the product terms. Although tedious, it is possible to find the 4 product terms.  We do this, and then put the results into a data.frame which we when plot.  Note how the product terms are correlated with their elements.  This makes sense, but is wrong if we want to do the modeling.

First we create a new data.frame (NPK) which is made up of the *numeric* equivalents of what was in npk.

```{r}
NPK <- char2numeric(npk)
describe(npk)  #the original data are not technically numeric, they are factors
describe(NPK)  #we have converted them to numeric
```
Now show the correlations of the product terms and the main effects.  We use the `gap` parameter in `pairs.panels` to show a slightly more compact graphic.

```{r}
 NP <- NPK$N * NPK$P
 NK <- NPK$N * NPK$K
 PK <- NPK$P * NPK$K
 PKN <- PK * NPK$N
 NPK.prods <- data.frame(NPK,NP,NK,PK,PKN)
pairs.panels(NPK.prods,gap=0)  #show the correlations, tighten up the figure
```

The simple products are correlated with each other, and with the basic factors.  To get around this problem, we *center* the data using the `scale` function.  `scale` will center and if we set scale=TRUE it will also standardize (divide by the standard deviation).  The default is to standardize, so we specify that it is false.  We also need to convert the output of the `scale` function from a *matrix* to a *data.frame*.

```{r}
 centered.NPK <- scale(NPK,scale=FALSE)
 centered.NPK <- data.frame(scale(NPK,scale=FALSE))
 c.NP <- centered.NPK$N * centered.NPK$P
 c.NK <- centered.NPK$N * centered.NPK$K
 c.PK <- centered.NPK$P * centered.NPK$K
 c.PKN <- c.PK * centered.NPK$N
 center.prod <- data.frame(centered.NPK,c.NP,c.NK,c.PK,c.PKN)
 pairs.panels(center.prod) #show the results grapically
 lowerCor(center.prod)  #show it a correlation matrix as well
 
```
Note the products are now uncorrelated. 
If we now use the linear model function, it agrees with our anova output.

```{r}
lm.mod <- lm(yield ~ N * P * K, data = centered.NPK) #do the analysis on the centered data
summary(lm.mod)  #show the moderated multiple regression
summary(mod2) #compare with the original anova output
```
## Now do this with continuous x1 and a dichotomous x2 and graph the effect

Once again, to the analysis correctly, we need to center the data.


```{r}
cen.sat.act <- data.frame(scale(sat.act,scale=FALSE))
lm.mod2 <- lm(ACT ~ gender * SATV , data= cen.sat.act)
summary(lm.mod2)
```
 
There is a suggestion of an interaction between gender and SATV.  To see this graphically, we draw the data (using a different plot character for males and females)  and then draw two lines, one for males, and one for females.  The interaction is a difference in the slopes of these regressions.

```{r}
plot(ACT ~ SATV, pch= 21-gender, col=c("blue","red")[gender],data=sat.act,main="ACT scores differ by SATV and gender")
abline(lm(ACT ~ SATV,data = sat.act[sat.act$gender==1,]))
abline(lm(ACT ~ SATV,data = sat.act[sat.act$gender==2,]),lty="dashed")

# or another way of doing this is with the by statment
plot(ACT ~ SATV, pch= 21-gender, col=c("blue","red")[gender],data=sat.act,main="ACT scores differ by SATV and gender")
by(sat.act,sat.act$gender, function(x) abline(lm(ACT ~ SATV, data=x),lty=c("solid","dashed")[x$gender]))
by(sat.act,sat.act$gender,function(x) lm(ACT ~ SATV, data=x))  #this tells us which is which
text(220,17,"males")
text(220,22,"females")
```

#Some adventures in coding  -- *dummy.code*

Sometimes we want to take a categorical variable in which the categories are arbitrary (e.g., college major, ways of coping) and create a *dummy code* that has the value of 1 for people in that category and 0 for everyone else.  Clearly we need at least 3 categories for this to make sense. 

For following, we will take the `education` variable from the `spi` data set, convert the education levels into 0/1 dummy codes, and then find those items that most correlate with each level of education.

## But first, lets examine the `dummy.code` function

```{r}
dummy.code  #show the code for the function
```

We can examine each line of this code by trying it.  

```{r}
x <- spi$education
na.rm <- TRUE
t <- table(x)  # Call the table function on x
t   # we do not do this in the function, but we do it here to examine it
lt <- length(t)   # how many different categories
```
Then we create where we want to put the results
```{r}
 n.obs <- length(x)
 new <- matrix(0,nrow=n.obs,ncol=lt)
  if(na.rm) {new[is.na(x),] <- NA } #added 10/20/17
 headTail(new) #show the first  and last 4 rows
 xlev <- as.factor(x)   #show the str of this
 str(xlev)  #because we will be using these to index the matrix
 for (i in 1:n.obs) {new[i,xlev[i]] <- 1}    #Do the work
  headTail(new) #show the first  and last 4 rows
  colnames(new) <- names(t)    #if we have names for the categories, they appear here
```

Now try this function on `spi["education"]

What are the units of education?  We can see them by looking at the help file for the *spi* data set.
Respondents education: less than 12, HS grad, current univ, some univ, associate degree, college degree, in grad/prof, grad/prof degree


```{r}
edu.codes <- dummy.code(spi$education)
dim(edu.codes)  #we have created the new variables
describe(edu.codes)   #the means should add up to 1 across all the variables
```

##Empirical scale construction 
Another function, `bestScales` can use those codes to find which items most correlate with each category.  This function finds the items that most correlate with the criterion.

```{r}
bs.edu <- bestScales(spi[-6],edu.codes,dictionary=spi.dictionary)  #don't use education as a predictor
bs.edu
```

##The need to *boot strap* the solution
The solution just listed is sensitive to sample variations.  We can test how sensitive it is by using the *boot strap*.  Boot strapping is basically taking the observed sample as the population and then sampling from that population multiple times (with replacement).  This leads to a numer of samples from the 'population' that differ to some amount.  By examining how much they vary we can get a degree of confidence in our results.  We will do the bootstrap 100 times and then examine the pooled results.

```{r}
bs.edu <- bestScales(spi[-6],edu.codes,dictionary=spi.dictionary[1:2], n.iter=100)
bs.edu
```