--- title: "350 t and F tests and the linear model" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) options(width=110) ``` # Exploring distributions Before Gosset (1908) the primary means of testing for differences of means was to use the *z-test*. This would compare differences to what you would expect given normal theory. Unfortunately, this required at least 30 subjects to be reasonably precise. The reason for this is that the *standard error* was based upon the observed *sample standard deviation* which will be smaller than the population value. We can show this problem by comparing the population standard deviation with that of a series of samples of varying size. (We are showing what Gosset derived. ) We will first generate a random distribution and then take samples from this of varying size. We examine the standard deviation of these samples. ```{r} set.seed(42) #set a reproducible seed for our random numbers. Population <- runif(10000) #a uniform distribution hist(Population) #show what it looks like. library(psych) library(psychTools) describe(Population) ``` ## First lets show the central limit theorem One of the most amazing statistical results is known as the *central limit theorem*. For any distribution with finite variance, if we take progressively bigger samples from the population, the distribution of the *means* of those samples will tend towards the *normal* distribution. (The CLT does not apply to distributions with infinite variance. ) What follows is a very simple way of showing the CLT. Note that we define the limits of the x-axis to be the same in each figure. We take 1000 samples each of size 2, 4, 8, 16 and 32 and then plot them. This is an example of not very clean code, but simple to understand. ```{r} x2 <- (runif(1000)+ runif(1000))/2 # mean of two samples x4 <- (runif(1000)+ runif(1000) +runif(1000)+ runif(1000))/4 # mean of four samples x8 <- (runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+runif(1000)+ runif(1000) +runif(1000)+ runif(1000))/8 x16 <- (runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+ runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+runif(1000)+ runif(1000) +runif(1000)+ runif(1000))/16 x32 <- (runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+ runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+ runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+ runif(1000)+ runif(1000) +runif(1000)+ runif(1000)+runif(1000)+ runif(1000) +runif(1000)+ runif(1000))/32 hist(x2,xlim=c(0,1)) hist(x4, xlim=c(0,1)) hist(x8,xlim=c(0,1)) hist(x16,xlim=c(0,1)) hist(x32,xlim=c(0,1)) #combine them all into one data frame central.limit <- data.frame(one=Population, two=x2,four=x4,eight=x8,sixteen=x16,thirtytwo=x32) describe(central.limit) ``` Note that the mean does not change as the sample size increases, but the standard deviation decreases. Four times the sample size leads to half the standard deviation. ## An improved demonstration using R The above code is rather clunky. It will do the job but is not easy to use. Lets consider some more useful code. This code has more flexibility. 
## An improved demonstration using R

The above code is rather clunky. It will do the job but is not easy to reuse. Let's consider some more useful code with more flexibility. It has three parts:

1. The first part specifies the parameters we will use.
2. The next part runs a *loop* to do something multiple times.
3. The final part presents the results.

```{r aloop}
prob.size <- c(2,4,8,16,32,64) #create a vector describing the problem (sample) sizes
n.obs <- 1000 #specify the number of samples to simulate for each size
#create two matrices to store the results. We do this dynamically to be flexible
results.mean <- matrix(NA, ncol=length(prob.size), nrow=n.obs)
results.sd <- results.mean #this just sets up the place to store the results
#now run a loop
for (k in 1:length(prob.size)) { #do this once for each problem size
  n <- prob.size[k] #the sample size for this problem
  temp <- matrix(runif(n * n.obs), ncol=n) #create an n.obs x n matrix of random numbers
  results.mean[,k] <- apply(temp, 1, mean) #find the mean of each row (each sample)
  results.sd[,k] <- apply(temp, 1, sd) #and the standard deviation of each row
}
colnames(results.mean) <- colnames(results.sd) <- paste("N=", prob.size)
describe(results.mean)
describe(results.sd)
multi.hist(results.mean)
multi.hist(results.sd)
```

What do these results show? The mean of the samples does not change as the sample size goes up, but the standard deviation of the sample means goes down with the square root of the sample size (compare N=2 to N=8, or N=4 to N=16: four times the sample size, half the standard deviation). A more subtle change is that the average sample standard deviation (shown as the means of *results.sd*) gradually increases as the sample size increases, tending towards the population value (.29). It is also clear that the distribution of sample standard deviations is very noisy for small samples. Thus the sample standard deviation, on average, underestimates the population standard deviation, and any test comparing means that uses the sample standard deviation needs to adjust for the size of the sample. This was the realization of William Gosset when he developed the *t-test*.

## Creating a little function for this demonstration

Perhaps even more convenient is to take the *script* we had before and turn it into a little function. This allows us to call the function without worrying about the various variable names inside the function.

1. To create the function, we give it a name (in this case `central.limit`),
2. and the list of parameters to pass to the function (the number of observations and the problem sizes).
3. These parameters are the default values if you do not specify your own.
4. The function then is defined by everything between the { and } brackets.

```{r afunction}
central.limit <- function(n.obs=1000, prob.size=c(1,2,4,8,16,32)) { #the beginning of the function
 #create two matrices to store the results. We do this dynamically to be flexible
 results.mean <- matrix(NA, ncol=length(prob.size), nrow=n.obs)
 results.sd <- results.mean
 #now run a loop
 for (k in 1:length(prob.size)) {
   i <- prob.size[k]
   temp <- matrix(runif(i * n.obs), ncol=i)
   results.mean[,k] <- apply(temp, 1, mean)
   results.sd[,k] <- apply(temp, 1, sd)
 }
 colnames(results.mean) <- colnames(results.sd) <- paste("N=", prob.size)
 results <- list(mean=results.mean, sd=results.sd) #we save the results in a list with two elements
 return(results)
} #this is the last line of the function
```

## Now run our function

This time we specify different parameters than the defaults.

```{r}
prob.size <- c(1,2,3,4,5,6,7,8,9,10)
test <- central.limit(n.obs=10000, prob.size=prob.size) #we change the default values
names(test) #what are the objects returned by the function?
```
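Before plotting the observed bias, it helps to know how big we should expect it to be. For samples from a *normal* distribution, the expected sample standard deviation is $c_4 \sigma$, where $c_4 = \sqrt{2/(n-1)}\;\Gamma(n/2)/\Gamma((n-1)/2)$. Our population here is uniform rather than normal, so this is only a rough guide, but it shows the same shrinkage for small $n$. A minimal sketch:

```{r}
#c4(n): the expected sample SD as a proportion of the population SD,
#assuming a normal population (only an approximation for our uniform case)
c4 <- function(n) sqrt(2/(n-1)) * gamma(n/2)/gamma((n-1)/2)
round(rbind(n=2:10, c4=c4(2:10)), 3)
```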
Now plot the means of the observed sample standard deviations:

```{r}
d.sd <- describe(test$sd) #describe the standard deviations for each sample size
plot(d.sd$mean, typ="b", main="Sample SD is biased downwards")
```

But that graph does not show the sample sizes as clearly as it should. Try to make a clearer figure by specifying the x and y labels.

```{r}
sd.df <- data.frame(size=prob.size, sd=d.sd$mean)
with(sd.df, plot(sd ~ size, typ="b", xlab="sample size", ylab="sample sd",
     main="Sample SD is biased downwards"))
```

## Run the function again, with the default values

```{r}
another.test <- central.limit() #we take the default values
dim(another.test) #NULL, because the result is a list, not a matrix
names(another.test)
describe(another.test$mean)
describe(another.test$sd)
```

# Another demonstration of central limit - binomial as samples increase

The binomial distribution counts the number of heads when flipping coins. If we flip a single coin 1,000 times and plot the outcomes, roughly half will be 0 and the other half 1. But what if we plot the distribution of flipping two coins 1,000 times? What about 4, 8, 16, or 32 coins? Let's do this by simulation.

```{r}
n.obs <- 10000
size <- c(1,2,4,8,16,32)
x <- matrix(NA, nrow=n.obs, ncol=length(size))
for(i in 1:length(size)) {
  x[,i] <- rbinom(n.obs, size[i], prob=.5)}
multi.hist(x)
```

As the number of coins being flipped goes up, the total (or, equivalently, the average) number of heads becomes more normally distributed.

# Gosset and the `t-distribution`

Gosset recognized that for small samples the sample standard deviation underestimates the population value (see above), and thus that the ratio of a mean to its estimated standard error has heavier tails than the normal distribution. He developed a distribution (the `t-distribution`) to account for this. Given that his employer did not want to encourage others to realize how powerful this statistic was, he published anonymously under the name of `Student`. And thus we have `Student's t distribution`.

We can compare the *t distribution* with the *normal distribution* using the built-in density functions. We use the *curve* function, adding each t distribution to the first plot.

```{r}
curve(dnorm(x), -3, 3, main="Comparing the normal to the t distribution (varying df)", ylab="density")
curve(dt(x, df=2), -3, 3, lty="dotted", add=TRUE)
curve(dt(x, df=3), -3, 3, lty=4, add=TRUE)
curve(dt(x, df=4), -3, 3, lty="dashed", add=TRUE)
curve(dt(x, df=8), -3, 3, lty=3, add=TRUE)
curve(dt(x, df=16), -3, 3, lty=5, add=TRUE)
```
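The practical consequence of those heavier tails is that small samples need wider confidence intervals. We can see this by comparing the 97.5% quantile of the t distribution (the multiplier for a 95% confidence interval) across degrees of freedom with the normal value of 1.96. A minimal sketch using the built-in quantile functions:

```{r}
#critical values for a 95% confidence interval
df <- c(2, 3, 4, 8, 16, 32, 100)
round(rbind(df=df, t.crit=qt(.975, df), z.crit=rep(qnorm(.975), length(df))), 3)
```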
# The t-test

The t-test, perhaps the fundamental statistical test, was developed by William Gosset, who published under the name of Student (1908). The t-test is the small-sample analogue of the large-sample (z) test. His first example was a data set on the differing effects of the optical isomers of hyoscyamine hydrobromide reported by Cushny & Peebles (1905). The sleep of 10 patients was measured without any drug and then following administration of the D and L isomers. These data are available as the cushny data set. Before we discuss the t-test, we consider the *bootstrap* as an alternative.

```{r}
library(psychTools)
cushny
```

Note that the last 3 columns are just the differences from the control condition.

## Descriptive statistics of the Cushny data

```{r}
describe(cushny)
```

It is hard to see patterns in simple descriptives, but a graphical display will help.

## A graphical display of the data

We can show these data graphically using the `error.bars` function. We pass labels to the x and y axes using the *xlab* and *ylab* parameters, and then supply an appropriate figure title (*main*). We will use these data to show how to do t-tests as well as the generalization to Analysis of Variance. First we show this the traditional way in terms of error bars, and then we add in `cats eyes` to show the range of confidence.

```{r}
error.bars(cushny, xlab="Group", ylab="hours of sleep",
   main="The effect of drug upon sleep (95% confidence)", eyes=FALSE)
```

```{r}
error.bars(cushny, xlab="Group", ylab="hours of sleep",
   main="The effect of drug upon sleep (95% confidence)", col=rainbow(8)) #the default is to show cats eyes
```

# Bootstrap resampling

Gosset/Student developed the t-test and derived the t-distribution analytically. We can instead ask what happens if we draw many random samples from the data themselves. We treat the observed data as our *population* and then sample from that population. We sample *with replacement*, because otherwise we would always end up with exactly the same data. This involves using the `sample` function to take random samples from the original data. As an example, consider sampling from the numbers 1-10. As usual, because we want to see and discuss the results, we set the random seed to a specific value (I tend to use 42 so that I can cite Adams (1979)). In a production run, do not set the seed; R then seeds the random number generator itself, based on the system clock.

```{r}
set.seed(42) #to get a random ordering that can be used for teaching purposes
original <- 1:10
sample(original, 10) #don't specify replacement and you get a permutation
sample(original, 10, replace=TRUE) #notice the repetition of 10
sample(original, 10, replace=TRUE) # which happened again!
sample(original, 10, replace=TRUE) # now we get a repetition of 1 and 7
```

## Bootstrapping the cushny data

Now use this trick of sampling to choose subjects, find the mean of each column for each sample, and then show those means. We create a baby function. This is understandable, but not elegant.

```{r smallfunction}
small <- function(data, n.trials=100) { #name the function, with a default of 100 trials
 subject <- 1:NROW(data) #dynamically set the number of subjects
 results <- matrix(NA, ncol=NCOL(data), nrow=n.trials) #get it ready to store the results
 for (i in 1:n.trials) { #we are going to execute a loop
   new.order <- sample(subject, NROW(data), replace=TRUE) #find a new (bootstrap) sample of subjects
   results[i,] <- colMeans(data[new.order,]) #find the column means for each variable
 } #end of loop
 colnames(results) <- colnames(data) #name the results
 return(results)
} #end of small function

#now run the function
temp <- small(cushny, 1000) #specify that we want 1000 samples
error.bars(temp, sd=TRUE, eyes=FALSE) #draw the results
describe(temp, quant=c(.025, .5, .975)) #show the top and bottom 2.5%
```

Note how the top and bottom 2.5% quantiles of the differences either do (delta1) or do not (delta2L and delta2R) include 0. We conclude that it is unlikely that the latter two differences are equal to 0. Also note that the 95% confidence limits for the second drug (drug2L and drug2R) do not overlap with those of the control.
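The 2.5% and 97.5% quantiles of these bootstrapped means form what is known as a *percentile bootstrap confidence interval*. We can pull them directly from the `temp` matrix with the base R `quantile` function. A minimal sketch (`apply` runs it over each column):

```{r}
#95% percentile bootstrap confidence intervals, one row per variable
t(apply(temp, 2, quantile, probs=c(.025, .975)))
```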
# Student's t.test as done by Student

Unfortunately, Student did not have R. Therefore he developed a test (comparing the means to their standard errors) and the distribution of that test (the *t-distribution*). The `t.test` function is part of core R. It may be used in several different ways.

```{r}
#this first approach uses the with command, which allows us to specify just the variable name
with(cushny, t.test(delta1)) #control versus drug 1 difference scores
with(cushny, t.test(delta2L)) #control versus drug2L difference scores
with(cushny, t.test(delta1, delta2L, paired=TRUE)) #difference of differences
#or, we can just do it directly by specifying the object and the variable name
t.test(cushny$delta1)
t.test(cushny$delta2L)
```

## The previous error bars do not reflect the correlated nature of the data

```{r}
lowerCor(cushny[1:4]) #show the correlations
#plot the error bars, correcting for the correlations
error.bars(cushny[1:4], xlab="Group", ylab="hours of sleep",
   main="The effect of drug upon sleep (95% confidence)", within=TRUE, col=rainbow(4))
```

### Show this as a two panel graph, without cat's eyes

We can compare these two approaches in one two-panel graph. We specify the number of rows and columns using the *par(mfrow=c(r,c))* command.

```{r}
op <- par(mfrow=c(1,2)) #one row, two columns
error.bars(cushny[1:4], xlab="Group", ylab="hours of sleep",
   main="Not repeated measures", eyes=FALSE, ylim=c(1,8))
error.bars(cushny[1:4], xlab="Group", ylab="hours of sleep", eyes=FALSE,
   main="repeated measures", within=TRUE, ylim=c(1,8))
par(op) #set the graphics parameters back to the default
title("Effect of drug upon sleep", line=3) #add a general title
```

```{r}
error.bars(cushny[1:4], xlab="Group", ylab="hours of sleep",
   main="Not repeated measures", eyes=FALSE, ylim=c(1,8))
error.bars(cushny[1:4], xlab="Group", ylab="hours of sleep", eyes=FALSE,
   main="repeated measures", within=TRUE, ylim=c(1,8), add=TRUE, arrow.col="blue")
title("Effect of drug upon sleep", line=3) #add a general title
```

## Reorganization of the data frame: from wide to long

We can take the *wide* format of the cushny data set and make it *long*. The long format is used in a variety of statistical tests. The *stack* command will do this for us.

```{r}
long.sleep <- stack(cushny[c("delta1", "delta2L")])
long.sleep #display the long format
```

### Now do the t-test on these data

(The paired examples below are adapted from the `t.test` help page.)

```{r}
t.test(values ~ ind, data=long.sleep) #ignoring the paired nature of the data

## Paired t-test
## The sleep data is actually paired, so could have been in wide format:
sleep2 <- reshape(sleep, direction = "wide", idvar = "ID", timevar = "group")

## Traditional interface
t.test(sleep2$extra.1, sleep2$extra.2, paired = TRUE)

## Formula interface
t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)
```

## The Cushny data are also available in the `sleep` data set

This is perhaps an obvious way of organizing the data (in long format).

```{r}
sleep
```

### Do the t.test again, but on the sleep data

```{r}
with(sleep, t.test(extra ~ group)) #does not assume equal variances
with(sleep, t.test(extra ~ group, var.equal=TRUE)) #assume equal variances
# with(sleep, t.test(extra ~ group, paired=TRUE)) #noticing that the data were paired
```
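Before moving on, it is worth verifying what `t.test` computes. For the one-sample (or difference score) case, $t = \bar{X}/(s/\sqrt{n})$ with $n - 1$ degrees of freedom. A minimal sketch, reproducing the `t.test(cushny$delta1)` result by hand:

```{r}
#the one-sample t-test by hand: the mean divided by its standard error
d1 <- cushny$delta1
n <- length(d1)
t.stat <- mean(d1)/(sd(d1)/sqrt(n)) #the t statistic
p.value <- 2 * pt(-abs(t.stat), df=n-1) #two-tailed probability
round(c(t=t.stat, df=n-1, p=p.value), 4)
```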
# Analysis of Variance (aov)

Analysis of variance is a special case of the linear model and a generalization of the t-test. It was developed as a way of analyzing agricultural data (hence the terminology of such things as a split-plot design): one could analyze multiple seeds crossed with multiple fertilizers. In psychology it is particularly applied to experimental designs where two or more variables are experimentally "crossed". It is important to understand, although most modern approaches generalize it to the linear model.

Rather than directly comparing means (the way the t-test does), it compares the variance of the group means to the variance within groups. The terminology is a decomposition of the *Sums of Squares* (SS) into Sums of Squares between groups and Sums of Squares within groups. When divided by their *degrees of freedom*, Sums of Squares become *Mean Squares*, which are estimates of variances. The ratio of these two variances, $F = MS_{between}/MS_{within}$, came to be called the F statistic (in honor of Ronald Fisher, who generalized the t-test to the case of more than two groups).

Some notes on `aov` (from its help page):

1. `aov` provides a wrapper to `lm` for fitting linear models to balanced or unbalanced experimental designs.
2. The main difference from `lm` is in the way print, summary and so on handle the fit: this is expressed in the traditional language of the analysis of variance rather than that of linear models.
3. If the formula contains a single Error term, this is used to specify error strata, and appropriate models are fitted within each error stratum.
4. The formula can specify multiple responses.
5. `aov` is designed for balanced designs, and the results can be hard to interpret without balance: beware that missing values in the response(s) will likely lose the balance.
6. If there are two or more error strata, the methods used are statistically inefficient without balance, and it may be better to use `lme` in package *nlme*.

## For the sleep data set, compare with the t.test results

```{r}
#independent subjects
summary(aov(extra ~ group, data=sleep))
#the equivalent t.test
t.test(extra ~ group, var.equal=TRUE, data=sleep)
#correlated subjects
summary(aov(extra ~ group + Error(ID), data=sleep))
#and the associated t.test
# t.test(extra ~ group, data=sleep, paired=TRUE)
```

## The *lm* function is the linear model. Compare this to the aov results

```{r}
model <- lm(extra ~ group, data=sleep)
model #reports some of the statistics
summary(model) #a more common way of describing the results
```

### Note that neither lm nor the simple aov treated the repeated measures quality of the data

The basic *aov* and *lm* analyses did not treat the repeated measures structure of the data appropriately. This can be done with the Error term in *aov* (repeated measures), or with *lme* for the linear model.
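For two groups, the pooled-variance t-test and the one-way analysis of variance are the same test: $F = t^2$. A quick check on the sleep data (a minimal sketch, assuming the chunks above have been run):

```{r}
#the squared pooled-variance t statistic equals the one-way F
t2 <- unname(t.test(extra ~ group, var.equal=TRUE, data=sleep)$statistic^2)
F.ratio <- summary(aov(extra ~ group, data=sleep))[[1]][1, "F value"]
round(c(t.squared=t2, F=F.ratio), 3)
```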
## A multifactorial design

(Taken from the help page for `aov`.) Consider yield (remember, this was developed in agriculture) as a function of Nitrogen (N), Phosphate (P), and Potassium (K). What is the effect of each variable separately and in combination?

```{r}
headTail(npk) #show a little bit of the data
pairs.panels(npk) #a good experiment makes the experimental conditions orthogonal
npk #the data come from Venables
mod1 <- aov(yield ~ N, data=npk)
mod2 <- aov(yield ~ N + P + N*P, data=npk)
mod2a <- aov(yield ~ N*P, data=npk) #N*P expands to N + P + N:P, so this is the same model as mod2
mod3 <- aov(yield ~ N*P*K, data=npk)
mod4 <- aov(yield ~ block + N*P*K, data=npk)
summary(mod1) #just the effect of N
summary(mod3) #don't include the block variable
summary(mod4) #the fully crossed design, with blocks
```

## lm treats the blocking variable somewhat differently

```{r}
mod5 <- lm(yield ~ N*P*K, data=npk)
summary(mod5)
```

More importantly, with the default (dummy) coding of the factors, the interaction terms affect the estimates of the main effects. We need to zero center (mean center) the predictors. To do this we convert the npk data into numeric values and then center them.

```{r}
npk.nolevels <- levels2numeric(npk) #levels2numeric is from psychTools
npk.cen <- data.frame(scale(npk.nolevels, scale=FALSE)) #scale(..., scale=FALSE) just mean centers
#now, do the lm
mod6 <- lm(yield ~ N*P*K, data=npk.cen)
summary(mod6) #compare this to mod3
```

## Another aov example

`aov` is designed for balanced designs, and the results can be hard to interpret without balance: beware that missing values in the response(s) will likely lose the balance. We take an example where the data are on our web server.

```{r}
datafilename <- "http://personality-project.org/r/datasets/R.appendix2.data"
data.ex2 <- read.file(datafilename) #read the data into a data.frame
data.ex2 #show the data
#do the analysis of variance
aov.ex2 <- aov(Alertness ~ Gender*Dosage, data=data.ex2)
summary(aov.ex2) #and summarize it
aov.ex2 #printing the aov object shows the terms and their sums of squares
```

## But typically, we want to show the means

```{r}
print(model.tables(aov.ex2, "means"), digits=3) #show the summary table of means
```

## Within subjects

1. Somewhat more complicated, because we need to convert "wide" data frames to "long" or "narrow" data frames.
2. This can be done by using the stack function. Some data sets are already in the long format.
3. A detailed discussion of how to work with repeated measures designs is at http://personality-project.org/r/r..html and at http://personality-project.org/r
4. See also the tutorial by Jason French at http://jason-french.com/tutorials/repeatedmeasures.html
5. Many within subject designs can be treated as multi-level designs. For a discussion of analyzing multilevel data (particularly for personality dynamics), see http://personality-project.org/revelle/publications/rw.paid.17.final.pdf

### Analysis of variance within subjects: Getting and showing the data

```{r}
filename <- "http://personality-project.org/r/datasets/R.appendix5.data"
data.ex5 <- read.file(filename) #read the data into a data.frame
headTail(data.ex5, 6, 12) #show the data (first 6, last 12)
describe(data.ex5)
pairs.panels(data.ex5) #this shows that the design is balanced
```

## And do the within subjects analysis

```{r}
filename <- "http://personality-project.org/r/datasets/R.appendix5.data"
data.ex5 <- read.table(filename, header=TRUE) #read the data into a data.frame
#do the aov, with Task and Valence nested within Subject
aov.ex5 <- aov(Recall ~ (Task*Valence*Gender*Dosage) + Error(Subject/(Task*Valence)) +
    (Gender*Dosage), data=data.ex5)
#look at the output
summary(aov.ex5)
```

## Actually, we could have done all of this from the linear model (coming Wednesday)
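As a preview: the within-subject analysis above can also be run as a multilevel (mixed effects) model. A minimal sketch using `lme` from the *nlme* package (mentioned earlier). Note that a single random intercept for Subject is a simpler error structure than the `Error(Subject/(Task*Valence))` strata used by `aov`, so the two sets of tests will not agree exactly:

```{r}
library(nlme)
#the same fixed effects, with a random intercept for each subject
mod.lme <- lme(Recall ~ Task*Valence*Gender*Dosage, random = ~1|Subject, data=data.ex5)
anova(mod.lme) #F tests for the fixed effects
```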