Statistics in Research Methods: Using R

This is a very brief guide to help students in a research methods course make use of the R statistical language to analyze some of the data they have collected. For more extensive tutorials of R in psychology, see my short and somewhat longer tutorials as well as the much more developed tutorial by Jonathan Baron and Yuelin Li. (Note that the blue text may be directly copied into R)

9 simple steps

Install R on your computer or go to a machine that has it
Enter your data using a text editor and save as a text file (perhaps comma delimited if using Excel). Note that the computer simulation we did has already stored it in a readable format.
Start R, read (or paste) the data file
Find basic descriptive statistics (e.g., means, standard deviations, minimum and maxima)
Prepare a simple descriptive graph (e.g, a box plot) of your variables
Find the correlation matrix to give an overview of relationships
If you have categorical experimental variables, you can do an Analysis of Variance (ANOVA)
If you have a continuous independent variable, do the appropriate multiple regression using standardized scores.
Graph the results (advanced)

Getting started

Installing R on your computer

Many labs have R already installed. However, to do analyses on your own machine, you will need to download the R program from the R project and install it yourself. Go to the R home page and then choose the Download from CRAN (Comprehensive R Archive Network) option. This will take you to list of mirror sites around the world. A convenient mirror for those of us at Northwestern is at UCB: http://http://cran.cnr.berkeley.edu/ You may download the Windows, Linux, or Mac versions at this site. Installation differs across platforms but, at least for a Mac, is very easy.

Run R with a text editor to save your commands for later use. Or, use the edit window in R to write out your commands as you work. Save this file for later use.

R is case sensitive and does not give overly useful diagnostic messages. If you get an error message, don't be flustered but rather be patient and try the command again using the correct spelling for the command.

Help and Guidance

When in doubt, use the help () function. This is identical to the ? function where function is what you want to know about. e.g.,


? read.table    #ask for help in using the read.table function  -- see the answer in its own window
#or 
help (read.table) #another way of asking for help.  - see the window

Useful additions to R

One of the advantages of R is that it can be supplemented with additional programs that are included as "packages" using the package manager. (e.g., sem does structural equation modeling). A package of procedures (psych) that I have found most can be installed using the "Package Installer" and "Package Manager". On the Mac, these are both options found under the Package and Data Menu.

The psych package may be downloaded from CRAN either by using the package installer menu option or by directly installing it:

 

install.packages("psych")

Entering the data

For most data analysis, rather than manually enter the data into R, it is probably more convenient to use a spreadsheet (e.g., Excel or OpenOffice) as a data editor, save as a tab or comma delimited file, and then read the data from that file or read from the clipboard using the read.clipboard() command. (found in the psych package).

For analysing the data from the computer simulation, the data are already in a readable form, with labels for the variables in the first row. If you have run multiple experiments, there are some comment lines that need to be removed before reading into R. Most of the examples in this tutorial assume that the data have been entered this way. Many of the examples in the help menus have small data sets entered using the c() command or created on the fly.

For the example, we read data from the local computer for 100 cases of simulated data with 5 independent variables and 3 dependent variables. In addition, the simulation has already done some recoding, converting two of the independent variables into 2 (IMP_2) or 3 (TOD_3) categories.

To read a file from your local machine, change the datafilename to specify the path to the data. A simple trick on the Mac is to store the datafile on your desktop. Thus the path is "Desktop/myfilename".

If you have opened the file using a text editor or Excel, you can just copy the appropriate rows into the clipboard and then use the "read.clipboard()" command.




#If I want to read a datafile from my desktop
datafilename <- "Desktop/205 Simulations Stuff/Simulation.Data" 

#now read the data file
sim.data  <- read.table(datafilename,header=TRUE)  #read the data file

#Alternatively, to read in a comma delimited file:
#my.data  <- read.table(datafilename,header=TRUE,sep=",")  

#alternatively, copy the data into the clipboard and
sim.data <- read.clipboard()



names(sim.data)  #list the names of the variables
attach(sim.data)  #makes the variables available for later analysis



#results in this output

> datafilename <- "Desktop/Simulation.Data" 
> sim.data  <- read.table(datafilename,header=TRUE)  #read the data file
> names(sim.data)  #list the names of the variables
 [1] "trials"      "drug"        "time"        "anxiety"     "impulsivity" "sex"         "arousal"     "tension"    
 [9] "performance" "IMP_2"       "TOD_3"      
> attach(sim.data)  #makes the variables available for later analysis

V dim(sim.data)     #how many variables and how many subjects?
[1] 100  11

Descriptive statistics

If you first make the psych package active, then we can describe the data:


library(psych)   #you only need to do this once.

describe(sim.data)
             mean    sd  skew   n median   se
trials      50.50 29.01  0.00 100   50.5 2.90
drug         0.44  0.50  0.24 100    0.0 0.05
time        15.19  4.47 -0.08 100   15.0 0.45
anxiety      5.42  1.95 -0.24 100    5.0 0.20
impulsivity  5.03  2.14  0.18 100    5.0 0.21
sex          1.46  0.50  0.16 100    1.0 0.05
arousal     64.29  7.21 -0.51 100   66.0 0.72
tension     58.68  7.16 -0.28 100   59.0 0.72
performance 80.00 12.23 -0.85 100   83.0 1.22
IMP_2        1.53  0.50 -0.12 100    2.0 0.05
TOD_3        2.12  0.77 -0.20 100    2.0 0.08


alternatively, we can simply summarize them:

summary(sim.data)
     trials            drug           time          anxiety       impulsivity         sex          arousal         tension     
 Min.   :  1.00   Min.   :0.00   Min.   : 8.00   Min.   : 0.00   Min.   : 0.00   Min.   :1.00   Min.   :48.00   Min.   :40.00  
 1st Qu.: 25.75   1st Qu.:0.00   1st Qu.:12.00   1st Qu.: 4.00   1st Qu.: 4.00   1st Qu.:1.00   1st Qu.:58.75   1st Qu.:54.00  
 Median : 50.50   Median :0.00   Median :15.00   Median : 5.00   Median : 5.00   Median :1.00   Median :66.00   Median :59.00  
 Mean   : 50.50   Mean   :0.44   Mean   :15.19   Mean   : 5.42   Mean   : 5.03   Mean   :1.46   Mean   :64.29   Mean   :58.68  
 3rd Qu.: 75.25   3rd Qu.:1.00   3rd Qu.:19.00   3rd Qu.: 7.00   3rd Qu.: 7.00   3rd Qu.:2.00   3rd Qu.:70.00   3rd Qu.:63.00  
 Max.   :100.00   Max.   :1.00   Max.   :22.00   Max.   :10.00   Max.   :10.00   Max.   :2.00   Max.   :75.00   Max.   :74.00  
  performance      IMP_2          TOD_3     
 Min.   : 45   Min.   :1.00   Min.   :1.00  
 1st Qu.: 73   1st Qu.:1.00   1st Qu.:2.00  
 Median : 83   Median :2.00   Median :2.00  
 Mean   : 80   Mean   :1.53   Mean   :2.12  
 3rd Qu.: 88   3rd Qu.:2.00   3rd Qu.:3.00  
 Max.   :100   Max.   :2.00   Max.   :3.00  
 
 
 
> summary(sim.data[,2:5])    #summarize a subset 
      drug           time          anxiety       impulsivity   
 Min.   :0.00   Min.   : 8.00   Min.   : 0.00   Min.   : 0.00  
 1st Qu.:0.00   1st Qu.:12.00   1st Qu.: 4.00   1st Qu.: 4.00  
 Median :0.00   Median :15.00   Median : 5.00   Median : 5.00  
 Mean   :0.44   Mean   :15.19   Mean   : 5.42   Mean   : 5.03  
 3rd Qu.:1.00   3rd Qu.:19.00   3rd Qu.: 7.00   3rd Qu.: 7.00  
 Max.   :1.00   Max.   :22.00   Max.   :10.00   Max.   :10.00  
>

Correlations and scatter plots

To find the correlation between a set of variables, use the cor command:


round(cor(sim.data),2)    #round the correlations to 2 decimals


            trials  drug  time anxiety impulsivity   sex arousal tension performance IMP_2 TOD_3
trials        1.00  0.00 -0.05    0.06       -0.03 -0.05   -0.01   -0.33        0.06  0.06 -0.06
drug          0.00  1.00  0.14   -0.04        0.18 -0.05    0.49    0.42        0.38  0.07  0.20
time         -0.05  0.14  1.00    0.15        0.04 -0.12    0.78    0.14        0.70  0.00  0.94
anxiety       0.06 -0.04  0.15    1.00        0.06 -0.12    0.19    0.42        0.20 -0.02  0.13
impulsivity  -0.03  0.18  0.04    0.06        1.00 -0.13    0.03    0.05        0.03  0.82  0.02
sex          -0.05 -0.05 -0.12   -0.12       -0.13  1.00   -0.24   -0.07       -0.24 -0.14 -0.09
arousal      -0.01  0.49  0.78    0.19        0.03 -0.24    1.00    0.24        0.87  0.00  0.80
tension      -0.33  0.42  0.14    0.42        0.05 -0.07    0.24    1.00        0.19 -0.10  0.18
performance   0.06  0.38  0.70    0.20        0.03 -0.24    0.87    0.19        1.00  0.00  0.70
IMP_2         0.06  0.07  0.00   -0.02        0.82 -0.14    0.00   -0.10        0.00  1.00 -0.01
TOD_3        -0.06  0.20  0.94    0.13        0.02 -0.09    0.80    0.18        0.70 -0.01  1.00

For a limited number of variables, we can also show scatter plot matrices of all the relationships.


pairs.panels(sim.data)   #shows the scatter plots and correlations of all the variables

To show the relationship between any two variables use plot.


plot(impulsivity,performance)

Graphical output goes to a graphics window. This may be saved as a pdf using the save command (in the menus) and then printed in another document. On a mac, a new graphics window may be created by the quartz(height=6, width=6) command (height and width specified for nice looking output on a printer.)
quartz(height=6, width=6)

The relationship between variables is seen graphically in the plot command (see above). It is quantified with the Pearson Correlation. In case of missing values, it is necesary to specify that you want to use complete cases or pairwise deleted cases. Without rounding, the output is to 5 decimals. Rounding to 2 decimals is most appropriate.

Analysis of Variance

For analysis of variance of categorical variables, we can use the aov() function:


 aov.1 <- aov(arousal ~ drug,data=sim.data)  #one way ANOVA of drug
 summary(aov.1)   #summary table
 print(model.tables(aov.1,"means"),digits=3)  #the means
 
  aov.2 <- aov(arousal ~ TOD_3,data=sim.data)  #one way ANOVA of time of day
 summary(aov.2)   #summary table
 print(model.tables(aov.2,"means"),digits=3)  #the means
 
 aov.3 <- aov(arousal ~ drug * TOD_3,data=sim.data)   # 2 way ANOVA 
 summary(aov.3)
 print(model.tables(aov.3,"means"),digits=3) 
 
 
 which produces this output:
 
  aov.1 <- aov(arousal ~ drug ,data=sim.data)  
  print(model.tables(aov.1,"means"),digits=3) 
Tables of means
Grand mean
      
64.29 

 drug 
drug
   0    1 
61.2 68.3 

 TOD_3 
TOD_3
   1    2    3 
57.0 63.5 70.0 

 drug:TOD_3 
    TOD_3
drug 1    2    3   
   0 54.5 61.3 68.0
   1 59.4 66.3 73.1

>  summary(aov.1)
            Df  Sum Sq Mean Sq  F value    Pr(>F)    
drug         1 1246.31 1246.31  91.5979 1.244e-15 ***
TOD_3        1 2594.00 2594.00 190.6467 < 2.2e-16 ***
drug:TOD_3   1    0.08    0.08   0.0058    0.9397    
Residuals   96 1306.21   13.61                       
---
Signif. codes:  0 Ô***Ő 0.001 Ô**Ő 0.01 Ô*Ő 0.05 Ô.Ő 0.1 Ô Ő 1 
>

Graphics

To produce graphs from this analysis, we can use the plot and lines commands. We will do 2 graphs, one of the effect of Time of Day on arousal, the other of the (non-significant) interaction of the drug by time of day.




tod <- c(8,15,22)   #the values of the X axis
arouse <- c(57.0, 63.5, 70.0)  #copy the results from the means output
plot(tod,arouse,type="b", ylab="arousal",xlab="time of day",ylim=c(0,100))

Graphing interactions

The next set of instructions allow us to consider the interaction of categorical variables with continuous variables. The data are from a simulation of performance varying by time of day and impulsivity. (Note, for the analysis of the 205 simulation, you need to make sure you use your data, not the example data set used here.)

One can graph the data points and the regression lines (which I show), or just the regression lines.




datafilename="http://personality-project.org/r/datasets/simulations.txt"  #web copy of data file
sim.data =read.table(datafilename,header=TRUE)  #read the data file
attach(sim.data)     #make the sim.data the active set

#the next commands allow for a more generic set of plotting commands
#rather than replace the variabless DV, IV1, and IV2, just enter their names in models.

DV=performance   #The Dependent Variable
IV1=impulsivity  #continuous Independent Variable
IV2=time        #categorical Independent Variable

mydata=data.frame(DV,IV1,IV2)   #put them together into one data frame
attach(mydata)

zmydata=data.frame(scale(mydata))             #convert the data to standard scores to do the regression analysis
                                              #this makes the interaction term independent of the main effects
model=lm(DV~IV1+IV2+ IV1*IV2,data=zmydata)    #test the main effects of IV1 and IV2 and the interaction of IV1 and IV2
print(model)              #just show the coefficients
summary(model)            #show the coefficients as well as the probability estimates for the coefficients 



by(mydata,IV2,function(x) summary(lm(DV~IV1,data=x))) #give the summary stats for the regression DV~IV1 for each value of IV2

par(mfrow=c(1,1))                      #set the graphics back to a 1 x 1 plot 
symb=c(19,25,3,23)                     #choose some nice plotting symbols
colors=c("black","red","green","blue") #choose some nice colors

cat.iv=as.integer(IV2 < 10) +1       #convert IV2 to a categorical variable with 2 values of 1 and 2 
          
                              #now draw the DV~IV1 regression for each value of IV2
plot(IV1,DV,pch=symb[cat.iv],col=colors[cat.iv],cex=1.0,xlab="label for x axis",ylab="label for y axis",main="overall label goes here")   #show  the data points
by(mydata,IV2,function(x) abline(lm(DV~IV1,data=x)))  #show the best fitting regression for each group

text(6,60,"label for cat.iv=1")      #adjust the x, y coordinates to fit the graph
text(6,90,"label for cat.iv =2 ")

It is also easy to draw plots of specific points connected by lines. Consider if we have the experimental results on a Dependent Variable (DV) as a function of two Independent Variables (IV1 and IV2). Specifically, consider the results from aov.3 (above). Although in this example I refer to them generically, obviously one should change the labels to fit the data.)

	IV 1 (time of day)
	Low	Medium	High
IV 2 Low (Placebo)	54.5	61.3	68.0
IV 2 High (Caffeine	59.4	66.3	73.1

These results can be drawn using plot and lines. Create a variable IV1 for the values on the X axis. Then create two sets of coordinates of the DV, one for each level of the IV2. Then, use plot and lines.



quartz(height=6,width=6)   #create a new graphics window for this plot (mac)
IV1 <- c(8,15,22)    #the values of the first Independent Variable (time of day)

iv2low=c(54.5, 61.3 ,68.0)  #the Dependent variable values for low, medium and high IV 1 and low IV2
iv2high=c(59.4, 66.3, 73.1) #the Dependent variable values for low, mediuam and high IV 1 and high IV2
IV1 <- c(8,15,22)    #the values of the first Independent Variable

plot(IV1,iv2low,xlim=c(8,22),ylim=c(0,100),type="b",lty="dashed", xlab="Independent Variable 1" ,ylab="Dependent Variable",  pch=19,main="some title goes here")    #plot character is closed circle

lines(IV1,iv2high,,pch=22,type="b") #plot character is an open square
text(15,60,"DV2 = low")    #change the location of these labels to fit the data
text(15,70,"DV2 = high")

Making multiple panel graphs

Suppose we have a three way interaction and want to make a two panel graph. Then we specify that we want a graphic with 2 columns and 1 row, and then draw the graphics twice.

par(mfrow=c(1,2))    #a 1 row, 2 column graph 

IV1 <- c(8,15,22)    #the values of the first Independent Variable (time of day)

iv2low=c(54.5, 61.3 ,68.0)  #the Dependent variable values for low, medium and high IV 1 and low IV2
iv2high=c(59.4, 66.3, 73.1) #the Dependent variable values for low, mediuam and high IV 1 and high IV2
IV1 <- c(8,15,22)    #the values of the first Independent Variable

plot(IV1,iv2low,xlim=c(8,22),ylim=c(0,100),type="b",lty="dashed", xlab="Independent Variable 1" ,ylab="Dependent Variable",  pch=19,main="some title goes here")    #plot character is closed circle

lines(IV1,iv2high,,pch=22,type="b") #plot character is an open square
text(15,60,"DV2 = low")    #change the location of these labels to fit the data
text(15,70,"DV2 = high")

#now do the second figure (aribitray numbers)

IV1 <- c(8,15,22)    #the values of the first Independent Variable (time of day)

iv2low=c(44.5, 71.3 ,68.0)  #the Dependent variable values for low, medium and high IV 1 and low IV2
iv2high=c(69.4, 66.3, 53.1) #the Dependent variable values for low, mediuam and high IV 1 and high IV2
IV1 <- c(8,15,22)    #the values of the first Independent Variable

plot(IV1,iv2low,xlim=c(8,22),ylim=c(0,100),type="b",lty="dashed", xlab="Independent Variable 1" ,ylab="Dependent Variable",  pch=19,main="another title goes here")    #plot character is closed circle

lines(IV1,iv2high,,pch=22,type="b") #plot character is an open square
text(15,40,"DV2 = low")    #change the location of these labels to fit the data
text(15,80,"DV2 = high")

More extensive discussions of the use of R in psychology may be found in my short and somewhat longer tutorials as well as the much more developed tutorial by Jonathan Baron and Yuelin Li. Finally, when in doubt, it doesn't hurt to consult the help pages or do a "help.search("some string") command.

part of a A short guide to R
Version of November 17, 2005
William Revelle
Department of Psychology
Northwestern University