|
Introduction to R(For a very abbreviated form of this guide meant to help students do basic data analysis in a personality research course, see a very short guide. In addition, a short guide to data analysis in a research methods course offers some more detail on graphing.)There are many possible statistical programs that can be used in psychological research. They differ in multiple ways, at least some of which are ease of use, generality, and cost. Some of the more common packages used are Systat, SPSS, and SAS. These programs have GUIs (Graphical User Interfaces) that are relatively easy to use but that are unique to each package. These programs are also very expensive and limited in what they can do. Although convenient to use, GUI based operations are difficult to discuss in written form. When teaching statistics or communicating results to others, it is helpful to use examples that may be used by others using different computing environments and perhaps using different software. This set of brief notes describes an alternative approach that is widely used by practicing statisticians, the statistical environment R. This is not meant as a user's guide to R, but merely the first step in guiding people to helpful tutorials. I hope that enough information is provided in this brief guide to make the reader want to learn more. (For the impatient, an even briefer guide to analyzing data for personality research is also available.) It has been claimed that "The statistical programming language and computing environment S has become the de-facto standard among statisticians. The S language has two major implementations: the commercial product S-PLUS, and the free, open-source R. Both are available for Windows and Unix/Linux systems; R, in addition, runs on Macintoshes." From John Fox's short course on S and R. The R project, based upon the S and S+ stats packages, has developed an extremely powerful set of "packages" that operate within one program. Although described as merely "an effective data handling and storage facility [with] a suite of operators for calculations on arrays, in particular, matrices" R is, in fact, a very useful interactive package for data analysis. When compared to most other stats packages used by psychologists, R has at least three compelling advantages: it is free, it runs on multiple platforms (e.g., Windows, Unix, Linux, and Mac OS X and Classic), and combines many of the most useful statistical programs into one quasi integrated program. (R is free software as part of the GNU Project. That is, users are free to use, modify, and distribute the program, within the limits of the GNU non-license). The program itself and detailed installation instructions for Linux, Unix, Windows, and Macs are available through CRAN (Comprehensive R Archive Network). Although many run R as a language and programming environment, there are Graphical User Interfaces (GUIs) available for PCs, Linux and Macs. See for example, R Commander by John Fox. (Stefano M. Iacus and Simon Urbanek have done wonders in converting R to the Mac, and there is now a semi-GUI available as Mac release of R 2.7.0.) (A note on the numbering system: The R-development core team releases an updated version of R about every six months. That is, the current version of 2.7.0 will be replaced with 2.8.0 sometime in the fall of 2008. Bug fixes are then added with a sub version number (e.g. 2.7.1 will fix minor problems with 2.7.0). R is an integrated, interactive package for data manipulation and analysis that includes functions for standard descriptive statistics (means, variances, ranges) and also includes useful tools for Exploratory Data Analysis. In terms of inferential statistics it has many varieties of the General Linear Model including the conventional special cases such as Analysis of Variance and MANOVA. Advanced features include correlational packages for multivariate analyses including Factor and Principal Components Analysis, and cluster analysis. Advanced multivariate analyses packages that have been contributed to the R project include Structural Equation Modeling, Hierarchical Linear Modeling and taxometric analysis. All of these are available in the free "packages" distributed by the R group. Statisticians and statistically minded people around the world have contributed packages to the R Group and maintain a very active news group offering suggestions and help. In addition to be a package of routines, R is a interpreted programming language that allows one to create specific functions when needed. R is also an amazing program for producing statistical graphics. A collection of some of the best graphics is available at addictedtoR with a complete gallery of thumbnail of figures. |
An introduction to R is available as a pdf or as a paper back. It is worth reading and rereading. Once R is installed on your machine, the introduction may be obtained by using the help.start command. More importantly for the novice, a number of very helpful tutorials have been written for R. (e.g., the tutorial by John Verzani to help one teach introductory statistics and learn R at the same time--this has now been published as a book, but earlier versions are still available online), or (by Jonathan Baron and Yuelin Li) to use R in the Psychology lab. The Baron and Li tutorial is the most useful for psychologists trying to learn R. Another useful resource is John Fox's short course on R. For a more complete list of brief (<100 pages) and long( > 100 pages) tutorials, go to the "contributed" section of the CRAN. More locally, Roger Ratcliff and various graduate students developed a tutorial on how to do Analysis of Variance using S which may be used in R as well. I have taken their examples and adapted their code to show how to use R for ANOVAs. This guide was developed to help others learn R, and also to help students in Research Methods, Personality Research, Psychometric Theory, Structural Equation Modeling, or other courses do some basic statisics.
(Northwestern University specific) Local copies of R are available on the Macs in the Personality-Motivation-Cognition (PMC) lab. There is also a web based version for a course in statistics that I have not tried available from Academic Computing. In addition, R is available at the Social Science Computer Cluster. It is probably more useful, however, to download a copy to your own machine. To download a copy of the software, go to the download section of the Cran.rproject.org site. For Mac users, version 2.7.0 requires OS 10.4. Installation (at least on Macs) is very straightforward. Installation on Windows systems s or Unix is presumably equally easy, but I don't know.
This set of notes relies heavily on "An introduction to R" by Venables, Smith and the R Development Core Team, the very helpful Baron and Li guide and the teaching stats page of John Verzani. Their pages have helped me learn a little bit of R. There is at least one text book in using R: "Modern Applied Statistics with S" by Brian D. Ripley, William N. Venables". My psychometrics text (in progress) has examples in R.
The R help listserve is a very useful source of information. Just lurking will lead to the answers for many questions. Much of what is included in this tutorial was gleaned from comments sent to the help list. The most frequently asked questions have been organized into a FAQ. The archives of the help group are very useful and should be searched before asking for help. Jonathan Baron maintains a searchable data base of the help list serve. (Before asking for help, it is important to read the posting guide as well as "How to Ask Questions the Smart Way"
For a thoughtful introduction to R for SPSS and SAS users, see the tutorial developed at the University of Tennessee by Bob Muenchen. For a comparison of what is available in R to what is available in SPSS or SAS, see his table comparing the features of R to SPSS.
Commands are entered into the "R Console" window. You can add a comment to any line by using a #. The Mac version has a text editor window that allows you to write, edit and save your commands. Alternatively, if you use a normal text editor (As a Mac user, I use BBEDIT, PC users can use Notepad), you can write out the commands you want to run, comment them so that you can remember what they do the next time you run a similar analysis, and then copy and paste into the R console.
Although being syntax driven seems a throwback to an old, pre Graphical User Interface type command structure, it is very powerful for doing production statistics. Once you get a particular set of commands to work on one data file, you can change the name of the data file and run the entire sequence again on the new data set. This is is also very helpful when doing professional graphics for papers. In addition, for teaching, it is possible to prepare a web page of instructional commands that students can then cut and paste into R to see for themselves how things work. That is what may be done with the instructions on this page. It is also possible to write text in latex with embedded R commands. Then executing the Sweave function on that text file will add the R output to the latex file. This almost magical feature allows rapid integration of content with statistical techniques. More importantly, it allows for "reproducible research" in that the actual data files and instructions may be specified for all to see.
As you become more adept in using R, you will be tempted to enter commands directly into the console window. I think it is better to keep (annotated) copies of your commands to help you next time.
Command syntax tends to be of the form:
variable = function (parameters) or
variable <- function (parameters) the = and the <- symbol imply replacement, not equality
the preferred style seems to be to use the <- symbol to avoid confusion
elements of arrays are indicated within brackets [ ]
This guide assumes that you have installed a copy of R onto your computer and you are trying out the commands as you read through this. (Not necessary, but useful.)
When in doubt, use the help () function. This is identical to the ? function where function is what you want to know about. e.g.,
? read.table #ask for help in using the read.table function -- see the answer in its own window #or help (read.table) #another way of asking for help. - see the windowFor a list of all the commands that use a particular word, use the apropos () command:
apropos(table) #lists all the commands that have the word "table" in them
apropos(table)
[1] "ftable" "model.tables" "pairwise.table" "print.ftable" "r2dtable"
[6] "read.ftable" "write.ftable" ".__C__mtable" ".__C__summary.table" ".__C__table"
[11] "as.data.frame.table" "as.table" "as.table.default" "is.table" "margin.table"
[16] "print.summary.table" "print.table" "prop.table" "read.table" "read.table.url"
[21] "summary.table" "table" "write.table" "write.table0"
For more complicated functions (e.g., plot(), lm()) the help function shows very useful examples. A very nice example is demo(graphics) which shows many of the complex graphics that are possible to do. demo(lm.glm) gives many examples of the linear model/general linear model.
For very small data sets, the data can be directly entered into R. For more typical data sets, it useful to use a simple text editor or a spreadsheet program (e.g., Excel or OpenOffice). You can enter data in a tab delimted form with one variable per column and columns labeled with unique name. A numeric missing value code (say -999) is more convenient than using "." ala Systat. To read the data into a rows (subjects) by columns (variables) matrix use the read.table command.
A very useful command, for those using a GUI is file.choose() which opens a typical selection window for choosing a file.
#specify the name and address of the remote file datafilename <- file.choose() # use the OS to find the file #e.g., datafilename <- "Desktop/epi.big5.txt" #locate the local directory person.data <- read.table(datafilename,header=TRUE) #read the data file #Alternatively, to read in a comma delimited file: #person.data <- read.table(datafilename,header=TRUE,sep=",")
For smallish problems (< 32K), you can just cut and paste from a spreadsheet or text file into R.
# On a PC, the commands are
my.data <- read.table(file("clipboard"),header=TRUE)
#on a Mac, the command
my.data <- read.table(pipe("pbpaste"),header=TRUE)
(This requires remembering a rather complex combination of syntax, or using the read.clipboard function (from the psych package) or copying the following function into R:
function(header=TRUE) {
MAC<-Sys.info()[1]=="Darwin" #are we on a Mac using the Darwin system?
if (!MAC ) {if (header) read.clipboard<-read.table(file("clipboard"),header=TRUE)
else read.clipboard<-read.table(file("clipboard")) }
else {
if (header) read.clipboard<- read.table(pipe("pbpaste"),header=TRUE)
else read.clipboard<- read.table(pipe("pbpaste"))}
}
Then, to copy the clipboard :
mydata <- read.clipboard() #assumes the first line has header information or mydata <- read.clipboard(header=FALSE) #no headers.
Files can be comma delimited (csv) as well. In this case, you can specify that the seperators are commas. For very detailed help on importing data from different formats (SPSS, SAS, Matlab, Systat, etc.) see the data help page from Cran.
For teaching, it is important to note that it is possible to have the file be a remote file read through the web. (Note that for some commands, there is an important difference between line feeds and carriage returns. For those who use Macs as web servers, make sure that the unix line feed is used rather than old Mac format carriage returns.) For simplicity in my examples I have separated the name of the file to be read from the read.table command. These two commands can be combined into one. The file can be local (on your hard disk) or remote.
For most data analysis, rather than manually enter the data into R, it is probably more convenient to use a spreadsheet (e.g., Excel or OpenOffice) as a data editor, save as a tab or comma delimited file, and then read the data or copy using the read.clipboard() command. Most of the examples in this tutorial assume that the data have been entered this way. Many of the examples in the help menus have small data sets entered using the c() command or created on the fly.
For the first example, we read data from a remote file server for several hundred subjects on 13 personality scales (5 from the Eysenck Personality Inventory (EPI), 5 from a Big Five Inventory (BFI), the Beck Depression Inventory, and two anxiety scales). The file is structured normally, i.e. rows represent different subjects, columns different variables, and the first row gives subject labels. (To read from a local file, we simply change the name of the datafilename.)
#specify the name and address of the remote file datafilename <- "http://personality-project.org/r/datasets/maps.mixx.epi.bfi.data" #datafilename <- "Desktop/epi.big5.txt" #read from local directory or # datafilename <- file.choose() # use the OS to find the file #in all cases person.data <- read.table(datafilename,header=TRUE) #read the data file #Alternatively, to read in a comma delimited file: #person.data <- read.table(datafilename,header=TRUE,sep=",") names(person.data) #list the names of the variables produces this output > >datafilename <- "http://personality-project.org/r/datasets/maps.mixx.epi.bfi.data" > #datafilename <- "Desktop/epi.big5.txt" #read from local directory > person.data <- read.table(datafilename,header=TRUE) #read the data file > #Alternatively, to read in a comma delimited file: > #person.data <- read.table(datafilename,header=TRUE,sep=",") > names(person.data) #list the names of the variables [1] "epiE" "epiS" "epiImp" "epilie" "epiNeur" "bfagree" "bfcon" "bfext" "bfneur" "bfopen" "bdi" "traitanx" [13] "stateanx" >The data are now in the data.table "person.data". Tables allow one to have columns that are either numeric or alphanumeric. To address a particular row (e.g., subject = 5) and column (variable = 7) you simplely enter the required fields:
person.data[5,8] #the 5th subject, 8th variable or person.data[5,"bfext"] #5th subject, "Big Five Inventory - Extraversion" variable #or person.data[5,] #To list all the data for a particular subject (e.g, the 5th subject) person.data[5:10,] #list cases 5 - 10 person.data[5:10,"bfext"] #list just the extraversion scores for subjects 5-10 person.data[5:10,4:8] #list the 4th through 8th variables for subjects 5 - 10.In order to select a particular subset of the data, use the subset function. The next example uses subset to display cases where the lie scale was pretty high
subset(person.data,epilie>6) #print the data meeting the logical condition epilie>6 #also try person.data[epilie>6]produces this output
epiE epiS epiImp epilie epiNeur bfagree bfcon bfext bfneur bfopen bdi traitanx stateanx 16 11 5 6 7 13 126 78 112 83 132 4 45 28 212 6 4 1 7 4 147 119 102 81 142 2 26 21One can also selectively display particular columns meeting particular criteria, or selectively extract variables from a dataframe.
subset(person.data[,2:5],epilie>5 & epiNeur<3) #notice that we are taking the logical 'and' epiS epiImp epilie epiNeur 12 12 3 6 1 118 8 2 6 2 subset(person.data[,3:7], epilie>6 | epiNeur ==2 ) #do a logical 'or' of the two conditions epi <- subset(data,select=epiE:epiNeur) #select particular variables from a data frame
datafilename <- "http://personality-project.org/r/datasets/finkel.sav" #remote file datafilename <- "/Users/bill/Library/Favorites/R.tutorial/datasets/finkel.sav" #local file eli.data <- read.spss(datafilename, use.value.labels=TRUE, to.data.frame=TRUE) #works for local but not remote? seems to be a problem in uploading to the server
Data sets can be saved for later analyses using the save command and then reloaded using load. If files are saved on remote servers, use the load(url(remoteURLname)) command.
save(object,file="local name") #save an object (e.g., a correlation matrix) for later analysis
load(file) #gets the object (e.g., the correlation matrix back)
load(url("http://personality-project.org/r/datasets/big5r.txt")) #get the correlation matrix
It is a useful habit to be consistent in your own naming conventions. Some use lower case letters for variables, Capitalized for data frames, all caps for functions, etc. It is easier to edit your code if you are reasonably consistent. Comment your code as well. This is not just so others can use it, but so that you can remember what you did 6 months later.
ls() #shows the variables in the workspace
help (name) #provides help about "name"
? name #does the same
rm(variable) #removes that variable
rm(list = ls()) #removes all variables from the work space
attach("table") #makes the elements of "table" available to be called directly
names(table) #what are the variables inside the table
variable <-c(value1,value2,value3 ...) #assigns values to a variable.
newvariable <- cbind (variable1, variable2, variable3 ... variable n) # make up a new array with these variables
# e.g.
ls() #show the variables in the workspace datafilename <- "http://personality-project.org/R/datasets/maps.mixx.epi.bfi.data" person.data <- read.table(datafilename,header=TRUE) #read the data file names(person.data) #list the names of the variables attach(person.data) #make the separate variable available epi <- cbind(epiE,epiS,epiImp,epilie,epiNeur) #form a new variable "epi" epi.df <- data.frame(epi) #actually, more useful to treat this variable as a data frame bfi.df <- data.frame(cbind(bfext,bfneur,bfagree,bfcon,bfopen)) #create bfi as a data frame as well ls() #show the variables y <- edit(person.data) #show the data.frame or matrix x in a text editor and save changes to y fix(person.data) #show the data.frame or matrix x in a text editor invisibible(edit(x)) #creates an edit window without also printing to console -- directly make changes. Similar to the most basic spreadsheet. Very dangerous! head(x) #show the first few lines of a data.frame or matrix tail(x) #show the last few lines of a data.frame or matrix page(x) #show the structure of x
imp2 <- epiImp*epiImp #square epiImp si <- epiS+epiImp #sum sociability and impulsivity statetotrait <- stateanx/traitanx #find the ratio of state to trait anxiety weirdproduct <- epi*bfi #multiply respective elements in two tables meanimp <- mean(imp) stdevimp <- sd(imp) standarizedimp <- scale(imp,scale=TRUE) #center and then divide by the standard deviation
Logical operations can be performed. Two of the most useful are the ability to replace if a certain condition holds, and to find subsets of the data.
person.data[person.data == -9]=NA #replace all cases where a particular code is observed with NA males=subset(person.data |gender ==1) #x=subset(y|condition)
The power of R is shown when we combine various data manipulation procedures. Consider the problem of scoring a multiple choice test where we are interested in the number of items correct for each person. Define a scoring key vector and then score as 1 each item match with that key. Then find the number of 1s for each subject. We use the Iris data set as an example.
iris2 <- round(iris[1:10,1:4]) #get a sample data set key <- iris2[1,] #make up an arbitary answer key score <- t(t( rowSums((iris2 == c(key[]))+0))) #look for correct item and add them up mydata <- data.frame(iris2,score) # key #what is the scoring key mydata #show the data with the number of items that match the key(Thanks to Gabor Grothendieck and R help for this trick.)
ls() #list the variables in the work space
ls()
[1] "bfi.df" "datafilename" "epi" "epi.df" "person.data"
summary(epi.df) #standard descriptive stats
epiE epiS epiImp epilie epiNeur
Min. : 1.00 Min. : 0.000 Min. :0.000 Min. :0.000 Min. : 0.00
1st Qu.:11.00 1st Qu.: 6.000 1st Qu.:3.000 1st Qu.:1.000 1st Qu.: 7.00
Median :14.00 Median : 8.000 Median :4.000 Median :2.000 Median :10.00
Mean :13.33 Mean : 7.584 Mean :4.368 Mean :2.377 Mean :10.41
3rd Qu.:16.00 3rd Qu.: 9.500 3rd Qu.:6.000 3rd Qu.:3.000 3rd Qu.:14.00
Max. :22.00 Max. :13.000 Max. :9.000 Max. :7.000 Max. :23.00
summary(person.data["epiImp"]) #descriptive stats for one variable -- note the quote marks
epiImp
Min. :0.000
1st Qu.:3.000
Median :4.000
Mean :4.368
3rd Qu.:6.000
Max. :9.000
summary(epiImp) #show summary statistics for just a single variable
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.000 4.000 4.368 6.000 9.000
#The following do what their name implies
max()
min()
mean()
median()
sum()
var() #produces the variance covariance matrix
sd() #standard deviation
mad() #(median absolute deviation)
fivenum() #Tukey fivenumbers min, lowerhinge, median, upper hinge, max
#e.g.
attach(person.data) #make the variables inside of data available to be called by name
max(epiImp)
min(epiImp)
mean(epiImp)
median(epiImp)
var(epiImp)
sqrt(var(epiImp))
sum(epiImp)
sd(epiImp)
mad(epiImp)
fivenum(epiImp)
# and thus, to standardize a scale,
impz=(epiImp-mean(epiImp))/sd(epiImp) #standardizes one variable
epi.df=data.frame(epiE,epiS,epiImp,epilie,epiNeur)
#but, to standardize a matrix is a little trickier
epiz=(epi.df-mean(epi.df))/sd(epi.df) #doesn't work
epicentered=epi.df-mean(epi.df) #nor does this center the data
epicentered=t(t(epi.df)-mean(epi.df)) #but this at least centers around the mean (by transposing and then subtracting)
epiz=t(t(epicentered)/sd(epi.df)) #try this to get z scores
#however, there is a very useful command (scale) that does this automatically
epiz=scale(epi,scale=T) #centers (around the mean) and scales by the sd
#to print out fewer decimals, the round(variable,decimals) function is used.
round(mean(epiImp),2)
round(var(epiImp),2)
round(sqrt(var(epiImp)),2)
fivenum(epiImp) #give the lowest, 25%, median, 75% and highest value (compare to summary)
#simple graphical descriptions:
stem(bfneur) #stem and leaf diagram
The decimal point is 1 digit(s) to the right of the |
3 | 45
4 | 26778
5 | 000011122344556678899
6 | 000011111222233344567888899
7 | 00000111222234444566666777889999
8 | 0011111223334444455556677789
9 | 00011122233333444555667788888888899
10 | 00000011112222233333444444455556667778899
11 | 000122222444556677899
12 | 03333577889
13 | 0144
14 | 224
15 | 2
round(cor(epi.df),2) #correlation matrix with values rounded to 2 decimals
epiE epiS epiImp epilie epiNeur
epiE 1.00 0.85 0.80 -0.22 -0.18
epiS 0.85 1.00 0.43 -0.05 -0.22
epiImp 0.80 0.43 1.00 -0.24 -0.07
epilie -0.22 -0.05 -0.24 1.00 -0.25
epiNeur -0.18 -0.22 -0.07 -0.25 1.00
round(cor(epi.df,bfi.df),2) #cross correlations between the 5 EPI scales and the 5 BFI scales
bfext bfneur bfagree bfcon bfopen
epiE 0.54 -0.09 0.18 -0.11 0.14
epiS 0.58 -0.07 0.20 0.05 0.15
epiImp 0.35 -0.09 0.08 -0.24 0.07
epilie -0.04 -0.22 0.17 0.23 -0.03
epiNeur -0.17 0.63 -0.08 -0.13 0.09
For a stunning set of graphics produced with R and the code for drawing them, see addicted to R: R Graph Gallery : Enhance your data visualisation with R.
#see the graphic window for the output boxplot(epi) #boxplot of the five epi scales hist(epiE) #simple histogram plot(epiE,epiImp) #simple scatter plot pairs(epi.df) #splom plotboxplot(epi)
hist(epiE)
pairs(epi) # Basic ScatterPlot Matrix (SPLOM)
More advanced plotting (see r.graphics and r.plotregressions for details)
The following are examples of Analysis of Variance. A more complete listing and discussion of these examples including the output is in the R.anova page. This section just gives example instructions. See also the ANOVA section of Jonathan Baron's page.
#tell where the data come from datafilename="http://personality-project.org/r/datasets/R.appendix1.data" data.ex1=read.table(datafilename,header=T) #read the data into a table aov.ex1 = aov(Alertness~Dosage,data=data.ex1) #do the analysis of variance summary(aov.ex1) #show the summary table print(model.tables(aov.ex1,"means"),digits=3) #report the means and the number of subjects/cell boxplot(Alertness~Dosage,data=data.ex1) #graphical summary appears in graphics window
datafilename="http://personality-project.org/R/datasets/R.appendix2.data" data.ex2=read.table(datafilename,header=T) #read the data into a table data.ex2 #show the data aov.ex2 = aov(Alertness~Gender*Dosage,data=data.ex2) #do the analysis of variance summary(aov.ex2) #show the summary table print(model.tables(aov.ex2,"means"),digits=3) #report the means and the number of subjects/cell boxplot(Alertness~Dosage*Gender,data=data.ex2) #graphical summary of means of the 4 cells attach(data.ex2) interaction.plot(Dosage,Gender,Alertness) #another way to graph the means detach(data.ex2)
#Run the analysis:
datafilename="http://personality-project.org/r/datasets/R.appendix3.data"
data.ex3=read.table(datafilename,header=T) #read the data into a table
data.ex3 #show the data
aov.ex3 = aov(Recall~Valence+Error(Subject/Valence),data.ex3)
summary(aov.ex3)
print(model.tables(aov.ex3,"means"),digits=3) #report the means and the number of subjects/cell
boxplot(Recall~Valence,data=data.ex3) #graphical output
datafilename="http://personality-project.org/r/datasets/R.appendix4.data" data.ex4=read.table(datafilename,header=T) #read the data into a table data.ex4 #show the data aov.ex4=aov(Recall~(Task*Valence)+Error(Subject/(Task*Valence)),data.ex4 ) summary(aov.ex4) print(model.tables(aov.ex4,"means"),digits=3) #report the means and the number of subjects/cell boxplot(Recall~Task*Valence,data=data.ex4) #graphical summary of means of the 6 cells attach(data.ex4) interaction.plot(Valence,Task,Recall) #another way to graph the interaction detach(data.ex4)
datafilename="http://personality-project.org/r/datasets/R.appendix5.data" data.ex5=read.table(datafilename,header=T) #read the data into a table #data.ex5 #show the data aov.ex5 = aov(Recall~(Task*Valence*Gender*Dosage)+Error(Subject/(Task*Valence))+(Gender*Dosage),data.ex5 ) summary(aov.ex5) print(model.tables(aov.ex5,"means"),digits=3) #report the means and the number of subjects/cell boxplot(Recall~Task*Valence*Gender*Dosage,data=data.ex5) #graphical summary of means of the 36 cells boxplot(Recall~Task*Valence*Dosage,data=data.ex5) #graphical summary of means of 18 cells
datafilename="/Users/bill/Desktop/R.tutorial/datasets/recall1.data" recall.data=read.table(datafilename,header=TRUE) recall.data #show the data
We can use the "stack() function to arrange the data in the correct manner. We then need to create a new data.frame (recall.df) to attach the correct labels to the correct conditions. This seems more complicated than it really is (although it is fact somewhat tricky). It is useful to list the data after the data frame operation to make sure that we did it correctly. (This and the next two examples are adapted from Baron and Li's page. ) We make use of the rep(), c(), and factor() functions.
rep (operation,number) repeats an operation number times
c(x,y) forms a vector with x and y elements
factor (vector) converts a numeric vector into factors for an ANOVA
More detail on this technique, with comparisons of the original data matrix to the stacked matrix to the data structure version is shown on the r.anova page.
raw=recall.data[,1:8] #just trial data
#First set some specific paremeters for the analysis -- this allows
numcases=27 #How many subjects are there?
numvariables=8 #How many repeated measures are there?
numreplications=2 #How many replications/subject?
numlevels1=2 #specify the number of levels for within subject variable 1
numlevels2=2 #specify the number of levels for within subject variable 2
stackedraw=stack(raw) #convert the data array into a vector
#add the various coding variables for the conditions
#make sure to check that this coding is correct
recall.raw.df=data.frame(recall=stackedraw,
subj=factor(rep(paste("subj", 1:numcases, sep=""), numvariables)),
replication=factor(rep(rep(c("1","2"), c(numcases, numcases)), numvariables/numreplications)),
time=factor(rep(rep(c("short", "long"), c(numcases*numreplications, numcases*numreplications)),numlevels1)),
study=rep(c("d45", "d90") ,c(numcases*numlevels1*numreplications,numcases*numlevels1*numreplications)))
recall.aov= aov(recall.values ~ time * study + Error(subj/(time * study)), data=recall.raw.df) #do the ANOVA
summary(recall.aov) #show the output
print(model.tables(recall.aov,"means"),digits=3) #show the cell means for the anova table
Consider the following models:
datafilename="http://personality-project.org/r/datasets/maps.mixx.epi.bfi.data" #where are the data
personality.data =read.table(datafilename,header=TRUE) #read the data file
names(personality.data) #what variables are in the data set?
attach(personality.data) #make the variables easier to use
model1 = lm(bdi~epiNeur) #simple regression of beck depression on Neuroticism
summary(model1) # basic statistical summary
#pass parameters to the graphics device
op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot
pty = "s") # square plotting region,
# independent of device size
plot(model1) #diagnostic plots in the graphics window
model2=lm(bdi~epiNeur+traitanx) #add in trait anxiety
summary(model2) #basic output
plot(model2)
anova(model1,model2) #compare the difference between the two models
model2.5=lm(bdi~epiNeur*traitanx) #test for the interaction, note that the main effects are incorrect
summary(model2.5) #because we need to 0 center the data
anova(model2,model2.5) #compare the two models
#rescale the data to do the analysis
cneur=scale(epiNeur,scale=F) #0 center epiNeur
zneur=scale(epiNeur,scale=T) #standardize epiNeur
ctrait = scale(traitanx,scale=F) #0 center traitAnx
model3=lm(bdi~cneur+ctrait+cneur*ctrait)
summary(model3) #explicitly list the additive and interactive terms
plot(model3)
model4=lm(bdi~cneur*ctrait)
summary(model4) #note how this is exactly the same as the previous model
epi=cbind(epiS,epiImp,epilie,epiNeur) #form a new variable "epi" without overall extraversion
epi=as.matrix(epi) #actually, more useful to treat this variable as a matrix
bfi=as.matrix(cbind(bfext,bfneur,bfagree,bfcon,bfopen)) #create bfi as a matrix as well
epi=scale(epi,scale=T) #standardize the epi
bfi=scale(bfi,scale=T) #standardize the bfi
model5=lm(bdi~bfi) #model beck depression by the Big 5 inventory
summary(model5)
model6=lm(bdi~bfi+epi) #
summary(model6)
model7 = lm(bdi~bfi*epi) #additive model of epi and bfi as well as the interactions between the sets
summary(model7) #given as an example of overfitting
## At end of plotting, reset to previous settings:
par(op)
Further examples show how multiple regression can be done with multiple dependent variables at the same time and how regressions can be done based upon the correlation matrix rather than the raw data.
In order to visualize interactions, it is useful to plot regression lines separately for different groups. This is demonstrated in some detail in a real example based upon heating demands of two houses.
One of the most common problems in personality research is to combine a set of items into a scale. Questions to ask of these items and the resulting scale are a) what are the item means and variances. b) What are the intercorrelations of the items in the scale. c) What are the correlations of the items with the composite scale. d) what is an estimate of the internal consistency reliability of the scale. (For a somewhat longer discussion of this, see the internal structure of tests.)
The following steps analyze a small subset of the data of a large project (the synthetic aperture personality measurement project at the Personality, Motivation, and Cognition lab). The data represent responses to five items sampled from items measuring extraversion, emotional stability, agreeableness, conscientiousness, and openness taken from the IPIP (International Personality Item Pool) for 200 subjects.
#get the data
datafilename="http://personality-project.org/R/datasets/extraversion.items.txt" #where are the data
items=read.table(datafilename,header=TRUE) #read the data
attach(items) #make this the active path
E1=q_262 -q_1480 +q_819 -q_1180 +q_1742 +14 #find a five item extraversion scale
#note that because the item responses ranged from 1-6, to reverse an item
#we subtract it from the maximum response possible + the minimum.
#Since there were two reversed items, this is the same as adding 14
E1.df = data.frame(q_262 ,q_1480 ,q_819 ,q_1180 ,q_1742 ) #put these items into a data frame
summary(E1.df) #give summary statistics for these items
round(cor(E1.df,use="pair"),2) #correlate the 5 items, rounded off to 2 decimals, use pairwise cases
round(cor(E1.df,E1,use="pair"),2) #show the item by scale correlations
#define a function to find the alpha coefficient
alpha.scale=function (x,y) #create a reusable function to find coefficient alpha
#input to the function are a scale and a data.frame of the items in the scale
{
Vi=sum(diag(var(y,na.rm=TRUE))) #sum of item variance
Vt=var(x,na.rm=TRUE) #total test variance
n=dim(y)[2] #how many items are in the scale? Calculated dynamically
((Vt-Vi)/Vt)*(n/(n-1))} #alpha
E.alpha=alpha.scale(E1,E1.df) #find the alpha for the scale E1 made up of the 5 items in E1.df
Produces the following output:
summary(E1.df) #give summary statistics for these items
q_262 q_1480 q_819 q_1180 q_1742
Min. :1.00 Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:2.00 1st Qu.:2.000 1st Qu.:4.000 1st Qu.:2.000 1st Qu.:3.750
Median :3.00 Median :3.000 Median :5.000 Median :4.000 Median :5.000
Mean :3.07 Mean :2.885 Mean :4.565 Mean :3.295 Mean :4.385
3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:6.000
Max. :6.00 Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
round(cor(E1.df,use="pair"),2) #correlate the 5 items, rounded off to 2 decimals, use complete cases
q_262 q_1480 q_819 q_1180 q_1742
q_262 1.00 -0.26 0.41 -0.51 0.48
q_1480 -0.26 1.00 -0.66 0.52 -0.47
q_819 0.41 -0.66 1.00 -0.41 0.65
q_1180 -0.51 0.52 -0.41 1.00 -0.49
q_1742 0.48 -0.47 0.65 -0.49 1.00
round(cor(E1.df,E1,use="pair"),2) #show the item by scale correlations
[,1]
q_262 0.71
q_1480 -0.75
q_819 0.80
q_1180 -0.78
q_1742 0.80
alpha.scale=function (x,y) #find coefficient alpha given a scale and a data.frame of the items in the scale
+ {
+ Vi=sum(diag(var(y,na.rm=TRUE))) #sum of item variance
+ Vt=var(x,na.rm=TRUE) #total test variance
+ n=dim(y)[2] #how many items are in the scale? Calculated dynamically
+ ((Vt-Vi)/Vt)*(n/(n-1))} #alpha
>
>
>
E.alpha=alpha.scale(E1,E1.df,5) #find the alpha for the scale E1 made up of the 5 items in E1.df
E.alpha
[1] 0.822683
If you are using multiple choice tests and want to score the items following a key, it is possible to use the power of R for data manipulation:
iris2 <- round(iris[1:10,1:4]) #get a sample data set key <- iris2[1,] #make up an arbitary answer key score <- t(t( rowSums((iris2 == c(key[]))+0))) #look for correct item and add them up mydata <- data.frame(iris2,score) # key #what is the scoring key mydata #show the data with the number of items that match the key (Thanks to Gabor Grothendieck and R help for this trick.)
For the following analyses, we will use data from the Motivational State Questionnaire (MSQ) collected in several studies. This is a subset of a larger data set (with N>3000 that is being analyzed for the structure of mood states. The data set includes EPI and BFI scale scores (see previous examples).
#specify the name and address of the remote file datafilename="http://personality-project.org/r/datasets/maps.mixx.msq1.epi.bf.txt" data =read.table(datafilename,header=TRUE) #read the data file msq=data[,2:73] #select the subset of items in the MSQ msq[msq=="9"] = NA # change all occurences of 9 to be missing values msq <- data.frame(msq) #convert the input matrix into a data frame for easier manipulation names(msq) #what are the variables? summary(msq) #basic summary statistics -- check for miscodings cleaned=na.omit(msq) #remove the cases with missing values f2=factanal(cleaned,2,rotation="varimax") #factor analyze the resulting item #(f2) #show the result load=loadings(f2) print(load,sort=TRUE,digits=2,cutoff=0.01) #show the loadings plot(load) #plot factor 1 by 2 identify(load,labels=names(msq)) #put names of selected points onto the figure
It is also possible to find and save the covariance matrix from the raw data and then do subsequent analyses on this matrix. This clearly saves computational time for large data sets. This matrix can be saved and then reloaded.
msqcovar=cov(msq,use="pair") #find the covariance matrix for later factoring f3=factanal(x, factors=3, data = NULL, covmat = msqcovar, n.obs = 300,start=NULL, rotation = "varimax") f4=factanal(x, factors=4, data = NULL, covmat = msqcovar, n.obs = 300,start=NULL, rotation = "varimax")
There are multiple ways to determine the appropriate number of factors in exploratory factor analysis. Routines for the Very Simple Structure (VSS) criterion allow one to compare solutions of varying complexity and for different number of factors. Alternatives include the scree test. To use these routines on a data set with items, myitems,:
source("http://personality-project.org/r/vss.r")
my.vss <- VSS(mydata,n=8,rotate="none",diagonal=FALSE,...) #compares up to 8 factors
op <- par(mfrow=c(1,2)) #make a two panel graph
VSS.plot(my.vss) #shows a simple summary of VSS
VSS.scree(cor(mydata),main="scree plot of principal components of mydata")
McDonald has proposed coefficient omega as an estimate of the general factor saturation of a test. Zinbarg, Revelle, Yovel and Li (2005) compare McDonald's Omega to Chronbach's alpha and Revelle's beta. They conclude that omega is the best estimate. (See also Zinbarg et al., 2006)
One way to find omega is to do a factor analysis of the original data set, rotate the factors obliquely, do a Schmid Leiman transformation, and then find omega. Here we present code to do that. This code is included in the psych package of routines for personality research that may be loaded from the CRAN repository or, for the the recent development version, from the local repository at http://personality-project.org/r.
Beta, an alternative to omega, is defined as the worst split half reliability. It can be estimated by using ICLUST (Revelle, 1979), a hierarchical clustering algorithm originally developed for main frames and written in Fortran and that is now being developed in R. (For a very complimentary review of why the ICLUST algorithm is useful in scale construction, see Cooksey and Soutar, 2005). What took multiple years and about 2500 lines of code in Fortran has taken about 4 days and 100 lines of R.
Rotations available in the basic R installation are Varimax and Promax. A powerful additional set of oblique transformations including Oblimin, Oblimax, etc. are available in the GPArotations package from Cran. Using this package, it is also possible to do a Schmid Leiman transformation of a hierarchical factor structure to estimate the general factor loadings and general factor saturation of a test. (see Omega .
An example of clustering variables can be seen using the ICLUST algorithm to the 24 tests of mental ability of Harman and then using the Graphviz program to show the results.
r.mat<- Harman74.cor$cov ic.demo <- ICLUST(r.mat) ICLUST.graph(ic.demo,out.file = ic.demo.dot)
Model: Distance = square root of sum of squared distances on k dimensions dxy = √∑(xi-yi)2
Data: a matrix of distances
Find the dimensional values in k = 1, 2, ... dimensions for the objects that best reproduces the original data.
Example: Consider the distances between nine American cities. Can we represent these cities in a two dimensional space.
See the pages on multidimensional scaling and Thurstonian scaling.
Jan de Leeuw at UCLA is converting a number of mds packages including the ALSCAL program for R. See his page at http://www.cuddyvalley.org/psychoR/code. ALSCAL generalizes the INDSCAL algorithm for individual differences in multiple dimensional scaling.
Structural equation models combine measurement models (e.g., reliability) with structural models (e.g., regression). The sem package, developed by John Fox, allows for some basic structural equation models. To use it, add the sem package.
Structural Equation Modeling may be thought of as regression corrected for attentuation. The sem package developed by John Fox uses the RAM path notation of Jack McCardle and is fairly straightforward. Fox has prepared a brief description of SEM techniques as an appendix to his statistics text. The examples in the package are quite straightforward. A text book, such as John Loehlin's Latent Variable Models (4th Edition) is helpful in understanding the algorithm.
Demonstrations of using the sem package for several of the Loehlin problems are discussed in more detail on a separate page. In addition, lecture notes for a course on sem (primarily using R) are available at the syllabus for my sem course.
For a detailed demonstration of how to do 1 parameter IRT (the Rasch Model) see Jonathan Baron and Yuelin Li's tutorial on R. A more useful package for latent trait modeling (e.g., Rasch modeling and item response theory) (ltm) has now been released by Dimitris Rizopoulos. Note that to use this package, you must first install the MASS, gtools, and msm packages.
The ltm package allows for 1 parameter (Rasch) and two parameter (location and discrimination) modeling of a single latent trait assessed by binary items varying in location and discrimination. Procedures include graphing the results as well as anova comparisons of nested models.
There are at least 20 distributions available including the normal, the binomial, the exponential, the logistic, Poisson, etc. Each of these can be accessed for density ('d'), cumulative density ('p'), quantile ('q') or simulation ('r' for random deviates). Thus, to find out the probability of a normal score of value x from a distribution with mean=m and sd = s,
pnorm(x,mean=m, sd=s), eg.
pnorm(1,mean=0,sd=1) [1] 0.8413447 same as pnorm(1) #default values of mean=0, sd=1 are used) pnorm(1,1,10) #parameters may be passed if in default order or by nameExamples of using R to demonstrate the central limit theorem or to produce circumplex structure of personality items are useful ways to teach sampling theory or to explore artifacts in measurement.
Simulating a simple correlation between two variables based upon their correlation with a latent variable:
samplesize=1000 size.r=.6 theta=rnorm(samplesize,0,1) #generate some random normal deviates e1=rnorm(samplesize,0,1) #generate errors for x e2=rnorm(samplesize,0,1) #generate errors for y weight=sqrt(size.r) #weight as a function of correlation x=weight*theta+e1*sqrt(1-size.r) #combine true score (theta) with error y=weight*theta+e2*sqrt(1-size.r) cor(x,y) #correlate the resulting pair df=data.frame(cbind(theta,e1,e2,x,y)) #form a data frame to hold all of the elements round(cor(df),2) #show the correlational structure pairs.panels(df) #plot the correlational structure (assumes psych package)
The use of the mvtnorm package and the rmvnorm(n, mean, sigma) function allows for creating random covariance matrices with a specific structure.
e.g., using the sample sizes and rs from above
library(mvtnorm) samplesize=1000 size.r=.6 sigmamatrix <- matrix( c(1,sqrt(size.r),sqrt(size.r),sqrt(size.r),1,size.r,sqrt(size.r),size.r,1),ncol=3) xy <- rmvnorm(samplesize,sigma=sigmamatrix) round(cor(xy),2) pairs.panels(xy) #assumes the psych package
Another simulation shows how to create a multivariate structure with a particular measurement model and a particular structural model. This example produces data suitable for demonstrations of regression, correlation, factor analysis, or structural equation modeling.
The particular example assumes that there are 3 measures of ability (GREV, GREQ, GREA), two measures of motivation (achievment motivation and anxiety), and three measures of performance (Prelims, GPA, MA). These titles are, of course, arbitrary and can be changed easily. These (simulated) data are used in various lectures of mine on Factor Analysis and Psychometric Theory. Other examples of simulation of item factor structure include creating circumplex structured correlation matrices.
Simulations may also be used to teach concepts in experimental design and analysis. Several simulations of the additive and interactive effects of personality traits with situational manipulations of mood and arousal compare alternative ways of analyzing data.
Finally, a simple simulation to generate and test circumplex versus structures in two dimensions has been added.
A very powerful feature of R is that it is extensible. That is, one can write one's own functions to do calculations of particular interest. These functions can be saved as external file that can be accessed using the source command. For example, I frequently use a set of routines that find the alpha reliability of a set of scales, or to draw scatter plots with histograms on the diagonal and font sizes scaled to the absolute value of the correlation. Another set of routines applies the Very Simple Structure (VSS) criterion for determining the optimal number of factors. I originally stored these two sets of files (vss.r and useful.r) and then added them into my programs by using the source command.
source("http://personality-project.org/r/useful.r") #some basic data manipulation procedures
source("http://personality-project.org/r/vss.r") #the Very Simple Structure package
However, an even more powerful option is to create packages of frequently used code. These packages can be stored on local "repositories" if they appeal to members of small work groups, or can be added to the CRAN master repository. Guidelines for creating packages are found in the basic help menus. Notes on how to create a specific package (i.e., the psych package) and install it as a repository are meant primarily for Mac users. More extensive discussions for PCs have also been developed.
psych was added to CRAN in about 2006 and is now up to version 1.0.43. As I find an analysis that I need to do that is not easily done using the standard packages, I supplement the psych package. I encourage the reader to try it. The manual is available on line or as part of the package.
To add a package from CRAN (e.g, sem, GPArotation, psych), go to the R package installer, and select install. Then, using the R package Manager, load that package.
To add a package (e.g., psych) from the personality-project.org web page, select the other repository option in the R Package Installer and set it to http://personality-project.org/r . A list of available packages stored there will be displayed. Choose the one you want and then load it with the R package manager. Currently, there is only one package (psych) there. This version will be the current development version will be at least as new as what is on the CRAN site. You must specify that you want a source file if use the personality-project.org/r repository. As of now (May 25, 2008), the psych package contains more than a 100 functions including the following ones:
| psych-package | A package for personality, psychometric, and psychological research |
| %+% | A function to add two vectors or matrices |
| alpha.scale | Cronbach alpha for a scale |
| circ.sim | Generate simulated data structures for circumplex or simple structure |
| circ.simulation | Simulations of circumplex and simple structure |
| circ.tests | Apply four tests of circumplex versus simple structure |
| cluster.cor | Find correlations of composite variables from a larger matrix |
| cluster.fit | cluster Fit: fit of the cluster model to a correlation matrix |
| cluster.loadings | Find item by cluster correlations, corrected for overlap and reliability |
| correct.cor | Find dis-attenuated correlations and give alpha reliabilities |
| count.pairwise | Count number of pairwise cases for a data set with missing (NA) data. |
| describe | Basic descriptive statistics useful for psychometrics |
| describe.by | Basic summary statistics by group |
| eigen.loadings | Extract eigen vectors, eigen values, show loadings |
| error.crosses | Plot x and y error bars |
| factor.congruence | Coefficient of factor congruence |
| factor.fit | How well does the factor model fit a correlation matrix. Part of the VSS package |
| factor.model | Find R = F F' + U2 is the basic factor model |
| factor.pa | Principal Axis Factor Analysis |
| factor.residuals | R* = R- F F' |
| factor.rotate | "Hand" rotate a factor loading matrix |
| factor2cluster | Extract cluster definitions from factor loadings |
| fisherz | Fisher z transform of r |
| geometric.mean | Find the geometric mean of a vector or columns of a data.frame. |
| harmonic.mean | Find the harmonic mean of a vector, matrix, or columns of a data.frame |
| ICLUST | ICLUST: Item Cluster Analysis - Hierarchical cluster analysis using psychometric principles |
| ICLUST.cluster | Function to form hierarchical cluster analysis of items |
| ICLUST.graph | create control code for ICLUST graphical output |
| ICLUST.sort | sort items by absolute size of cluster loadins |
| irt.0p | Item Response Theory estimate of theta (ability) using a Rasch (like) model |
| irt.1p | Item Response Theory estimate of theta (ability) using a Rasch (like) model |
| irt.2p | Item Response Theory estimate of theta (ability) using a Rasch (like) model |
| irt.discrim | Simple function to estimate item difficulties using IRT concepts |
| irt.item.diff.rasch | Simple function to estimate item difficulties using IRT concepts |
| irt.person.rasch | Item Response Theory estimate of theta (ability) using a Rasch (like) model |
| kurtosi | Kurtosis of a vector, matrix, or data frame |
| make.hierarchical | Create a population or sample correlation matrix with hierachical structure. |
| mat.regress | Multiple Regression from matrix input |
| multi.hist | Multiple histograms on one screen |
| omega | Calculate the omega estimate of factor saturation |
| paired.r | Test the difference between paired correlations |
| pairs.panels | SPLOM, histograms and correlations for a data matrix |
| panel.cor | SPLOM, histograms and correlations for a data matrix |
| panel.hist | SPLOM, histograms and correlations for a data matrix |
| phi | Find the phi coefficient of correlation between two dichotomous variables |
| phi2poly | Convert a phi coefficient to a polychoric correlation |
| polychor.matrix | Actually, what does this do? Convert a matrix of phi coefficients to a matrix of polycoric correlations |
| principal | Principal components analysis |
| psych | A package for personality, psychometric, and psychological research |
| psycho.demo | Create demo data for psychometrics |
| read.clipboard | shortcut for reading from the clipboard |
| schmid | Apply the Schmid Leiman transformation to a correlation matrix |
| score.alpha | Score scales and find Cronbach's alpha as well as associated statistics |
| score.items | Score item composite scales and find Cronbach's alpha as well as associated statistics |
| skew | Calculate skew for a vector, matrix, or data.frame |
| VSS | Apply the Very Simple Structure criterion to determine the appropriate number of factors. |
| VSS.parallel | Compare real and random VSS solutions |
| VSS.plot | Plot VSS fits |
| VSS.scree | Plot a scree test |
| VSS.simulate | create VSS like data |
To install the psych package using a Mac, go to the Package Installer Menu option, choose other repository, enter http://personality-project.org/r and it should get the package. Uncheck the "binary" option. This is a source file.
To install on a PC, you must get the zip file from http://personality-project.org/r, and then use the package installer to read the local file.
Once installed, the package can be activated by using the package manager (or just issuing the package(psych) command.
#read files with labels in first row read.table(filename,header=TRUE) #read a tab or space delimited file read.table(filename,header=TRUE,sep=',') #read csv files (comma separated) x=c(1,2,4,8,16 ) #create a data vector with specified elements y=c(1:8,1:4) #creat a data vector with 12 entries matr=rbind(1:8,1:4) #create two rows in a 2 * 8 matrix matc=cbind(1:8,1:4) #create two columns in a 8 * 2 matrix n=10 x1=c(rnorm(n)) #create a n item vector of random normal deviates y1=c(runif(n))+n #create another n item vector that has n added to each random uniform distribution z=rbinom(n,size,prob) #create n samples of size "size" with probability prob from the binomialitem sample(x, size, replace = FALSE, prob = NULL) #take a sample (with or without replacement) of size from x vect=c(x,y) #combine them into one vector of length 2n mat=cbind(x,y) #combine them into a n x 2 matrix (column wise) mat[4,2] #display the 4th row and the 2nd column mat[3,] #display the 3rd row mat[,2] #display the 2nd column mat=cbind(rep(1:4,2),rep(4:1,2)) #create a 8 * 2 matrix with repeating elements subset(data,logical) #those objects meeting a logical criterion subset(data.df,select=variables,logical) #get those objects from a data frame that meet a criterion
ls() #list the variables in the workspace rm(x) #remove x from the workspace rm(list=ls()) #remove all the variables from the workspace attach(mat) #make the names of the variables in the matrix available detach(mat) #releases the names new=old[,-n] #drop the nth column new=old[n,] #drop the nth row new=subset(old,logical) #select those cases that meet the logical condition complete = subset(data,complete.cases(data)) #find those cases with no missing values new=old[n1:n2,n3:n4] #select the n1 through n2 rows of variables n3 through n4)
x.df=data.frame(x1,x2,x3 ...) #combine different kinds of data into a data frame as.data.frame() is.data.frame() x=as.matrix() scale() #converts a data frame to standardized scores factor() #converts a numeric variable into a factor (essential for ANOVA) gl(n,k,length) #makes an n-level,k replicates, length long vectof factors y <- edit(x) #opens a screen editor and saves changes made to x intoy fix(x) #opens a screen editor window and makes and saves changes to x
max() min() mean() median() sum() var() #produces the variance covariance matrix sd() #standard deviation mad() #(median absolute deviation) fivenum() #Tukey fivenumbers min, lowerhinge, median, upper hinge, max scale(data,scale=T) #centers around the mean and scales by the sd) colSums(), rowSums(), colMeans(), rowMeans() #see also apply(x,1,sum) rowsum(x,group) #sum by group cor(x,y,use="pair") #correlation matrix for pairwise complete data, use="complete" for complete cases t.test(x,y) #x is a data vector, y is a grouping vector independent groups t.test(x,y,pair=TRUE) #x is a data vector, y is a grouping vector -- paired groups pairwise.t.test(x,g) does multiple comparisons of all groups defined by g aov(x~y,data=datafile) #where x and y can be matrices aov.ex1 = aov(Alertness~Dosage,data=data.ex1) #do the analysis of variance or aov.ex2 = aov(Alertness~Gender*Dosage,data=data.ex2) #do a two way analysis of variance summary(aov.ex1) #show the summary table print(model.tables(aov.ex1,"means"),digits=3) #report the means and the number of subjects/cell boxplot(Alertness~Dosage,data=data.ex1) #graphical summary appears in graphics window lm(x~y,data=dataset) #basic linear model where x and y can be matrices lm(Y~X) #Y and X can be matrices lm(Y~X1+X2) lm(Y~X|W) #separate analyses for each level of W solve(A,B) #inverse of A * B - used for linear regression solve(A) #inverse of A
colSums (x, na.rm = FALSE, dims = 1)
rowSums (x, na.rm = FALSE, dims = 1)
colMeans(x, na.rm = FALSE, dims = 1)
rowMeans(x, na.rm = FALSE, dims = 1)
rowsum(x, group, reorder = TRUE, ...) #finds row sums for each level of a grouping variable
apply(X, MARGIN, FUN, ...) #applies the function (FUN) to either rows (1) or columns (2) on object X
apply(x,1,min) #finds the minimum for each row
apply(x,2,max) #finds the maximum for each column
col.max(x) #another way to find which column has the maximum value for each row
which.min(x)
which.max(x)
z=apply(big5r,1,which.min) #tells the row with the minimum value for every column
stem() #stem and leaf diagram
par(mfrow=c(2,1)) #number of rows and columns to graph
boxplot(x,notch=T,names= grouping, main="title") #boxplot (box and whiskers)
hist() #histogram
plot()
plot(x,y,xlim=range(-1,1),ylim=range(-1,1),main=title)
par(mfrow=c(1,1)) #change the graph window back to one figure
symb=c(19,25,3,23)
colors=c("black","red","green","blue")
charact=c("S","T","N","H")
plot(x,y,pch=symb[group],col=colors[group],bg=colors[condit],cex=1.5,main="main title")
points(mPA,mNA,pch=symb[condit],cex=4.5,col=colors[condit],bg=colors[condit])
curve()
abline(a,b)
abline(a, b, untf = FALSE, ...)
abline(h=, untf = FALSE, ...)
abline(v=, untf = FALSE, ...)
abline(coef=, untf = FALSE, ...)
abline(reg=, untf = FALSE, ...)
identify()
plot(eatar,eanta,xlim=range(-1,1),ylim=range(-1,1),main=title)
identify(eatar,eanta,labels=labels(energysR[,1]) ) #dynamically puts names on the plots
locate()
pairs() #SPLOM (scatter plot Matrix)
matplot () #ordinate is row of the matrix
biplot () #factor loadings and factor scores on same graph
coplot(x~y|z) #x by y conditioned on z
symb=c(19,25,3,23) #choose some nice plotting symbols
colors=c("black","red","green","blue") #choose some nice colors
barplot() #simple bar plot
interaction.plot () #shows means for an ANOVA design
plot(degreedays,therms) #show the data points
by(heating,Location,function(x) abline(lm(therms~degreedays,data=x))) #show the best fitting regression for each group
x= recordPlot() #save the current plot device output in the object x
replayPlot(x) #replot object x
dev.control #various control functions for printing/saving graphic files