% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
\documentclass[12pt]{article}
\usepackage{geometry}   % See geometry.pdf to learn the layout options. There are lots.
\geometry{letterpaper}  % ... or a4paper or a5paper or ...
%\geometry{landscape}   % Activate for rotated page geometry
%\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
\usepackage[ae,hyper]{Rd}
\usepackage{graphicx}
%\usepackage{amssymb}
\usepackage{epstopdf}
\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\title{Appendix--1 \\Matrix Algebra in R}
\author{\href{http://personality-project.org/revelle.html}{William Revelle}\\ Northwestern University}
\SweaveOpts{echo=TRUE}
\usepackage{a4wide}
\begin{document}
\maketitle
\thanks{Prepared as part of a \href{http://personality-project.org/revelle/syllabi/454.syllabus}{course} on Latent Variable Modeling, Winter, 2007, and as a supplement to the \href{http://personality-project.org/r/r.guide.html}{Guide to R for psychologists}. Email comments to: \href{mailto:revelle@northwestern.edu}{revelle@northwestern.edu}}
\Rdcontents{}
\newpage

Much of psychometrics in particular, and psychological data analysis in general, consists of operations on vectors and matrices. This appendix offers a quick review of matrix operations, with a particular emphasis upon how to do matrix operations in R. For more information on how to use R, consult \href{http://personality-project.org/r/r.guide.html}{the short guide to R for psychologists} (at \url{http://personality-project.org/r/r.guide.html}) or the even \href{http://personality-project.org/r/r.205.tutorial.html}{shorter guide} at \url{personality-project.org/r/r.205.tutorial.html}.

\section{Vectors}
A vector is a one dimensional array of n numbers. Basic operations on a vector are addition and subtraction. Multiplication is somewhat more complicated, for the order in which two vectors are multiplied changes the result. That is, $AB \neq BA$. Consider V1 = the first 10 integers and V2 = the next 10 integers:
<<>>=
V1 <- as.vector(seq(1,10))
V2 <- as.vector(seq(11,20))
@
We can add a constant to each element in a vector
<<>>=
V4 <- V1 + 20
@
or we can add each element of the first vector to the corresponding element of the second vector
<<>>=
V3 <- V1 + V2
@
Strangely enough, a vector in R is dimensionless, but it has a length. If we want to multiply two vectors, we first need to think of the vector either as a row or as a column. A column vector can be made into a row vector (and vice versa) by the transpose operation. While a vector has no dimensions, the \textbf{transpose} of a vector is two dimensional! It is a matrix with 1 row and n columns. (Although the dim command will return no dimensions, in terms of multiplication a vector is a matrix of n rows and 1 column.) Consider the following:
<<>>=
dim(V1)       #a vector has no dimensions, just a length
length(V1)    #how many elements are in the vector
dim(t(V1))    #but the transpose is a matrix
dim(t(t(V1))) #as is the transpose of the transpose
TV <- t(V1)
t(TV)
@

\subsection{Vector multiplication}
Just as we can add a number to every element in a vector, so can we multiply a number (a ``scalar'') by every element in a vector.
<<>>=
V2 <- 4 * V1
@
There are three types of multiplication of vectors in R: simple multiplication (each term in one vector is multiplied by its corresponding term in the other vector), as well as the inner and outer products of two vectors.
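As a preview of what follows, the three operations look like this for two short vectors. (The three element vectors a and b in this sketch are made up purely for illustration; each operation is developed in detail below.)
<<>>=
a <- c(1,2,3)  #two short, made-up vectors
b <- c(4,5,6)
a * b          #simple (element by element) multiplication
t(a) %*% b     #the inner product: a 1 x 1 matrix containing sum(a*b)
a %*% t(b)     #the outer product: a 3 x 3 matrix
@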
Simple multiplication requires that each vector be of the same length. Using the V1 and V2 vectors from before, we can find the 10 products of their elements:
<<>>=
V1
V2
V1 * V2
@
The ``outer product'' of an n * 1 element vector with a 1 * m element vector will result in an n * m element matrix. (The dimensions of the resulting product are the outer dimensions of the two vectors in the multiplication.) The vector multiply operator is \%*\%. In the following equation, the subscripts refer to the dimensions of the variable.
\begin{equation}
_nX_1 * _1Y_m = _n(XY)_m
\end{equation}
<<>>=
V1 <- seq(1,10)  #a vector treated as an n x 1 column (n = 10)
V2 <- seq(1,4)   #a vector whose transpose will be a 1 x m row (m = 4)
V1
V2
outer.prod <- V1 %*% t(V2)
outer.prod
@
The outer product of the first ten integers is, of course, the multiplication table known to all elementary school students:
<<>>=
outer.prod <- V1 %*% t(V1)  #(n x 1) * (1 x n) = (n x n)
@
The ``inner product'' is perhaps a more useful operation, for it not only multiplies each corresponding element of two vectors, but also sums the resulting products:
\begin{equation}
inner.product = \sum_{i=1}^N V1_i * V2_i
\end{equation}
<<>>=
V1 <- seq(1,10)   #the transpose t(V1) will be a 1 x n row
V2 <- seq(11,20)  #a vector treated as an n x 1 column
V1
V2
in.prod <- t(V1) %*% V2   #(1 x n) * (n x 1) = (1 x 1)
in.prod
@
Note that the inner product of two vectors has length 1 but is a matrix with 1 row and 1 column. (This is the inner dimension (1) of the two vectors.)

\subsection{Simple statistics using vectors}
Although there are built in functions in R to do most of our statistics, it is useful to understand how these operations can be done using vector and matrix operations. Here we consider how to find the mean of a vector, remove it from all the numbers, and then find the average squared deviation from the mean (the variance).

Consider the mean of all numbers in a vector. To find this we just need to add up the numbers (the inner product of the vector with a vector of 1s) and then divide by n (multiply by the scalar 1/n). First we create a vector of 1s by using the repeat operation. We then show three different equations for mean.V, all of which are equivalent.
<<>>=
V <- V1
one <- rep(1,length(V))
sum.V <- t(one) %*% V
mean.V <- sum.V * (1/length(V))
mean.V <- t(one) %*% V * (1/length(V))
mean.V <- t(one) %*% V/length(V)
@
The variance is the average squared deviation from the mean. To find the variance, we first find deviation scores by subtracting the mean from each value of the vector. Then, to find the sum of the squared deviations, we take the inner product of the result with itself. This Sum of Squares becomes a variance if we divide by the degrees of freedom (n-1) to get an unbiased estimate of the population variance. First we find the mean centered vector:
<<>>=
V - as.vector(mean.V)  #as.vector drops the 1 x 1 matrix dimension of mean.V
@
And then we find the variance as the mean square by taking the inner product:
<<>>=
Var.V <- t(V - as.vector(mean.V)) %*% (V - as.vector(mean.V)) * (1/(length(V)-1))
Var.V
@
Compare these results with the more typical scale, mean and var operations:
<<>>=
scale(V,scale=FALSE)
mean(V)
var(V)
@

\subsection{Combining vectors}
We can form more complex data structures than vectors by combining the vectors, either by columns (\textbf{cbind}) or by rows (\textbf{rbind}). The resulting data structure is a matrix.
<<>>=
Xc <- cbind(V1,V2,V3)
Xr <- rbind(V1,V2,V3)
dim(Xc)
dim(Xr)
@

\section{Matrices}
A matrix is just a two dimensional (rectangular) organization of numbers. It is a vector of vectors.
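Individual rows, columns, and elements of a matrix may be addressed with square bracket notation, which is used later (for example, to drop an identification column or to select a set of predictors). A brief sketch using the Xc matrix formed above:
<<>>=
Xc[1,]   #the first row of the combined matrix
Xc[,2]   #the second column (the original V2)
Xc[3,2]  #the element in the third row and second column
@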
For data analysis, the typical data matrix is organized with columns representing different variables and rows containing the responses of a particular subject. Thus, a 10 x 4 data matrix (10 rows, 4 columns) would contain the data of 10 subjects on 4 different variables. In the example below, note that the matrix operation has taken the numbers 1 through 40 and organized them column wise. That is, a matrix is just a way (and a very convenient one at that) of organizing a vector. R provides numeric row and column names (e.g., [1,] is the first row, [,4] is the fourth column), but it is useful to label the rows and columns to make the rows (subjects) and columns (variables) distinction more obvious.
<<>>=
Xij <- matrix(seq(1,40),ncol=4)
rownames(Xij) <- paste("S",seq(1,dim(Xij)[1]),sep="")
colnames(Xij) <- paste("V",seq(1,dim(Xij)[2]),sep="")
Xij
@
Just as the transpose of a vector makes a column vector into a row vector, so does the transpose of a matrix swap the rows for the columns. Note that now the subjects are columns and the variables are the rows.
<<>>=
t(Xij)
@

\subsection{Matrix addition}
\label{addition}
The previous matrix is rather uninteresting, in that all the columns are simple products of the first column. A more typical matrix might be formed by sampling from the digits 0-9. For the purpose of this demonstration, we will set the random number seed to a memorable number so that it will yield the same answer each time.
<<>>=
set.seed(42)
Xij <- matrix(sample(seq(0,9),40,replace=TRUE),ncol=4)
rownames(Xij) <- paste("S",seq(1,dim(Xij)[1]),sep="")
colnames(Xij) <- paste("V",seq(1,dim(Xij)[2]),sep="")
print(Xij)
@
Just as we could with vectors, we can add, subtract, multiply or divide the matrix by a scalar (a number without a dimension).
<<>>=
Xij + 4
round((Xij+4)/3,2)
@
We can also multiply each row (or column, depending upon order) by a vector.
<<>>=
V
Xij * V
@

\subsection{Matrix multiplication}
Matrix multiplication is a combination of multiplication and addition. For a matrix $_mX_n$ of dimensions m * n and $_nY_p$ of dimensions n * p, the product, $_mXY_p$, is an m * p matrix where each element is the sum of the products of the rows of the first and the columns of the second. That is, the matrix $_mXY_p$ has elements $xy_{ij}$ where each
\begin{equation}
xy_{ij} = \sum_{k=1}^n x_{ik} * y_{kj}
\end{equation}
Consider our matrix Xij with 10 rows of 4 columns. Call an individual element in this matrix $x_{ij}$. We can find the sums for each column of the matrix by premultiplying Xij by the transpose of our ``one'' vector. That is, we can find $\sum_{i=1}^N X_{ij}$ for the j columns, and then divide by the number (n) of rows. (Note that we can get the same result by finding colMeans(Xij).) We can use the dim function to find out how many cases (the number of rows) and how many variables (the number of columns) there are: dim has two elements, dim(Xij)[1] = number of rows and dim(Xij)[2] = number of columns.
<<>>=
dim(Xij)
n <- dim(Xij)[1]  #the first element of dim is rows, the second columns
one <- rep(1,n)
X.means <- t(one) %*% Xij/n
@
A built in function to find the means of the columns is \textbf{colMeans}. (See \textbf{rowMeans} for the equivalent for rows.)
<<>>=
colMeans(Xij)
@
Variances and covariances are measures of dispersion around the mean. We find these by first subtracting the means from all the observations. This mean centered matrix is the original matrix minus a matrix of means. To make them have the same dimensions, we premultiply the means vector by a vector of ones and subtract this from the data matrix.
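To see what this matrix of means looks like before it is subtracted, the following brief check (using the one vector and X.means found above) simply prints the product; each of the 10 rows repeats the 4 column means.
<<>>=
round(one %*% X.means,2)  #a 10 x 4 matrix whose rows all repeat the column means
@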
<<>>=
X.diff <- Xij - one %*% X.means
@
To find the variance/covariance matrix, we can first find the inner product of the mean centered matrix (X.diff = Xij - X.means) with itself, t(X.diff) \%*\% X.diff, and divide by n-1. We can compare this result to the result of the \emph{cov} function (the normal way to find covariances).
<<>>=
X.cov <- t(X.diff) %*% X.diff /(n-1)
round(X.cov,2)
round(cov(Xij),2)  #compare with the covariance operation
@

\subsection{Finding and using the diagonal}
Some operations need to find just the diagonal. For instance, the diagonal of the matrix X.cov (found above) contains the variances of the items. To extract just the diagonal, or to create a matrix with a particular diagonal, we use the \textbf{diag} command. We can convert the covariance matrix X.cov to a correlation matrix X.cor by pre and post multiplying the covariance matrix with a diagonal matrix containing the reciprocals of the standard deviations (square roots of the variances). Remember that the correlation, $r_{xy}$, is merely the $covariance_{xy}/\sqrt{V_xV_y}$. Compare this to the standard command for finding correlations, \textbf{cor}.
<<>>=
round(X.cov,2)
round(diag(X.cov),2)
sdi <- diag(1/sqrt(diag(X.cov)))
rownames(sdi) <- colnames(sdi) <- colnames(X.cov)
round(sdi,2)
X.cor <- sdi %*% X.cov %*% sdi
rownames(X.cor) <- colnames(X.cor) <- colnames(X.cov)
round(X.cor,2)
round(cor(Xij),2)
@

\subsection{The Identity Matrix}
The identity matrix is merely that matrix which, when multiplied by another matrix, yields the other matrix. (It is the equivalent of 1 in ordinary arithmetic.) It is a diagonal matrix with 1s on the diagonal.
<<>>=
I <- diag(1,nrow=dim(X.cov)[1],ncol=dim(X.cov)[2])
@

\subsection{Matrix Inversion}
The inverse of a square matrix is the matrix equivalent of dividing by that matrix. That is, either pre or post multiplying a matrix by its inverse yields the identity matrix. The inverse is particularly important in multiple regression, for it allows us to solve for the beta weights. Given the equation
\begin{equation}
Y = bX + c
\end{equation}
we can solve for b by multiplying both sides of the equation by $X^{-1}$ (ignoring the constant):
\begin{equation}
YX^{-1} = bXX^{-1} = b
\end{equation}
We can find the inverse by using the \textbf{solve} function. To show that $XX^{-1} = X^{-1}X = I$, we do the multiplication.
<<>>=
X.inv <- solve(X.cov)
round(X.cov %*% X.inv,2)
round(X.inv %*% X.cov,2)
@
There are multiple ways of finding the matrix inverse; solve is just one of them.

\section{Matrix operations for data manipulation}
Using the basic matrix operations of addition and multiplication allows for easy manipulation of data. In particular, finding subsets of data, scoring multiple scales for one set of items, or finding correlations and reliabilities of composite scales are all operations that are easy to do with matrix operations.

In the next example we consider 5 extraversion items for 200 subjects collected as part of the Synthetic Aperture Personality Assessment project. The items are taken from the \href{http://ipip.ori.org}{International Personality Item Pool (ipip.ori.org)}. The data are stored at the \href{http://personality-project.org}{personality-project.org} web site and may be retrieved in R. Because the first column of the data matrix is the subject identification number, we remove it before doing our calculations.
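As a brief aside before reading the data: the \textbf{solve} function introduced above also solves a system of simultaneous equations directly, since solve(A, b) returns the x that satisfies $Ax = b$ without explicitly forming the inverse. A minimal sketch (the small matrix A and vector b here are made up purely for illustration):
<<>>=
A <- matrix(c(2,1,1,3),ncol=2)  #a small, invertible 2 x 2 matrix (made-up numbers)
b <- c(5,10)
x <- solve(A,b)                 #solves A x = b
x
A %*% x                         #recovers b
@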
<<>>=
datafilename <- "http://personality-project.org/R/datasets/extraversion.items.txt"
items <- read.table(datafilename,header=TRUE)  #read the data from the remote file
items <- items[,-1]  #the first column is an identification field and should be dropped
dim(items)
@
We first use functions from the \href{http://personality-project.org/r/psych-manual.pdf}{psych} package to describe these data both numerically and graphically. (The psych package may be downloaded from the personality-project.org web page as a source file.)
<<fig=TRUE>>=
library(psych)
describe(items)
pairs.panels(items)
@
We can form two composite scales, one made up of the first 3 items, the other made up of the last 2 items. Note that the second (q1480) and fourth (q1180) items are negatively correlated with the remaining 3 items. This implies that we should reverse these items before scoring. Forming the composite scales, reversing the items, and finding the covariances and then the correlations between the scales may all be done by matrix operations on either the items themselves or on the covariances between the items. In either case, we want to define a ``keys'' matrix describing which items to combine on which scale. The correlations are, of course, merely the covariances divided by the square roots of the variances.

\subsection{Matrix operations on the raw data}
<<>>=
keys <- matrix(c(1,-1,1,0,0,  0,0,0,-1,1),ncol=2)
X <- as.matrix(items)
X.ij <- X %*% keys
n <- dim(X.ij)[1]
one <- rep(1,dim(X.ij)[1])
X.means <- t(one) %*% X.ij/n
X.cov <- t(X.ij - one %*% X.means) %*% (X.ij - one %*% X.means) /(n-1)
round(X.cov,2)
X.sd <- diag(1/sqrt(diag(X.cov)))
X.cor <- t(X.sd) %*% X.cov %*% (X.sd)
round(X.cor,2)
@

\subsection{Matrix operations on the correlation matrix}
<<>>=
keys <- matrix(c(1,-1,1,0,0,  0,0,0,-1,1),ncol=2)
X.cor <- cor(X)
round(X.cor,2)
X.cov <- t(keys) %*% X.cor %*% keys
X.sd <- diag(1/sqrt(diag(X.cov)))
X.cor <- t(X.sd) %*% X.cov %*% (X.sd)
keys
round(X.cov,2)
round(X.cor,2)
@

\subsection{Using matrices to find test reliability}
The reliability of a test may be thought of as the correlation of the test with a test just like it. One conventional estimate of reliability, based upon the concepts of domain sampling theory, is coefficient alpha ($\alpha$). For a test with just one factor, $\alpha$ is an estimate of the amount of the test variance due to that factor. However, if there are multiple factors in the test, $\alpha$ neither estimates how much of the test variance is due to one general factor, nor does it estimate the correlation of the test with another test just like it. (See \href{http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf}{Zinbarg et al., 2005} for a discussion of alternative estimates of reliability.)

Given either a covariance or correlation matrix of items, $\alpha$ may be found by simple matrix operations:

1) Let V = the correlation or covariance matrix of the items in the scale.

2) Let Vt = the sum of all the elements of V for that scale (the total variance of the scale).

3) Let n = the number of items in the scale.

4) Then alpha = ((Vt - tr(V))/Vt) * (n/(n-1)), where tr(V) is the sum of the diagonal elements of V.

To demonstrate the use of matrices to find coefficient $\alpha$, consider the five items measuring extraversion taken from the International Personality Item Pool. Two of the items need to be weighted negatively (reverse scored). Alpha may be found from either the correlation matrix (standardized alpha) or the covariance matrix (raw alpha). In the case of standardized alpha, tr(V) is the same as the number of items.
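As a quick numeric check of this formula, consider a hypothetical scale of three standardized items in which every pair of items correlates .5. (The matrix toy.V below is made up purely for illustration; for three items with an average correlation of .5 the formula gives $\alpha = .75$.)
<<>>=
toy.V <- matrix(.5,3,3)  #a hypothetical 3 x 3 correlation matrix (made-up numbers)
diag(toy.V) <- 1         #unit variances on the diagonal
Vt <- sum(toy.V)         #total variance of the composite
n.items <- 3
alpha.toy <- ((Vt - sum(diag(toy.V)))/Vt) * (n.items/(n.items-1))
alpha.toy                #.75 for three items that each correlate .5
@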
Using a key matrix, we can find the reliability of 3 different scales: the first made up of the first 3 items, the second of the last 2 items, and the third of all 5 items.
<<>>=
datafilename <- "http://personality-project.org/R/datasets/extraversion.items.txt"
items <- read.table(datafilename,header=TRUE)  #read the data from the remote file
items <- items[,-1]  #the first column is an identification field and should be dropped
key <- matrix(c(1,-1,1,0,0,  0,0,0,-1,1,  1,-1,1,-1,1),ncol=3)
key
raw.r <- cor(items)
V <- t(key) %*% raw.r %*% key  #reverse the 2nd and 4th items
rownames(V) <- colnames(V) <- c("V1-3","V4-5","V1-5")
round(V,2)  #the covariances of the standardized items
n <- diag(t(key) %*% key)
alpha <- (diag(V) - n)/(diag(V)) * (n/(n-1))
round(alpha,2)
@

\section{Multiple correlation}
Given a set of n predictors of a criterion variable, what is the optimal weighting of the n predictors? This is, of course, the problem of multiple correlation or multiple regression. Although we would normally use the \textbf{linear model (lm)} function to solve this problem, we can also do it from the raw data, or from a matrix of covariances or correlations, by using matrix operations and the solve function.

Consider the data set, X, created in section \ref{addition}. If we want to predict V4 as a function of the first three variables, we can do so three different ways: using the raw data, using deviation scores of the raw data, or with the correlation matrix of the data. For simplicity, let's relabel $V_4$ to be Y and $V_1 \ldots V_3$ to be $X_1 \ldots X_3$, and then define X as the first three columns and Y as the last column:
<<>>=
set.seed(42)
Xij <- matrix(sample(seq(0,9),40,replace=TRUE),ncol=4)
rownames(Xij) <- paste("S",seq(1,dim(Xij)[1]),sep="")
colnames(Xij) <- c("X1","X2","X3","Y")
X <- Xij[,1:3]
Y <- Xij[,4]
print(X)
print(Y)
@

\subsection{Data level analyses}
At the data level, we can work with the raw data matrix X, or convert these to deviation scores (X.dev) by subtracting the means from all elements of X. At the raw data level we have
\begin{equation}
_m\hat{Y}_1 = {_mX_n}\, {_n\beta_1} + {_m\epsilon_1}
\end{equation}
and we can solve for $_n\beta_1$ by premultiplying by $_nX'_m$ (thus making the matrix on the right side of the equation into a square matrix so that we can multiply through by an inverse)
\begin{equation}
{_nX'_m}\, {_m\hat{Y}_1} = {_nX'_m}\, {_mX_n}\, {_n\beta_1} + {_nX'_m}\, {_m\epsilon_1}
\end{equation}
and then solving for beta by premultiplying both sides of the equation by $(X'X)^{-1}$
\begin{equation}
\beta = (X'X)^{-1}X'Y
\end{equation}
These beta weights will be the weights with no intercept. Compare this solution to the one using the lm function with the intercept removed:
<<>>=
beta <- solve(t(X) %*% X) %*% (t(X) %*% Y)
round(beta,2)
lm(Y ~ -1 + X)
@
If we want to find the intercept as well, we can add a column of 1s to the X matrix.
<<>>=
one <- rep(1,dim(X)[1])
X <- cbind(one,X)
print(X)
beta <- solve(t(X) %*% X) %*% (t(X) %*% Y)
round(beta,2)
lm(Y ~ X)
@
We can do the same analysis with deviation scores. Let X.dev be a matrix of deviation scores; then we can write the equation
\begin{equation}
\hat{Y} = X.dev\,\beta + \epsilon
\end{equation}
and solve for
\begin{equation}
\beta = (X.dev'X.dev)^{-1} X.dev'Y.
\end{equation}
(We don't need to worry about the sample size here because n cancels out of the equation.)
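A minimal sketch of this deviation score solution, reusing the Xij matrix defined above (the centering here is done with scale, and the resulting slopes should agree with those from the regression that included an intercept):
<<>>=
X.dev <- scale(Xij[,1:3],scale=FALSE)  #mean centered predictors
Y.dev <- scale(Xij[,4],scale=FALSE)    #mean centered criterion
beta.dev <- solve(t(X.dev) %*% X.dev) %*% (t(X.dev) %*% Y.dev)
round(beta.dev,2)  #the same slopes as the regression with an intercept
@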
At the structure level, the covariance matrices $X.dev'X.dev/(n-1)$ and $X.dev'Y/(n-1)$ may be replaced by the correlation matrices R and $r_{xy}$ (formed by pre and post multiplying by a diagonal matrix of 1/sd), and we then solve the equation
\begin{equation}
\beta = R^{-1} r_{xy}
\end{equation}
Consider the set of 3 variables with intercorrelations (R)
\begin{verbatim}
      x1   x2   x3
x1  1.00 0.56 0.48
x2  0.56 1.00 0.42
x3  0.48 0.42 1.00
\end{verbatim}
and correlations of x with y ($r_{xy}$)
\begin{verbatim}
    x1   x2  x3
y  0.4 0.35 0.3
\end{verbatim}
From the correlation matrix, we can use the solve function to find the optimal beta weights.
<<>>=
R <- matrix(c(1,.56, .48, .56, 1, .42, .48, .42, 1),ncol=3)
rxy <- matrix(c(.4, .35, .3),ncol=1)
colnames(R) <- rownames(R) <- c("x1","x2","x3")
rownames(rxy) <- c("x1","x2","x3")
colnames(rxy) <- "y"
beta <- solve(R,rxy)
round(beta,2)
@

\subsection{Non optimal weights and the goodness of fit}
Although the beta weights are optimal given the data, it is well known (e.g., the robust beauty of linear models discussed by Robyn Dawes) that if the predictors are all adequate and in the same direction, the error in prediction of Y is rather insensitive to the weights that are used. This can be shown graphically by varying the weights of x1 and x2 relative to x3 and then finding the error in prediction. Note that the surface is relatively flat at its minimum.

We show this for several different values of the R and $r_{xy}$ matrices by first defining two functions (f and g) and then applying these functions with different values of R and $r_{xy}$. The first, f, finds the multiple r for given values of $b_{x1}/b_{x3}$ and $b_{x2}/b_{x3}$. The second, g, applies f to a grid of relative weights ranging from low to high and finds the error variance ($1-r^2$) for each case.
<<>>=
f <- function(x,y) {
   xy <- diag(c(x,y,1))
   c <- rxy %*% xy
   d <- xy %*% R %*% xy
   cd <- sum(c)/sqrt(sum(d))
   return(cd)
   }

g <- function(rxy,R,n=60,low=-2,high=4,...) {
   op <- par(bg = "white")
   x <- seq(low,high,length=n)
   y <- x
   z <- outer(x,y)
   for (i in 1:n) {
      for (j in 1:n) {
         r <- f(x[i],y[j])
         z[i,j] <- 1 - r^2
         }
      }
   persp(x, y, z, theta = 40, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75, ticktype = "detailed", zlim = c(.5,1),
      xlab = "x1/x3", ylab = "x2/x3", zlab = "Error")
   zmin <- which.min(z)
   ymin <- trunc(zmin/n)
   xmin <- zmin - ymin*n
   xval <- x[xmin+1]
   yval <- y[trunc(ymin)+1]
   title(paste("Error as function of relative weights  min values at x1/x3 = ",
      round(xval,1), " x2/x3 = ", round(yval,1)))
   }

R <- matrix(c(1,.56, .48, .56, 1, .42, .48, .42, 1),ncol=3)
rxy <- matrix(c(.4, .35, .3),nrow=1)
colnames(R) <- rownames(R) <- c("x1","x2","x3")
colnames(rxy) <- c("x1","x2","x3")
rownames(rxy) <- "y"
@

\newpage
Graph of the residual error variance as a function of the relative weights $b_{x1}/b_{x3}$ and $b_{x2}/b_{x3}$ for the following values of R and $r_{xy}$:
<<fig=TRUE>>=
R
rxy
g(rxy,R)
@

\newpage
Show the response surface for uncorrelated predictors:
<<fig=TRUE>>=
R <- matrix(c(1,.0, 0, .0, 1, 0, .0, .0, 1),ncol=3)
rxy <- matrix(c(.4, .35, .3),nrow=1)
colnames(R) <- rownames(R) <- c("x1","x2","x3")
colnames(rxy) <- c("x1","x2","x3")
rownames(rxy) <- "y"
R
rxy
g(rxy,R)
@

\end{document}