Appendix A — Matrix Algebra

In this class, we will be using matrices to help us organize data and information. Thus, it is useful to know some basic definitions and properties of matrices. Linear algebra is not a prerequisite for this course, so if this is new to you, don’t fret!

Use this mathematical section as a reference throughout the course. I’ve purposefully put more here than we might need. If I refer to a property or term that you are unfamiliar with, come back and check this section first.

A.1 Matrix & Vector Addition

If \(\mathbf{A}\) is a (\(r \times c\)) matrix and \(\mathbf{B}\) is a (\(r \times c\)) matrix (Note: nrow(A) = nrow(B) and ncol(A) = ncol(B)!), then \(\mathbf{A + B}\) is a (\(r \times c\)) matrix with \((i, j)\)th element \[(\mathbf{A + B})_{ij} = a_{ij}+ b_{ij}\]

A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)

A + B #element wise addition
     [,1] [,2]
[1,]    6   10
[2,]    8   12

A.1.1 Properties

Here are some useful properties of matrix addition (a quick numerical check in R follows the list):

  1. Commutative Property

\[ \mathbf{A}+ \mathbf{B} = \mathbf{B + A}\]

  2. Associative Property

\[ \mathbf{A} + (\mathbf{B} + \mathbf{C}) = (\mathbf{A} + \mathbf{B}) + \mathbf{C}\]

  3. Additive Identity \[\mathbf{A} + \mathbf{0}= \mathbf{A}\] where \(\mathbf{0}\) is an \(r\times c\) matrix of zeros.

  4. Additive Inverse

  • For any \(r\times c\) matrix \(\mathbf{A}\), there is an \(r\times c\) matrix \(\mathbf{B}\) (\(= -\mathbf{A}\)) such that \[\mathbf{A} + \mathbf{B}= \mathbf{0}\]
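These identities hold for any conformable matrices; as a quick sanity check (not a proof), we can verify them numerically in R for one example. The matrix C below is an arbitrary third matrix I added for the associative property.

A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)
C = matrix(c(1,0,2,1),nrow=2,ncol=2) #an arbitrary third matrix

all.equal(A + B, B + A)             #commutative property: TRUE
all.equal(A + (B + C), (A + B) + C) #associative property: TRUE
all.equal(A + matrix(0,2,2), A)     #additive identity: TRUE
A + (-A)                            #additive inverse: a matrix of zeros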

A.2 Matrix & Vector Multiplication

If \(\mathbf{A}\) is a (\(r \times q\)) matrix and \(\mathbf{B}\) is a (\(q \times c\)) matrix (Note: ncol(A) = nrow(B)!), then \(\mathbf{AB}\) is a (\(r \times c\)) matrix with \((i, j)\)th element \[(\mathbf{AB})_{ij} = \sum^q_{k=1}a_{ik}b_{kj}\] where \(a_{ik}\) is the element in the \(i\)th row and \(k\)th column of \(\mathbf{A}\) and \(b_{kj}\) is the element in the \(k\)th row and \(j\)th column of \(\mathbf{B}\).

In the product \(\mathbf{AB}\), we say that \(\mathbf{A}\) is postmultiplied by \(\mathbf{B}\) and, equivalently, that \(\mathbf{B}\) is premultiplied by \(\mathbf{A}\). Unlike addition, the order of multiplication matters!

A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)

A * B #element wise multiplication (not what you want....)
     [,1] [,2]
[1,]    5   21
[2,]   12   32
A %*% B #matrix multiplication
     [,1] [,2]
[1,]   23   31
[2,]   34   46

A.2.1 Properties

Here are some useful properties of matrix multiplication (a quick numerical check in R follows the list):

  1. Associative Property

\[ \mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{AB})\mathbf{C}\]

  2. Distributive Property

\[ \mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{AB} + \mathbf{AC}, \quad (\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{AC} + \mathbf{BC}\]

  3. Multiplicative Identity \[\mathbf{I_r A} = \mathbf{AI_c } = \mathbf{A}\] where \(\mathbf{I_r}\) is an \(r\times r\) matrix with 1’s along the diagonal and 0’s otherwise. Generally, we’ll drop the subscript and infer the dimension from context.
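Again as a numerical sanity check (not a proof), we can verify these for one example in R. The matrix C is an arbitrary extra matrix, and diag(2) is the 2x2 identity matrix.

A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)
C = matrix(c(1,0,2,1),nrow=2,ncol=2) #an arbitrary third matrix

all.equal(A %*% (B %*% C), (A %*% B) %*% C) #associative property: TRUE
all.equal(A %*% (B + C), A %*% B + A %*% C) #distributive property: TRUE
all.equal(diag(2) %*% A, A %*% diag(2))     #both equal A (multiplicative identity)
isTRUE(all.equal(A %*% B, B %*% A))         #FALSE: order matters for multiplication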

A.3 Matrix Transpose

The transpose of any (\(r\times c\)) \(\mathbf{A}\) matrix is the (\(c \times r\)) matrix denoted as \(\mathbf{A}^T\) or \(\mathbf{A}'\) such that \(a_{ij}\) is replaced by \(a_{ji}\) everywhere.

A
     [,1] [,2]
[1,]    1    3
[2,]    2    4
t(A)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

A matrix is square if \(r=c\). The diagonal of a square matrix consists of the elements \(a_{ii}\), and the off-diagonal elements of a square matrix are the \(a_{ij}\) with \(i\not=j\).

A matrix \(\mathbf{A}\) is symmetric if \(\mathbf{A} = \mathbf{A}^T\).

For any matrix \(\mathbf{A}\), \(\mathbf{A}^T\mathbf{A}\) will be a square (and, in fact, symmetric) matrix.

A.3.1 Properties

The transpose of a matrix product is \[(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T\] The transpose of a matrix sum is \[(\mathbf{A}+\mathbf{B})^T = \mathbf{A}^T+\mathbf{B}^T\]
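As a quick numerical check of these properties in R (using the same A and B as above):

A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)

all.equal(t(A %*% B), t(B) %*% t(A)) #transpose of a product reverses the order: TRUE
all.equal(t(A + B), t(A) + t(B))     #transpose of a sum: TRUE
t(A) %*% A                           #a square, symmetric matrix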

A.4 Inner product

For two vectors \(\mathbf{x}\) and \(\mathbf{y}\), the standard inner (dot) product of the two vectors in \(\mathbb{R}^k\) is \[<\mathbf{x},\mathbf{y}> = \mathbf{x}^T\mathbf{y} = \sum_{i=1}^k x_iy_i = x_{1}y_1 + x_2y_2+\cdots+x_ky_k\]

To be an inner product, it has to satisfy 1) \(<\mathbf{x},\mathbf{x}>\geq 0\) with equality if and only if \(\mathbf{x}=\mathbf{0}\), 2) \(<\mathbf{x},\mathbf{y}> = <\mathbf{y},\mathbf{x}>\), and 3) \(<a\mathbf{x}+b\mathbf{y},\mathbf{z}> = a<\mathbf{x},\mathbf{z}>+b<\mathbf{y},\mathbf{z}>\).

x = 1:10
y = 21:30
x
 [1]  1  2  3  4  5  6  7  8  9 10
y
 [1] 21 22 23 24 25 26 27 28 29 30
class(x); class(y)
[1] "integer"
[1] "integer"
x %*% y #If x is a numeric vector (not a matrix), R doesn't differentiate between a column vector and a row vector
     [,1]
[1,] 1485
x = matrix(x,ncol=1) #If x is a column vector (matrix with 1 column), it matters.
y = matrix(y,ncol=1)

class(x); class(y)
[1] "matrix" "array" 
[1] "matrix" "array" 
#x%*%y #Would give an error: a 10x1 matrix times a 10x1 matrix is non-conformable

t(x)%*%y
     [,1]
[1,] 1485

A.5 Vector Difference

The vector difference between two \(k\)-dimensional vectors \(\mathbf{x}\) and \(\mathbf{y}\) is calculated element by element,

\[(\mathbf{x - y})_{i} = x_{i}- y_{i}\]
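In R this is just element-wise subtraction; a minimal example with two made-up vectors:

x = c(1, 4, 6)
y = c(2, 2, 5)
x - y #element wise difference: -1  2  1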

A.6 Vector Length

For a vector \(\mathbf{x}\in \mathbb{R}^k\), a general vector norm \(||\mathbf{x}||_p\) for \(p=1,2,...\) is defined as \[||\mathbf{x}||_p = \left(\sum_{i=1}^k |x_i|^p \right)^{1/p} \] As \(p\rightarrow \infty\), we have a special case, \[||\mathbf{x}||_{\infty} = \max_i |x_i|\]

  • L2-Norm: The most commonly used vector norm is when \(p=2\), \[||\mathbf{x}||_2 = \sqrt{\sum_{i=1}^k x_i^2} \]

To be a norm, the following must be true: 1) \(||\mathbf{x}|| \geq 0\) with equality if and only if \(\mathbf{x}=\mathbf{0}\), 2) \(||a\mathbf{x}|| =|a|\,||\mathbf{x}||\), and 3) \(||\mathbf{x}+\mathbf{y}|| \leq ||\mathbf{x}|| + ||\mathbf{y}||\) (triangle inequality).

sqrt(t(x)%*%x)
         [,1]
[1,] 19.62142
sqrt(sum(x^2))
[1] 19.62142

A.7 Vector Distance

  • Euclidean distance: In Euclidean space, we often call the L2-norm a Euclidean Norm. This gives the length of the vector from the origin. More generally, the Euclidean distance between two vectors \(\mathbf{x}\) and \(\mathbf{y}\) is the L2-norm of the difference, \[d(\mathbf{x},\mathbf{y}) = ||\mathbf{x} - \mathbf{y}||_2 = \sqrt{\sum_i (x_i - y_i)^2} = \sqrt{(\mathbf{x}-\mathbf{y})^T(\mathbf{x}-\mathbf{y})} \]

This is the distance “as the crow flies.”

sqrt(sum((x-y)^2))
[1] 63.24555
sqrt(t(x-y)%*%(x-y))
         [,1]
[1,] 63.24555
  • Minkowski distance: A more general distance measurement between two vectors is based on the general definition of the vector norm called the Minkowski distance, \[d(\mathbf{x},\mathbf{y}) = ||\mathbf{x} - \mathbf{y}||_p = \left(\sum_i |x_i - y_i|^p \right)^{1/p} \]

If \(p=2\), then you get the Euclidean distance (as the crow flies), \[d(\mathbf{x},\mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2 }\]

If \(p=1\), then you get the Manhattan distance (city-block distance), \[d(\mathbf{x},\mathbf{y}) = \sum_i |x_i - y_i| \]

If \(p\rightarrow\infty\), then you get Chebyshev distance, \[d(\mathbf{x},\mathbf{y}) =||\mathbf{x}-\mathbf{y}||_{\infty} = \max_i |x_i-y_i|\]

  • Mahalanobis distance: In the context of random variables and data analysis, imagine trying to find the distance between two individuals based on the values of \(k\) different variables. We could find the differences in values amongst all of the variables between the two individuals. However, the scale and units of each variable might be different. We could standardize each difference by dividing by the standard deviation of that variable amongst the entire sample so that the differences are comparable. To also incorporate the dependence (covariance) between the variables, the Mahalanobis distance is a modified Euclidean distance and is calculated as \[d(\mathbf{x},\mathbf{y})= \sqrt{(\mathbf{x}-\mathbf{y})^T\mathbf{S}^{-1}(\mathbf{x}-\mathbf{y})} \] where \(\mathbf{S}\) is the \(k\times k\) sample covariance matrix of the variables based on the sample. A short R sketch of these distance calculations follows.
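Here is a minimal R sketch of the Manhattan, Euclidean, Chebyshev, Minkowski, and Mahalanobis distances. The vectors x and y and the data matrix dat (used only to estimate a covariance matrix S) are made-up examples; note that the built-in mahalanobis() returns the squared distance.

x = c(1, 4, 6)
y = c(2, 2, 5)

sum(abs(x - y))      #Manhattan (p=1) distance
sqrt(sum((x - y)^2)) #Euclidean (p=2) distance
max(abs(x - y))      #Chebyshev (p -> infinity) distance
dist(rbind(x, y), method = "minkowski", p = 3) #general Minkowski distance

#Mahalanobis distance needs a sample covariance matrix S estimated from data;
#dat is a made-up sample of 100 observations on 3 variables, for illustration only
set.seed(1)
dat = matrix(rnorm(300), ncol = 3)
S = cov(dat)
sqrt(t(x - y) %*% solve(S) %*% (x - y))   #directly from the definition
sqrt(mahalanobis(x, center = y, cov = S)) #built-in function (returns the squared distance)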

A.8 Vector Space

A vector space is a set of vectors that is closed under vector addition and scalar multiplication. The vectors must also adhere to the usual axioms such as associativity, commutativity, etc. (8 in all). We will focus on real vector spaces in \(\mathbb{R}^k\).

The vector \(\mathbf{y} = a_1\mathbf{x_1}+a_2\mathbf{x_2}+\cdots+a_k\mathbf{x_k}\) is a linear combination of the vectors \(\mathbf{x_1},\mathbf{x_2},...,\mathbf{x_k}\). The set of all linear combinations of \(\mathbf{x_1},\mathbf{x_2},...,\mathbf{x_k}\) is their linear span.

A set of vectors is said to be linearly dependent if one of the vectors in the set can be written as a linear combination of the other vectors. In other words, if there exist \(k\) scalars (\(a_1,a_2,...,a_k\)), not all zero, such that \[a_1\mathbf{x_1}+a_2\mathbf{x_2}+\cdots+a_k\mathbf{x_k} = \mathbf{0} \] If the only way to satisfy this equation is with \(a_1 = \cdots = a_k = 0\), then the vectors are said to be linearly independent.

  • Theorem: A set of \(k\) vectors in \(\mathbb{R}^k\) is linearly dependent if and only if the square matrix whose columns are those vectors is singular (see below for the definition of singularity).

  • Basis: A basis is a set of linearly independent vectors that spans the whole vector space (i.e., any vector in the vector space can be written as a linear combination of the basis elements).

A.9 Rank of matrix

The rank of \(\mathbf{A}\) is the dimension of the row space of \(\mathbf{A}\) (the space spanned by the rows of \(\mathbf{A}\)), which equals the dimension of the column space of \(\mathbf{A}\) (the space spanned by the columns of \(\mathbf{A}\)).

  • The rank is the maximum number of linearly independent columns of a matrix.
  • A matrix is full rank if the rank of the matrix is equal to the number of columns.
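In R, one common way to get the rank numerically is via the QR decomposition. A small sketch with a made-up matrix whose third column is the sum of the first two (so it cannot be full rank):

X = matrix(c(1, 2, 3,
             4, 5, 6,
             5, 7, 9), nrow = 3) #third column = first column + second column
qr(X)$rank       #2: only two linearly independent columns, so X is not full rank
qr(diag(3))$rank #3: the 3x3 identity matrix is full rank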

A.10 Singularity

A square matrix \(\mathbf{A}\) is nonsingular if \(\mathbf{Ax} = \mathbf{0}\) implies that \(\mathbf{x} = \mathbf{0}\). If a matrix fails to be nonsingular, it is called singular.

  • A square matrix is nonsingular if and only if its rank is equal to the number of rows (equivalently, columns).

A.11 Determinant

The determinant of a square \(k\times k\) matrix \(\mathbf{A}\) (with \(k\geq 2\)) is the scalar \[|\mathbf{A}| = \sum^k_{j=1}a_{1j}|\mathbf{A}_{1j}|(-1)^{1+j} \] where \(\mathbf{A}_{1j}\) is the \((k-1)\times(k-1)\) matrix obtained by deleting the first row and the \(j\)th column of \(\mathbf{A}\). For \(k=1\), \(|\mathbf{A}| = a_{11}\).

A
     [,1] [,2]
[1,]    1    3
[2,]    2    4
det(A)
[1] -2
  • Theorem: The determinant of a matrix \(\mathbf{A}\) is 0 if and only if \(\mathbf{A}\) is singular.

A.12 Matrix Inverse

The matrix \(\mathbf{B}\) such that \(\mathbf{AB} = \mathbf{BA} = \mathbf{I}\) (where \(\mathbf{I}\) is the identity matrix) is called the inverse of \(\mathbf{A}\) and is denoted by \(\mathbf{A}^{-1}\).

  • If \(\mathbf{A}\) is a nonsingular square matrix, then there is a unique matrix \(\mathbf{B}\) such that \[\mathbf{AB} = \mathbf{BA} = \mathbf{I} \] Equivalently, the inverse exists if and only if the determinant of \(\mathbf{A}\) is not zero.

  • For nonsingular square matrices \(\mathbf{A}\) and \(\mathbf{B}\) of the same dimension, \[(\mathbf{A}^{-1})^T = (\mathbf{A}^T)^{-1}\] \[(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\]
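In R, solve(A) returns the inverse of a nonsingular square matrix. A quick check using the 2x2 matrix A from earlier:

A = matrix(c(1,2,3,4),nrow=2,ncol=2)
Ainv = solve(A)                     #the inverse exists since det(A) = -2 is nonzero
A %*% Ainv                          #the 2x2 identity matrix (up to rounding)
all.equal(solve(t(A)), t(solve(A))) #(A^T)^{-1} = (A^{-1})^T: TRUE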

A.13 Trace of a Square Matrix

The trace of a square matrix is the sum of the diagonal elements, \(tr(\mathbf{A}) = \sum_j a_{jj}\).

A.13.1 Properties

\[tr(c\mathbf{A}) = c\; tr(\mathbf{A})\] \[tr(\mathbf{A}\pm \mathbf{B}) = tr(\mathbf{A})\pm tr(\mathbf{B})\] \[tr(\mathbf{AB}) = tr(\mathbf{BA})\]
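Base R has no built-in trace function, but sum(diag(A)) computes it directly. A quick numerical check of the last property using the earlier A and B:

A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)

sum(diag(A))                                      #trace of A: 1 + 4 = 5
all.equal(sum(diag(A %*% B)), sum(diag(B %*% A))) #tr(AB) = tr(BA): TRUE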

A.14 Vector Projection

The projection of a vector \(\mathbf{y}\) onto a vector \(\mathbf{x}\) is \[\frac{\mathbf{y}^T\mathbf{x}}{||\mathbf{x}||_2^2}\mathbf{x} \]

The orthogonal projection of a vector \(\mathbf{y}\) onto the column space of a matrix \(\mathbf{X}\) gives you the \(\hat{\mathbf{y}} = \mathbf{X}\boldsymbol\beta\) that minimizes \(||\mathbf{y} - \hat{\mathbf{y}}||\), where the minimizing coefficient vector is

\[\boldsymbol\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]
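A minimal R sketch of both projections, using made-up vectors x and y and a two-column matrix X (an intercept column and x); the last line checks that the residual is orthogonal to the column space.

x = c(1, 2, 2)
y = c(3, 1, 4)

as.numeric((t(y) %*% x) / sum(x^2)) * x #projection of y onto the vector x

X = cbind(1, x)                         #matrix with an intercept column and x
beta = solve(t(X) %*% X) %*% t(X) %*% y #coefficients of the orthogonal projection
yhat = X %*% beta                       #the projected vector (closest point in Col(X))
t(X) %*% (y - yhat)                     #approximately zero: residual is orthogonal to Col(X)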

A.15 Orthogonality

A square matrix is orthogonal if its rows, considered as vectors, are mutually perpendicular and have unit lengths; that is, \(\mathbf{A}\mathbf{A}^T = \mathbf{I}\).
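A classic example of an orthogonal matrix is a 2x2 rotation matrix; a small R check (the angle below is an arbitrary choice):

theta = pi/6
A = matrix(c(cos(theta), sin(theta),
            -sin(theta), cos(theta)), nrow=2) #2x2 rotation matrix
round(A %*% t(A), 10)                         #the 2x2 identity matrix (up to rounding)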

A.16 Eigenvalues and Eigenvectors

The scalars, \(\lambda\), that satisfy the polynomial characteristic equation \(|\mathbf{A}-\lambda\mathbf{I}| = 0\) are called eigenvalues of matrix \(\mathbf{A}\).

If \(\mathbf{e}\) is a non-zero vector such that \[\mathbf{Ae} = \lambda\mathbf{e} \] then \(\mathbf{e}\) is said to be an eigenvector of the matrix \(\mathbf{A}\) associated with the eigenvalue \(\lambda\).
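In R, eigen() returns the eigenvalues and eigenvectors (one eigenvector per column). A small sketch with an arbitrary symmetric matrix, verifying the definition for the first eigenpair:

A = matrix(c(2, 1, 1, 2), nrow = 2) #a symmetric matrix with eigenvalues 3 and 1
eA = eigen(A)
eA$values  #eigenvalues
eA$vectors #eigenvectors (columns)

#check the definition Ae = lambda*e for the first eigenpair
all.equal(as.numeric(A %*% eA$vectors[,1]), eA$values[1] * eA$vectors[,1])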

A.17 Positive Definiteness

A quadratic form in \(k\) variables \(x_1,...,x_k\) is \(\mathbf{x}^T\mathbf{Ax}\), where \(\mathbf{x}^T = [x_1,...,x_k]\) and \(\mathbf{A}\) is a symmetric \(k\times k\) matrix. It can be written as \(\sum^k_{i=1}\sum^k_{j=1}a_{ij}x_ix_j\).

A symmetric matrix is positive definite if \[\mathbf{x}^T\mathbf{Ax}>0\] for all vectors \(\mathbf{x}\not= \mathbf{0}\). If a matrix is positive definite, then all of the eigenvalues will be positive.

A symmetric matrix is nonnegative definite if \[\mathbf{x}^T\mathbf{Ax}\geq 0\] for all vectors \(\mathbf{x}\not= \mathbf{0}\).

Cholesky Decomposition: For a real-valued symmetric positive definite matrix \(\mathbf{A}\), it can be decomposed as \[\mathbf{A} = \mathbf{L}\mathbf{L}^T \] where \(\mathbf{L}\) is a lower triangular matrix with real and positive diagonal elements.
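A small R sketch with an arbitrary symmetric positive definite matrix: the eigenvalues are all positive, and chol() returns an upper-triangular factor \(\mathbf{U}\) with \(\mathbf{A} = \mathbf{U}^T\mathbf{U}\), so the lower-triangular \(\mathbf{L}\) above is its transpose.

A = matrix(c(4, 2, 2, 3), nrow = 2) #a symmetric positive definite matrix
eigen(A)$values                     #all eigenvalues are positive

L = t(chol(A))                      #chol() returns U with A = U^T U, so L = t(U)
L                                   #lower triangular with positive diagonal
all.equal(L %*% t(L), A)            #TRUE: recovers A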

A.18 (Ordinary) Least Squares

A.18.1 Calculus Approach

Let \(y_i\) be the outcome for individual \(i=1,...,n\) and \(\mathbf{x}_i\) be a vector of explanatory variables (including a 1 for the intercept) for individual \(i\). When we fit a linear regression model, we want to find the coefficients, \(\boldsymbol\beta = (\beta_0,...,\beta_p)\), that minimize the sum of squared errors,

\[\sum_i (y_i - \mathbf{x}_i^T\boldsymbol\beta)^2\]

If we stack the \(y_i\) on top of each other into an \(n\times 1\) vector \(\mathbf{y}\) and stack the row vectors \(\mathbf{x}_i^T\) on top of each other into an \(n \times (p+1)\) matrix \(\mathbf{X}\), we can write the sum of squared errors as an inner product,

\[ (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta)\]

By expanding this using matrix multiplication and properties, we get

\[ \mathbf{y}^T\mathbf{y} - \boldsymbol\beta^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol\beta + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \] Since \(\mathbf{y}^T\mathbf{X}\boldsymbol\beta\) is a scalar and therefore equals its transpose \(\boldsymbol\beta^T\mathbf{X}^T\mathbf{y}\), this can be simplified to

\[ \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \]

To find the value of \(\boldsymbol\beta\) that minimizes this quantity, we take the derivative with respect to \(\boldsymbol\beta\) and set it equal to zero.

\[ \frac{\partial}{\partial \boldsymbol \beta} [ \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta ] = -2\mathbf{X}^T\mathbf{y}+ 2\mathbf{X}^T\mathbf{X}\boldsymbol\beta \]

If we set this equal to 0 and solve for \(\boldsymbol\beta\), we get

\[\hat{\boldsymbol\beta}_{OLS}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
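As a quick numerical sanity check, the closed-form solution can be computed directly in R and compared with lm(). The data below are simulated, and the true coefficient values are made up purely for illustration.

set.seed(1)
n = 50
x1 = rnorm(n)
x2 = rnorm(n)
y = 1 + 2*x1 - 0.5*x2 + rnorm(n) #simulated outcome with made-up true coefficients

X = cbind(1, x1, x2)             #n x (p+1) design matrix with an intercept column

solve(t(X) %*% X) %*% t(X) %*% y #closed-form OLS estimate
coef(lm(y ~ x1 + x2))            #matches the closed-form solution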

A.18.2 Projection Approach

Another way to approach Least Squares is to say that we want to find the unique \(\hat{\mathbf{y}}\) in \(Col(X)\), the column space of \(X\), that minimizes \(||\mathbf{y} - \hat{\mathbf{y}}||\). For \(\hat{\mathbf{y}}\) to be in the column space of \(\mathbf{X}\), it must be of the form \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) for some \(\hat{\boldsymbol\beta}\).

The orthogonal projection of a vector \(\mathbf{y}\) onto the column space of the matrix \(\mathbf{X}\) is \[\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]

and therefore,

\[\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]