In this class, we will be using matrices to help us organize data and information. Thus, it is useful to know some basic definitions and properties of matrices. Linear algebra is not a prerequisite for this course, so if this is new to you, don’t fret!
Use this mathematical section as a reference as we go through the course. I've purposefully put more here than we might need. If I refer to a property or term that you are unfamiliar with, come back and check this section first.
If \(\mathbf{A}\) is a (\(r \times c\)) matrix and \(\mathbf{B}\) is a (\(r \times c\)) matrix (Note: nrow(A) = nrow(B) and ncol(A) = ncol(B)!), then \(\mathbf{A + B}\) is a (\(r \times c\)) matrix with \((i, j)\)th element \[(\mathbf{A + B})_{ij} = a_{ij}+ b_{ij}\]
```r
A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)
A + B #element wise addition
```
```
     [,1] [,2]
[1,]    6   10
[2,]    8   12
```
Here are some useful properties of matrix addition:
Commutativity \[ \mathbf{A}+ \mathbf{B} = \mathbf{B + A}\]
Associativity \[ \mathbf{A} + (\mathbf{B} + \mathbf{C}) = (\mathbf{A} + \mathbf{B}) + \mathbf{C}\]
Additive Identity \[\mathbf{A} + \mathbf{0}= \mathbf{A}\] where \(\mathbf{0}\) is a \(r\times c\) matrix with 0 elements.
Additive Inverse \[\mathbf{A} + (-\mathbf{A}) = \mathbf{0}\] where \(-\mathbf{A}\) is the \(r\times c\) matrix with elements \(-a_{ij}\).
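As a quick sketch in R, we can check these properties with the same \(\mathbf{A}\) and \(\mathbf{B}\) as above (the matrices C and Zero below are defined just for this illustration):

```r
A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)
C = matrix(c(9,10,11,12),nrow=2,ncol=2) #extra matrix for the associativity check
Zero = matrix(0,nrow=2,ncol=2)          #the additive identity

all.equal(A + B, B + A)                 #commutativity, should be TRUE
all.equal(A + (B + C), (A + B) + C)     #associativity, should be TRUE
all.equal(A + Zero, A)                  #additive identity, should be TRUE
all.equal(A + (-A), Zero)               #additive inverse, should be TRUE
```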
If \(\mathbf{A}\) is a (\(r \times q\)) matrix and \(\mathbf{B}\) is a (\(q \times c\)) matrix (Note: ncol(A) = nrow(B)!), then \(\mathbf{AB}\) is a (\(r \times c\)) matrix with \((i, j)\)th element \[(\mathbf{AB})_{ij} = \sum^q_{k=1}a_{ik}b_{kj}\] where \(a_{ik}\) is the element in the \(i\)th row and \(k\)th column of \(\mathbf{A}\) and \(b_{kj}\) is the element in the \(k\)th row and \(j\)th column of \(\mathbf{B}\).
Here, we say that \(\mathbf{A}\) is postmultiplied by \(\mathbf{B}\) and, equivalently, that \(\mathbf{B}\) is premultiplied by \(\mathbf{A}\). Note that the order matters!
```r
A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)
A * B #element wise multiplication (not what you want....)
```
```
     [,1] [,2]
[1,]    5   21
[2,]   12   32
```
```r
A %*% B #matrix multiplication
```
```
     [,1] [,2]
[1,]   23   31
[2,]   34   46
```
Here are some useful properties of matrix multiplication:
\[ \mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{AB})\mathbf{C}\]
\[ \mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{AB} + \mathbf{AC}, (\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{AC} + \mathbf{BC}\]
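As a quick sketch in R with the same matrices as above (and an extra matrix C defined just for this illustration), we can check these properties and confirm that the order of multiplication matters:

```r
A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)
C = matrix(c(9,10,11,12),nrow=2,ncol=2)

all.equal(A %*% (B %*% C), (A %*% B) %*% C) #associativity, should be TRUE
all.equal(A %*% (B + C), A %*% B + A %*% C) #left distributivity, should be TRUE
all.equal((A + B) %*% C, A %*% C + B %*% C) #right distributivity, should be TRUE
identical(A %*% B, B %*% A)                 #FALSE: the order matters
```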
The transpose of any (\(r\times c\)) \(\mathbf{A}\) matrix is the (\(c \times r\)) matrix denoted as \(\mathbf{A}^T\) or \(\mathbf{A}'\) such that \(a_{ij}\) is replaced by \(a_{ji}\) everywhere.
```r
A
```
```
     [,1] [,2]
[1,]    1    3
[2,]    2    4
```
```r
t(A)
```
```
     [,1] [,2]
[1,]    1    2
[2,]    3    4
```
A matrix is square if \(c=r\). The diagonal of a square matrix consists of the elements \(a_{ii}\), and the off-diagonal elements are the \(a_{ij}\) with \(i\not=j\).
A matrix \(\mathbf{A}\) is symmetric if \(\mathbf{A} = \mathbf{A}^T\).
For any matrix \(\mathbf{A}\), \(\mathbf{A}^T\mathbf{A}\) will be a square matrix.
The transpose of a matrix product is \[(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T\] The transpose of a matrix sum is \[(\mathbf{A}+\mathbf{B})^T = \mathbf{A}^T+\mathbf{B}^T\]
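A quick sketch of these transpose facts in R, again with the \(\mathbf{A}\) and \(\mathbf{B}\) defined above:

```r
A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)

all.equal(t(A %*% B), t(B) %*% t(A)) #transpose of a product, should be TRUE
all.equal(t(A + B), t(A) + t(B))     #transpose of a sum, should be TRUE
S = t(A) %*% A                       #square (and symmetric) for any A
all.equal(S, t(S))                   #should be TRUE
```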
For two vectors \(\mathbf{x}\) and \(\mathbf{y}\), the standard inner (dot) product of the two vectors in \(\mathbb{R}^k\) is \[<\mathbf{x},\mathbf{y}> = \mathbf{x}^T\mathbf{y} = \sum_{i=1}^k x_iy_i = x_{1}y_1 + x_2y_2+\cdots+x_ky_k\]
To be an inner product, it must satisfy 1) \(<\mathbf{x},\mathbf{x}>\geq 0\), with equality if and only if \(\mathbf{x}=\mathbf{0}\); 2) \(<\mathbf{x},\mathbf{y}> = <\mathbf{y},\mathbf{x}>\); and 3) \(<a\mathbf{x}+b\mathbf{y},\mathbf{z}> = a<\mathbf{x},\mathbf{z}>+b<\mathbf{y},\mathbf{z}>\).
```r
x = 1:10
y = 21:30
x
```
```
 [1]  1  2  3  4  5  6  7  8  9 10
```
```r
y
```
```
 [1] 21 22 23 24 25 26 27 28 29 30
```
```r
class(x); class(y)
```
```
[1] "integer"
[1] "integer"
```
```r
x %*% y #If x is a numeric vector (not a matrix), R doesn't differentiate between a column vector or row vector
```
```
     [,1]
[1,] 1485
```
```r
x = matrix(x,ncol=1) #If x is a column vector (matrix with 1 column), it matters.
y = matrix(y,ncol=1)
class(x); class(y)
```
```
[1] "matrix" "array"
[1] "matrix" "array"
```
```r
#x%*%y #Get an Error!
t(x)%*%y
```
```
     [,1]
[1,] 1485
```
The vector difference between two p-dimensional vectors \(\mathbf{x}\) and \(\mathbf{y}\) is calculated element by element,
\[(\mathbf{x - y})_{i} = x_{i}- y_{i}\]
For a vector \(\mathbf{x}\in \mathbb{R}^k\), a general vector norm \(||\mathbf{x}||_p\) for \(p=1,2,...\) is defined as \[||\mathbf{x}||_p = \left(\sum_{i=1}^k |x_i|^p \right)^{1/p} \] As \(p\rightarrow \infty\), we have a special case, \[||\mathbf{x}||_{\infty} = \max_i |x_i|\]
To be a norm, the following must be true, 1) \(||\mathbf{x}|| \geq 0\), 2) \(||a\mathbf{x}|| =|a|||\mathbf{x}||\), and 3) \(||\mathbf{x}+\mathbf{y}|| \leq ||\mathbf{x}|| + ||\mathbf{y}||\) (triangle inequality).
```r
sqrt(t(x)%*%x)
```
```
         [,1]
[1,] 19.62142
```
```r
sqrt(sum(x^2))
```
```
[1] 19.62142
```
This is the distance "as the crow flies."
```r
sqrt(sum((x-y)^2))
```
```
[1] 63.24555
```
```r
sqrt(t(x-y)%*%(x-y))
```
```
         [,1]
[1,] 63.24555
```
If \(p=2\), then you get the Euclidean distance (as the crow flies), \[d(\mathbf{x},\mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2 }\]
If \(p=1\), then you get the Manhattan distance (city-block distance), \[d(\mathbf{x},\mathbf{y}) = \sum_i |x_i - y_i| \]
If \(p\rightarrow\infty\), then you get Chebyshev distance, \[d(\mathbf{x},\mathbf{y}) =||\mathbf{x}-\mathbf{y}||_{\infty} = \max_i |x_i-y_i|\]
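A quick sketch in R of the three distances for the x and y used above; base R's dist() function computes the same quantities from the rows of a matrix:

```r
x = 1:10
y = 21:30

sqrt(sum((x - y)^2)) #Euclidean (p = 2), should match the 63.24555 above
sum(abs(x - y))      #Manhattan (p = 1)
max(abs(x - y))      #Chebyshev (p -> infinity)

#the same three distances via dist()
dist(rbind(x, y), method = "euclidean")
dist(rbind(x, y), method = "manhattan")
dist(rbind(x, y), method = "maximum")
```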
A vector space is a set of vectors that is closed under vector addition and scalar multiplication. It must also adhere to the usual axioms such as associativity, commutativity, etc. (8 in all). We will focus on real vector spaces on \(\mathbb{R}^k\).
The vector \(\mathbf{y} = a_1\mathbf{x_1}+a_2\mathbf{x_2}+\cdots+a_k\mathbf{x_k}\) is a linear combination of the vectors \(\mathbf{x_1},\mathbf{x_2},...,\mathbf{x_k}\). The set of all linear combinations of \(\mathbf{x_1},\mathbf{x_2},...,\mathbf{x_k}\) is their linear span.
A set of vectors is said to be linearly dependent if one of the vectors in the set can be written as a linear combination of the others. In other words, there exist \(k\) numbers (\(a_1,a_2,...,a_k\)), not all zero, such that \[a_1\mathbf{x_1}+a_2\mathbf{x_2}+\cdots+a_k\mathbf{x_k} = \mathbf{0} \] If the only way to satisfy this equation is with \(a_1 = \cdots = a_k = 0\), then the vectors are said to be linearly independent.
Theorem: For a vector space \(V\), a set of vectors is linearly dependent if and only if the matrix formed from those vectors is singular (see below for the definition of singularity).
Basis: A basis is a set of vectors that spans the whole vector space (i.e., any vector in the vector space can be written as a linear combination of basis elements) and is linearly independent.
The rank of \(\mathbf{A}\) is the dimension of the row space of \(\mathbf{A}\) (the space spanned by the rows of \(\mathbf{A}\)), which equals the dimension of the column space of \(\mathbf{A}\) (the space spanned by the columns of \(\mathbf{A}\)).
A square matrix \(\mathbf{A}\) is nonsingular if \(\mathbf{Ax} = \mathbf{0}\) implies that \(\mathbf{x} = \mathbf{0}\). If a matrix fails to be nonsingular, it is called singular. A square matrix is nonsingular if and only if its rank is equal to the number of rows (or columns).
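One way to check rank (and hence singularity) in R is through the QR decomposition; in this sketch, the third column is deliberately the sum of the first two, so the matrix is singular:

```r
X = cbind(c(1,2,3), c(4,5,6), c(5,7,9)) #third column = first column + second column
qr(X)$rank #should be 2 (< 3), so the columns are linearly dependent
det(X)     #should be 0 (up to rounding error), so X is singular
```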
The determinant of a square \(k\times k\) matrix \(\mathbf{A}\) is the scalar \[|\mathbf{A}| = \sum^k_{j=1}a_{1j}|\mathbf{A}_{1j}|(-1)^{1+j} \] where \(\mathbf{A}_{1j}\) is the \((k-1)\times(k-1)\) matrix obtained by deleting the first row and the \(j\)th column of \(\mathbf{A}\).
```r
A
```
```
     [,1] [,2]
[1,]    1    3
[2,]    2    4
```
```r
det(A)
```
```
[1] -2
```
The matrix \(\mathbf{B}\) such that \(\mathbf{AB} = \mathbf{BA} = \mathbf{I}\) (where \(\mathbf{I}\) is the identity matrix) is called the inverse of \(\mathbf{A}\) and is denoted by \(\mathbf{A}^{-1}\).
If \(\mathbf{A}\) is a nonsingular square matrix, then there is a unique matrix \(\mathbf{B}\) such that \[\mathbf{AB} = \mathbf{BA} = \mathbf{I} \] Equivalently, the inverse of \(\mathbf{A}\) exists if and only if the determinant of \(\mathbf{A}\) is not zero.
For square matrices \(\mathbf{A}\) and \(\mathbf{B}\) with the same dimension, \[(\mathbf{A}^{-1})^T = (\mathbf{A}^T)^{-1}\] \[(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\]
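In R, solve(A) returns the inverse of a nonsingular square matrix (and solve(A, b) solves \(\mathbf{Ax}=\mathbf{b}\)). A quick sketch with the matrices above:

```r
A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)

Ainv = solve(A)                                  #inverse of A
A %*% Ainv                                       #should be the 2x2 identity (up to rounding)
all.equal(t(solve(A)), solve(t(A)))              #(A^-1)^T = (A^T)^-1, should be TRUE
all.equal(solve(A %*% B), solve(B) %*% solve(A)) #(AB)^-1 = B^-1 A^-1, should be TRUE
```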
The trace of a square matrix is the sum of the diagonal elements, \(tr(\mathbf{A}) = \sum_j a_{jj}\).
\[tr(c\mathbf{A}) = c\; tr(\mathbf{A})\] \[tr(\mathbf{A}\pm \mathbf{B}) = tr(\mathbf{A})\pm tr(\mathbf{B})\] \[tr(\mathbf{AB}) = tr(\mathbf{BA})\]
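In R, the trace is just sum(diag(A)); a quick sketch of the cyclic property with the matrices above:

```r
A = matrix(c(1,2,3,4),nrow=2,ncol=2)
B = matrix(c(5,6,7,8),nrow=2,ncol=2)

sum(diag(A))                                      #tr(A) = 1 + 4 = 5
all.equal(sum(diag(A %*% B)), sum(diag(B %*% A))) #tr(AB) = tr(BA), should be TRUE
```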
The projection of a vector \(\mathbf{y}\) on a vector \(\mathbf{x}\) is \[\frac{\mathbf{y}^T\mathbf{x}}{||\mathbf{x}||_2^2}\mathbf{x} \]
The orthogonal projection of a vector \(\mathbf{y}\) onto the column space of a matrix \(\mathbf{X}\) gives the \(\hat{\mathbf{y}} = \mathbf{X}\boldsymbol\beta\) that minimizes \(||\mathbf{y} - \hat{\mathbf{y}}||\), where
\[\boldsymbol\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]
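A sketch of both projections in R, using the column vectors x and y from earlier (the two-column matrix X below, an intercept column plus x, is built just for this illustration):

```r
x = matrix(1:10, ncol=1)
y = matrix(21:30, ncol=1)

#projection of y onto the single vector x
c(t(y) %*% x / sum(x^2)) * x

#orthogonal projection of y onto the column space of X
X = cbind(1, x)
yhat = X %*% solve(t(X) %*% X) %*% t(X) %*% y
#here y = 20 + x lies exactly in the column space of X, so yhat reproduces y
all.equal(yhat, y) #should be TRUE (up to rounding)
```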
A square matrix is orthogonal if its rows, considered as vectors, are mutually perpendicular and have unit lengths; that is, \(\mathbf{A}\mathbf{A}^T = \mathbf{I}\).
The scalars, \(\lambda\), that satisfy the polynomial characteristic equation \(|\mathbf{A}-\lambda\mathbf{I}| = 0\) are called eigenvalues of matrix \(\mathbf{A}\).
If \(\mathbf{e}\) is a non-zero vector such that \[\mathbf{Ae} = \lambda\mathbf{e} \] then \(\mathbf{e}\) is said to be an eigenvector of the matrix \(\mathbf{A}\) associated with the eigenvalue \(\lambda\).
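In R, eigen() returns the eigenvalues and eigenvectors; a quick sketch with the matrix \(\mathbf{A}\) above, checking the defining equation for the first eigenpair:

```r
A = matrix(c(1,2,3,4),nrow=2,ncol=2)
e = eigen(A)
e$values  #eigenvalues of A
e$vectors #eigenvectors of A (one per column)

#check A e = lambda e for the first eigenvalue/eigenvector pair
all.equal(as.vector(A %*% e$vectors[,1]), e$values[1] * e$vectors[,1]) #should be TRUE
```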
A quadratic form in \(k\) variables \(x_1,...,x_k\) is \(\mathbf{x}^T\mathbf{Ax}\), where \(\mathbf{x}^T = [x_1,...,x_k]\) and \(\mathbf{A}\) is a symmetric \(k\times k\) matrix. It can be written as \(\sum^k_{i=1}\sum^k_{j=1}a_{ij}x_ix_j\).
A symmetric matrix is positive definite if \[\mathbf{x}^T\mathbf{Ax}>0\] for all vectors \(\mathbf{x}\not= \mathbf{0}\). If a matrix is positive definite, then all of the eigenvalues will be positive.
A symmetric matrix is nonnegative definite if \[\mathbf{x}^T\mathbf{Ax}\geq 0\] for all vectors \(\mathbf{x}\not= \mathbf{0}\).
Cholesky Decomposition: For a real-valued symmetric positive definite matrix \(\mathbf{A}\), it can be decomposed as \[\mathbf{A} = \mathbf{L}\mathbf{L}^T \] where \(\mathbf{L}\) is a lower triangular matrix with real and positive diagonal elements.
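In R, chol() returns the upper triangular factor \(\mathbf{U} = \mathbf{L}^T\) (so that \(\mathbf{A} = \mathbf{U}^T\mathbf{U}\)); transpose it to get the lower triangular \(\mathbf{L}\). The small symmetric positive definite matrix S below is made up for this illustration:

```r
S = matrix(c(4,2,2,3), nrow=2, ncol=2) #symmetric and positive definite
U = chol(S)                            #upper triangular factor
L = t(U)                               #lower triangular factor with positive diagonal
all.equal(L %*% t(L), S)               #should recover S, i.e. TRUE
```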
Let \(y_i\) be the outcome for individual \(i=1,...,n\) and \(\mathbf{x}_i\) be a vector of explanatory variables (including a 1 for the intercept) for individual \(i\). When we fit a linear regression model, we want to find the coefficients, \(\boldsymbol\beta = (\beta_0,...,\beta_p)\), that minimize the sum of squared errors,
\[\sum_i (y_i - \mathbf{x}_i^T\boldsymbol\beta)^2\]
If we stack the \(y_i\) on top of each other into an \(n\times 1\) vector \(\mathbf{y}\) and stack the \(\mathbf{x}_i^T\) on top of each other as the rows of an \(n \times (p+1)\) matrix \(\mathbf{X}\), we can write the sum of squared errors as an inner product,
\[ (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta)\]
By expanding this using matrix multiplication and properties, we get
\[ \mathbf{y}^T\mathbf{y} - \boldsymbol\beta^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol\beta + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \] This can be simplified to
\[ \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \]
To find the value of \(\boldsymbol\beta\) that minimizes this quantity, we can find the derivative and set the equations equal to zero.
\[ \frac{\partial}{\partial \boldsymbol \beta} [ \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta ] = -2\mathbf{X}^T\mathbf{y}+ 2\mathbf{X}^T\mathbf{X}\boldsymbol\beta \]
If we set this equal to 0 and solve for \(\boldsymbol\beta\), we get
\[\hat{\boldsymbol\beta}_{OLS}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
Another way to approach least squares is to say that we want to find the unique \(\hat{\mathbf{y}}\) in \(Col(\mathbf{X})\), the column space of \(\mathbf{X}\), that minimizes \(||\mathbf{y} - \hat{\mathbf{y}}||\). In order for \(\hat{\mathbf{y}}\) to be in the column space of \(\mathbf{X}\), we need \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) for some \(\hat{\boldsymbol{\beta}}\).
The orthogonal projection of a vector \(\mathbf{y}\) onto the column space of the matrix \(\mathbf{X}\) is \[\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]
and therefore,
\[\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]
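A small sketch in R that puts this formula to work on simulated data (made up for illustration) and checks it against lm():

```r
set.seed(1)
n = 50
x1 = rnorm(n); x2 = rnorm(n)
y = 1 + 2*x1 - 0.5*x2 + rnorm(n) #simulated outcome

X = cbind(1, x1, x2)                        #n x (p+1) design matrix (intercept column first)
beta_hat = solve(t(X) %*% X) %*% t(X) %*% y #OLS estimate, (X^T X)^{-1} X^T y
beta_hat
coef(lm(y ~ x1 + x2))                       #should match beta_hat (up to rounding)
```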