Principal Component Analysis

Balloon Fiesta at Albuquerque, NM. Some rights reserved.

My notes about Principal Component Analysis (PCA)

What is PCA used for?

Dimensionality reduction.

How does PCA work?

PCA finds the directions along which the variance of the data is largest. These directions are the principal components.

There are two common ways to compute PCA: eigendecomposition of the covariance matrix and Singular Value Decomposition (SVD). I’ll discuss them separately below.

Covariance Matrix Eigendecomposition

a.k.a. EVD (Eigenvalue Decomposition)

How to find the directions with largest variance?

Compute the eigenvectors and eigenvalues of the covariance matrix, then choose the r eigenvectors corresponding to the r largest eigenvalues. These r eigenvectors are the principal components.
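To make this concrete, here is a minimal NumPy sketch of PCA via covariance eigendecomposition (the toy data and variable names are my own illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))          # 200 samples, 5 features (toy data)
    Xc = X - X.mean(axis=0)                # center the data first

    cov = np.cov(Xc, rowvar=False)         # 5 x 5 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov) # eigh: for symmetric matrices,
                                           # returns eigenvalues in ascending order

    order = np.argsort(eigvals)[::-1]      # sort eigenvalues in descending order
    r = 2
    components = eigvecs[:, order[:r]]     # the r principal components
    X_reduced = Xc @ components            # project the data onto them (200 x r)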

But why do the eigenvectors give the directions of largest variance?

First, PCA does not change the data; it only changes the way we look at the data points. Because the eigenvectors of the covariance matrix are orthogonal to each other, we are simply re-expressing the coordinates in terms of a new set of basis vectors.

Let’s say the covariance matrix of the original (mean-centered) dataset is Σ. For a unit vector u, the projection of a data point xᵢ onto this direction is uᵀxᵢ. It is not hard to show that the variance of the projected points is uᵀΣu, which is what we want to maximize.

Maximizing uᵀΣu over unit vectors u is the same as maximizing the so-called Rayleigh quotient uᵀΣu / uᵀu. The maximum of such a Rayleigh quotient is attained by setting u equal to the eigenvector of Σ with the largest eigenvalue. For more details, check out “Rayleigh Quotient”, page 2 in this paper.
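As a quick numerical sanity check (the toy data below is my own), no random unit vector gives a larger projected variance than the top eigenvector of Σ:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # correlated toy data
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False)

    eigvals, eigvecs = np.linalg.eigh(Sigma)
    u_top = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

    def projected_variance(u):
        # variance of the projections u^T x_i, i.e. u^T Sigma u
        return u @ Sigma @ u

    # No random unit vector should beat the top eigenvector
    best_random = max(projected_variance(v / np.linalg.norm(v))
                      for v in rng.normal(size=(1000, 3)))
    print(projected_variance(u_top), ">=", best_random)   # largest eigenvalue wins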

That’s how Eigenvalues and Eigenvectors come into play.

How to get eigenvectors and eigenvalues?

Because the covariance matrix is symmetric, its eigenvectors are orthogonal, and we can use the defining equation Ax = λx, in which A is the square matrix, x is an eigenvector and λ is the corresponding eigenvalue. To calculate the eigenvalues, solve the characteristic equation det(λI−A) = 0, then recover the eigenvectors from Ax = λx.
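For instance (a small worked example of my own), for the symmetric matrix A = [[2, 1], [1, 2]], det(λI−A) = (λ−2)² − 1 = 0 gives λ = 1 and λ = 3:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])              # small symmetric "covariance-like" matrix

    # det(lambda*I - A) = (lambda - 2)^2 - 1 = 0  ->  lambda = 1 or 3
    eigvals, eigvecs = np.linalg.eigh(A)
    print(eigvals)                           # [1. 3.]

    # Check A x = lambda x for each eigenpair
    for lam, x in zip(eigvals, eigvecs.T):
        print(np.allclose(A @ x, lam * x))   # True, True

    # Eigenvectors of a symmetric matrix are orthonormal
    print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))   # True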

Singular Value Decomposition (SVD)

In practice, this method is more popular, partly because it avoids forming the covariance matrix explicitly and is numerically more stable.

Why does SVD make sense?

SVD states that any matrix X (it need not be square) can be factorized as X = UΣVᵀ, and truncated to rank r as X ≈ UᵣΣᵣVᵣᵀ. U (left singular vectors) and V (right singular vectors) are orthogonal matrices whose columns are orthonormal eigenvectors of XXᵀ and XᵀX respectively. Σ is a diagonal matrix whose r entries, in descending order, equal the square roots of the eigenvalues of XXᵀ (or XᵀX). These entries are called “singular values”.

XᵀX and XXᵀ are square symmetric matrices. We can diagonalize such a matrix by decomposing it into its eigenvector matrix, the diagonal matrix built from the corresponding eigenvalues, and the inverse of the eigenvector matrix.

In this way, XᵀX = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣ²Vᵀ and XXᵀ = (UΣVᵀ)(UΣVᵀ)ᵀ = UΣ²Uᵀ, which gives the eigenvector equations XᵀXV = VΣ² and XXᵀU = UΣ². Thus computing U, Σ and V of the (mean-centered) X yields the PCA of X. To keep r principal components, compute the rank-r approximation Xᵣ = σ₁u₁v₁ᵀ + ... + σᵣuᵣvᵣᵀ.
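Here is a minimal NumPy sketch of the SVD route, ending with the rank-r reconstruction (again, the toy data and variable names are my own illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 6))
    Xc = X - X.mean(axis=0)                     # center before PCA

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # s: singular values, descending

    # Right singular vectors = principal components (eigenvectors of Xc^T Xc)
    r = 2
    components = Vt[:r]                         # r x 6
    scores = Xc @ Vt[:r].T                      # equivalently U[:, :r] * s[:r]

    # Rank-r reconstruction: X_r = sum_i sigma_i * u_i * v_i^T
    X_r = (U[:, :r] * s[:r]) @ Vt[:r]
    print(np.linalg.norm(Xc - X_r))             # reconstruction error of the truncation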

Why is SVD fast?

Assume the data matrix is m x n. For eigenvalue decomposition, computing the covariance matrix costs O(mn²) and its eigendecomposition costs O(n³), so the total cost is O(mn² + n³). Computing the SVD directly from X costs O(min(mn², m²n)).

Other faster algorithms for computing the SVD

Randomized matrix approximation and the Nyström method.
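For example, scikit-learn ships a randomized SVD (sklearn.utils.extmath.randomized_svd); a minimal usage sketch on toy data of my own:

    import numpy as np
    from sklearn.utils.extmath import randomized_svd

    rng = np.random.default_rng(4)
    X = rng.normal(size=(2000, 500))
    Xc = X - X.mean(axis=0)

    # Approximate the top-10 singular triplets without a full decomposition
    U, s, Vt = randomized_svd(Xc, n_components=10, random_state=0)
    print(U.shape, s.shape, Vt.shape)      # (2000, 10) (10,) (10, 500)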