Singular Value Decomposition and its various interpretations and applications (with interactive examples)

Published: 23 Mar 2024

💡 This is a long read, but don't miss the interactive examples I have sprinkled throughout that might help you with learning!

I love working on side projects. I always have to be coding and learning something, even if my work does not require me to do that anymore. However, what if the side project this time does not have to involve the act of creating, but is purely about learning? I decided to choose a learning topic that is practical in both my work as a software engineer and game developer - "Linear Algebra".

I use Linear Algebra pretty frequently. I use basic vectors and matrix operations for game physics, camera projection and other game dev related work. But they often don't go beyond high-school level math. What if I want to create intelligent systems involving recommendation engines, 3D reconstruction, face recognition, image compression, large language models and...? It looks like there is more to learn about Linear Algebra!

I decided to approach my learning in a non-linear fashion. I decided to start with the topic of Singular Value Decomposition (SVD) first, and then return to revise the "intermediate" concepts (e.g., "ranks", "subspaces", "linear independence", "orthogonality", "eigenvectors") as and when I need them. The result? I became much more intimately familiar with Linear Algebra as compared to when I first learnt it in college, because now I have a better idea about their applications and what they are leading towards!

This article is written in the same way I revised Linear Algebra, by starting with Singular Value Decomposition first. I will add side-notes along the way that fill in the gaps in understanding important "intermediate concepts", as well as interactive examples to solidify our main concepts. Let's see if this learning method works for you!

Also, this article only serves as an introduction to this big and important topic in Linear Algebra. I hope to use this article as a starting point for more complex topics later on.

Introducing SVD using linear transformations

Most of the time when I am about to learn a new technical topic, I would visit Wikipedia for an introduction first. And so... here is what Wikipedia has to say about Singular Value Decomposition.

📖 "In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix into a rotation, followed by a rescaling, followed by another rotation." - Wikipedia

In other words, we only need 3 matrices to form any matrix, no matter how complicated it is. If we visualize this matrix as a linear transformation, we can see it as:

$$A = R_2 S R_1$$

To put it simply but not 100% accurately, a "linear transformation" happens when we transform a vector into another vector by multiplying it with a matrix. For example, let us transform a vector (x, y, z) by multiplying it with a matrix.

$$\begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 2x \\ 2y \\ 0 \end{bmatrix}$$

What we see here is that we have a new vector that is both stretched and has lost its z-dimension. All vectors multiplied by this matrix will be "transformed" to the same effect.

To put it more accurately, a matrix multiplication actually "transforms" the space that the vector resides in. In this example, we are transforming the original 3D "vector space" to become 2D instead, and scaled 2 times along the x and y axes.
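As a quick sanity check, here is a minimal sketch in plain JavaScript (matrices stored as arrays of rows; the helper name is my own) that applies the matrix above to a vector:

// Multiply a matrix (array of rows) with a column vector
function multiplyMatrixVector(M, v) {
    return M.map(row => row.reduce((sum, value, i) => sum + value * v[i], 0));
}

const M = [
    [2, 0, 0],
    [0, 2, 0],
    [0, 0, 0],
];

console.log(multiplyMatrixVector(M, [1, 2, 3])); // [2, 4, 0] - stretched, and the z-dimension is gone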

To quote Prof Gilbert Strang:

"Matrices act. They don't just sit there."

Note that in the example that is coming up, when we "scale" and "rotate" a shape, we don't exactly change the shape directly. It is more akin to us changing / transforming the space that the shape sits on, in order to change the shape.

Let us see this in action - step through the interactive visualization to transform the black square to fit the skewed grey shape with just the rotation > scale > rotation operations.

As illustrated above, we rotated the shape by R1 = 45°, then we scaled it by S = (1.5, 0.1, 1.0), then we rotated it again by R2 = 45°. This is represented by the following expression:

$$A = R_2 S R_1 = \begin{bmatrix} \cos(45^\circ) & -\sin(45^\circ) \\ \sin(45^\circ) & \cos(45^\circ) \end{bmatrix} \begin{bmatrix} 1.5 & 0 \\ 0 & 0.1 \end{bmatrix} \begin{bmatrix} \cos(45^\circ) & -\sin(45^\circ) \\ \sin(45^\circ) & \cos(45^\circ) \end{bmatrix}$$
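To double-check that composition numerically, here is a small sketch in plain JavaScript that multiplies the three 2x2 matrices together (the angles and scale factors are taken from the example above; multiplyMatrices is a throwaway helper of mine, not from any library):

// Multiply two matrices stored as arrays of rows
function multiplyMatrices(A, B) {
    return A.map(row =>
        B[0].map((_, j) => row.reduce((sum, value, k) => sum + value * B[k][j], 0))
    );
}

const angle = Math.PI / 4; // 45°
const R1 = [[Math.cos(angle), -Math.sin(angle)], [Math.sin(angle), Math.cos(angle)]];
const R2 = [[Math.cos(angle), -Math.sin(angle)], [Math.sin(angle), Math.cos(angle)]];
const S  = [[1.5, 0], [0, 0.1]];

// A = R2 * S * R1: R1 is applied to a vector first, then S, then R2
const A = multiplyMatrices(R2, multiplyMatrices(S, R1));
console.log(A); // the single matrix that produces the skewed parallelogram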

Let us rewrite this as the key expression that is commonly used to describe the SVD of a matrix:

$$A = U\Sigma V^T = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{bmatrix}$$

When we compare the expression $U\Sigma V^T$ with the expression $R_2 S R_1$, where $U = R_2$, $V^T = R_1$, and $\Sigma = S$, we see that $U$ and $V^T$ play the role of the rotations (they are orthonormal matrices), while $\Sigma$ plays the role of the scaling (it is a diagonal matrix). The side-notes below unpack the concepts behind this correspondence.

Matrix transposition is the swapping of the rows and columns of the matrix.

$$A = \begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix} \quad \text{and} \quad A^T = \begin{bmatrix} a & d \\ b & e \\ c & f \end{bmatrix}$$

This is helpful because of the rules behind certain matrix operations such as multiplication. For example, if we want to do a dot product between 2 vectors A and B, we can present it as:

$$A^T B = \begin{bmatrix} x & y & z \end{bmatrix} \begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = xx' + yy' + zz'$$
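In code, transposition and the "dot product as a matrix multiplication" idea look something like this (a plain JavaScript sketch; the helper names are my own):

// Swap the rows and columns of a matrix (stored as an array of rows)
function transpose(M) {
    return M[0].map((_, j) => M.map(row => row[j]));
}

// Dot product of two vectors, i.e. the single entry of AᵀB
function dot(a, b) {
    return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

console.log(transpose([[1, 2, 3], [4, 5, 6]])); // [[1, 4], [2, 5], [3, 6]]
console.log(dot([1, 2, 3], [4, 5, 6]));         // 1*4 + 2*5 + 3*6 = 32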

A set of vectors is linearly independent if no vector in the set can be formed as a linear combination of the other vectors in the set. For example, the following matrix does not contain linearly independent vectors:

$$A = \begin{bmatrix} 1 & 2 & 0 \\ 2 & 4 & 1 \\ 3 & 6 & 0 \end{bmatrix}$$

This is because the second column vector is a multiple of the first column vector. Compare this to the next matrix, which contains linearly independent vectors:

$$B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Now, let's imagine that each column vector of a matrix represents a potential axis of the space it defines, i.e. looking at B, we have 3 vectors pointing in the traditional x, y, z axes of a 3D vector space. Whereas in A, only the first and last column vectors identify the unique axes of its space, as the second column vector is simply a scaled version of the first column vector.

Why is this important? Because when we multiply by a matrix, we are transforming into the space defined by that matrix. And the number of linearly independent vectors tells us the number of dimensions and the axes of the space we are transforming into.

Two vectors are orthogonal to each other if they are perpendicular; an orthogonal matrix is one in which all column vectors are perpendicular to each other.

It follows that the column vectors of an orthogonal matrix are all linearly independent, as the vectors are all perpendicular!

An orthonormal matrix is one in which all column vectors are orthogonal to each other while also being unit vectors. Therefore it defines a linear transformation that does not have a scaling factor.

A rotation matrix should be orthonormal because it should purely rotate the subject without changing the volume or scale of the vector space, and hence of the subject.
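To make this concrete, here is a tiny sketch that checks both properties for the columns of a 2D rotation matrix - each column is a unit vector, and the columns are perpendicular to each other:

// Columns of a 2D rotation matrix for some arbitrary angle
const angle = Math.PI / 3;
const col1 = [Math.cos(angle), Math.sin(angle)];
const col2 = [-Math.sin(angle), Math.cos(angle)];

const dot = (a, b) => a[0] * b[0] + a[1] * b[1];
const length = v => Math.sqrt(dot(v, v));

console.log(length(col1), length(col2)); // ~1 and ~1 - unit vectors
console.log(dot(col1, col2));            // 0 - perpendicular, hence orthonormal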

The determinant of a matrix describes how much the matrix changes the "scale/area/volume" of the original space. An orthonormal matrix has a determinant of ±1, and a rotation matrix (which involves no reflection) has a determinant of 1.

There are many resources online on how to calculate the determinant of a matrix. Hence we shall not cover it here.

Here's another way to look at this - let v be a vector that we want to transform using the linear transformation A. This is simply shown as:

$$Av = x$$

If we split the resulting vector x into a scalar value σ multiplied by a unit vector u, we shall get:

$$Av = \sigma u$$

Imagine that now instead of just transforming 1 vector, we are going to transform a series of vectors using this expression:

$$A \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \end{bmatrix}$$

which, using the fact that V is orthonormal (so $V^{-1} = V^T$), can be re-arranged as

$$AV = U\Sigma$$

$$A = U\Sigma V^T$$

But why does SVD matter? Because it exposes some interesting properties that can be used to solve a variety of problems!

Dimensionality Reduction and Data Compression

Consider this rank 1 matrix

$$A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ -1 & -2 & -3 & -4 \\ 2 & 4 & 6 & 8 \\ 10 & 20 & 30 & 40 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \\ 2 \\ 10 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix} = uv^T$$

The rank of a matrix is another way to describe the number of dimensions of the space defined by the matrix. We have to use another word (i.e. "rank") to describe "dimensions in space", because the number of rows of the matrix also describes "dimensions" in terms of the number of variables in the system.

For example, a matrix with 3 rows may describe 3 variables (x, y, z), i.e. 3 dimensions! But the vector space described by the same matrix does not need to be 3-dimensional. For example, you can have a 3D vector lying on a 2D plane.

A matrix with n number of linearly independent vectors, and hence describing a vector space of n dimensions, will be of "rank n".

A matrix is said to be of "full rank" if the number of linearly independent vectors = smallest dimension of the matrix (either the number of rows or number of columns), and it is said to be "rank deficient" if otherwise.

This matrix is of rank = 1 because it only has 1 linearly independent column vector. By decomposing this bigger matrix into 2 smaller ones, u and vᵀ, we represent the same matrix with less data (16 numbers vs just 8!). This becomes significant when we have a bigger, more complex matrix.

Hence the idea is - when we factorize a complex matrix, we can compress the amount of data required to represent it! In this case, we have achieved lossless compression.
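Here is a quick sketch of that lossless compression in code - we store only u and vᵀ (8 numbers) and rebuild the full 4x4 matrix (16 numbers) with an outer product:

// Outer product: rebuild a rank-1 matrix from a column vector u and a row vector v
function outerProduct(u, v) {
    return u.map(ui => v.map(vj => ui * vj));
}

const u = [1, -1, 2, 10];
const v = [1, 2, 3, 4];

console.log(outerProduct(u, v));
// [[1, 2, 3, 4], [-1, -2, -3, -4], [2, 4, 6, 8], [10, 20, 30, 40]] - the original matrix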

Now imagine we have a higher ranked matrix A. We cannot simply reduce it to a single pair of column and row vectors. But what we can do is approximate A by using a sum of rank-1 matrices like so:

$$A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_n u_n v_n^T$$

$$= \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{bmatrix}$$

$$= U\Sigma V^T$$

Again, we arrive at the standard SVD expression.

We see that Σ is a diagonal matrix consisting of scalar values "σ". These are known as singular values. Each of them represents the "scale of influence" of its respective pair of left and right singular vectors u and v (i.e. a "singular vector pair").

In Σ, these singular values σ are arranged in descending order of magnitude. This means that the singular vector pairs earlier in the sequence have a higher influence in approximating the complex matrix A.

This is a helpful property of SVD, as it means we can remove the singular vector pairs that have the least influence in approximating A by turning their corresponding singular values σ to 0. This helps us achieve lossy compression with minimal information loss!
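In code, that truncation is just summing the first k rank-1 pieces and ignoring the rest. Here is a sketch, assuming U and V are given as arrays of column vectors and sigma as an array of singular values already sorted in descending order (how you obtain them, e.g. from a numerical library, is up to you):

// Approximate A using only the k most influential singular vector pairs:
// A ≈ σ1*u1*v1ᵀ + ... + σk*uk*vkᵀ
function rankKApproximation(U, sigma, V, k) {
    const rows = U[0].length;
    const cols = V[0].length;
    const A = Array.from({ length: rows }, () => new Array(cols).fill(0));
    for (let i = 0; i < k; i++) {
        for (let r = 0; r < rows; r++) {
            for (let c = 0; c < cols; c++) {
                A[r][c] += sigma[i] * U[i][r] * V[i][c];
            }
        }
    }
    return A;
}

// rankKApproximation(U, sigma, V, 1) keeps only the single strongest pattern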

In order to solidify this concept further, let us look at this interactive example that uses the same skewed parallelogram we had. Use the slider to play with different values of σ2 and see how it changes the shape.

$$A = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix}$$

$$= \begin{bmatrix} \cos(45^\circ) & -\sin(45^\circ) \\ \sin(45^\circ) & \cos(45^\circ) \end{bmatrix} \begin{bmatrix} {{s3Sigma[0][0]}} & 0.00 \\ 0.00 & {{s3Sigma[1][1]}} \end{bmatrix} \begin{bmatrix} \cos(45^\circ) & -\sin(45^\circ) \\ \sin(45^\circ) & \cos(45^\circ) \end{bmatrix}$$

$$= \begin{bmatrix} {{s3A[0][0]}} & {{s3A[0][1]}} \\ {{s3A[1][0]}} & {{s3A[1][1]}} \end{bmatrix}$$

We can see that when we adjust the values for σ1 and σ2, we change the scaling factor of their respective dimensions. As the parallelogram has the least variance along the dimension represented by σ2, when σ2=0, we compress the parallelogram into a line that is relatively close to the original shape (i.e. having the least information loss).

The fact that the 2D parallelogram has been turned into a 1D line is also significant! This shows the idea of dimensionality reduction - by setting σ2=0, we have reduced Σ from rank 2 to rank 1, which in turn reduces the rank of the linear transformation A from 2 to 1 as well (i.e. both matrices are now made up of only 1 linearly independent column vector)! This shows that the rank of Σ actually exposes the rank of A. Cool!

Let us now visualize this concept of lossy compression via dimensionality reduction using another example. Adjust the "quality" slider to see how it compresses the image!

σ1={{s4s[0][0]}}
σ2={{s4s[1][1]}}
σ3={{s4s[2][2]}}
σ4={{s4s[3][3]}}
σ5={{s4s[4][4]}}
σ6...σ10=0

In the image matrix, we can tell that out of 10 column vectors, there are only 5 that are linearly independent. This is because the left half of the image is the same as the right half.

Hence Σ and the image matrix are both of rank = 5 (the last 5 singular values σ are 0), i.e. the last 5 singular vector pairs do not tell us any new patterns or information about the image, and so there is no change to the image when we adjust the "quality" slider between 50% - 100%. But the change to the image quality gets increasingly visible as the affected singular value gets bigger. The first 5 singular vector pairs have a higher influence in approximating our image matrix.

Finding hidden correlations and winning the Netflix Prize


On 2 October 2006, Netflix held a competition with a grand prize of US$1,000,000 for anyone able to beat Netflix's own algorithm for predicting user ratings for films, based on previous user ratings. Many of the teams that participated realized that SVD formed the basis of the winning algorithms.

Not only was dimensionality reduction useful (given the very large data matrices the teams were working with), SVD could also be used to find hidden patterns or correlations between the rows and columns of a data matrix!

💡 This kind of statistical inference from large data sets is one of the methods for machine learning! I believe this is one of the methods used by various online shopping and social media platforms for their recommendation engines.

As you might be able to tell based on intuition from the previous example, each singular vector pair describes the relationship of some unique "characteristics" of the image, and the strength of each relationship is shown in its corresponding singular value.

Another way to put this is that in $A = U\Sigma V^T$, the columns of U capture the row-wise characteristics of the data matrix, the columns of V capture the column-wise characteristics, and the singular values in Σ tell us the strength of each of those characteristics.

Let us look at how this is reflected mathematically:

Let $A = U\Sigma V^T$, and hence $A^T = V\Sigma^T U^T$.

$$A^T A = V\Sigma^T U^T U\Sigma V^T$$

Since U is orthonormal, $U^T U = I$, and since Σ is diagonal, $\Sigma^T \Sigma = \Sigma^2$, so:

$$A^T A = V\Sigma^2 V^T$$

Multiplying both sides by V on the right (and using $V^T V = I$):

$$(A^T A)V = V\Sigma^2$$

Here $A^T A$ is what is known as a covariance matrix: it describes the dot product, and hence the directional relationship, of each column in A with every other column in A.

Also, if you haven't already noticed, $(A^T A)V = V\Sigma^2$ actually follows the equation for finding eigenvectors and eigenvalues. You should see that the eigenvectors are the column vectors of V, and the eigenvalues are the squares of the singular values in Σ!

This means that each column vector v describes the column-wise characteristics of the data matrix, while the square of each singular value σ2 describes the strength of that characteristic.

An eigenvector v of the matrix A is a vector that does not change direction when multiplied by A. v simply gets scaled by a certain amount λ, which is known as the eigenvalue. In other words:

$$Av = \lambda v$$
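For a tiny concrete example (my own, not from the interactive demos above):

$$\begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \end{bmatrix} = 3 \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

so $v = (1, 0)$ is an eigenvector of this matrix with eigenvalue $\lambda = 3$ - multiplying by the matrix does not change its direction, only scales it by 3.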

In Linear Algebra, we often find relationships/patterns/characteristics in data by looking at directional vectors. For example, sometimes we look at whether two data properties are positively or negatively correlated; Sometimes, we also check for "similarity" between 2 sets of data by their dot-product.

Eigenvectors are great because they extract the directional vectors that are inherently found within the matrix, telling us hidden characteristics / relationships between the data points. This importance is reflected in the German word eigen, which means "own" or "characteristic of".

And conveniently, the corresponding eigenvalues inform us about the strength of the relationships / characteristics.

We can use the same steps to derive that $(AA^T)U = U\Sigma^2$, and hence U describes the row-wise characteristics of the data matrix.


Returning to the "Netflix Prize", remember all we had in our data matrix are each user's ratings for each film. We do not know anything else about the users or about the films (e.g. genre). But with SVD, we can sniff out the "eigen-concept" that describes the relationship between the user and the film, and the strength of that relationship.

Imagine the following data matrix of user ratings, in which you have not rated any shows yet. Please give a rating of 1-10 for the shows "Star Wars" and "Twilight", and we will recommend your next watch, be it "Star Trek" or "Harry Potter", even though we do not know anything else about the shows and our users beyond just the ratings they have given.

Go ahead, give it a try!

Users \ Movies Star Wars Star Trek Twilight Harry Potter
FantasyLove {{movieRatings[0][0]}} {{movieRatings[0][1]}} {{movieRatings[0][2]}} {{movieRatings[0][3]}}
LoveSickBoy {{movieRatings[1][0]}} {{movieRatings[1][1]}} {{movieRatings[1][2]}} {{movieRatings[1][3]}}
GeekGal92 {{movieRatings[2][0]}} {{movieRatings[2][1]}} {{movieRatings[2][2]}} {{movieRatings[2][3]}}
You {{movieRatings[3][1]}} {{movieRatings[3][3]}}

Your next recommended watch: {{userRecommendation}}

💡 Note that we can get more accurate results with more data.

The recommendation for your next watch is discovered via SVD. We factorize the data matrix into 3 separate matrices represented by UΣVT. Based on your inputs, we have shown in the tables below the matrices showing the relationships between users, movies and concepts.

Σ - "concept strength" matrix

Concept 1 Concept 2 Concept 3 Concept 4
{{movieS[0]}} 0.0 0.0 0.0
0.0 {{movieS[1]}} 0.0 0.0
0.0 0.0 {{movieS[2]}} 0.0
0.0 0.0 0.0 {{movieS[3]}}

This is the "concept strength" matrix that is represented by Σ. Right now, we do not know what each "concept" might mean, except that it refers to some "property" of the movie (e.g. perhaps "genre")? We can get a better guess when we compare this with the U and V matrices.

For now, let us apply dimensionality reduction by removing concepts that have the least significance (i.e. lowest singular values). In our case, as shown in the table, we shall only keep concepts 1 and 2.

VT - concept to movie matrix

Concept \ Movies Star Wars Star Trek Twilight Harry Potter
Concept 1 {{movieV[0][0]}} {{movieV[1][0]}} {{movieV[2][0]}} {{movieV[3][0]}}
Concept 2 {{movieV[0][1]}} {{movieV[1][1]}} {{movieV[2][1]}} {{movieV[3][1]}}

This is the "concept to movie" matrix that is represented by VT. We can see that Twilight and Harry Potter weighs heavily on Concept 1, while Star Wars and Star Trek weighs heavily on Concept 2.

Note that we do not know what Concept 1 or Concept 2 represents. It could represent genre or popularity, or even both!

U - user to concept matrix

Users \ Concept Concept 1 Concept 2
FantasyLove {{movieU[0][0]}} {{movieU[0][1]}}
LoveSickBoy {{movieU[1][0]}} {{movieU[1][1]}}
GeekGal92 {{movieU[2][0]}} {{movieU[2][1]}}
You {{movieU[3][0]}} {{movieU[3][1]}}

This is the "user to concept" matrix that is represented by U. Interestingly, we can see that FantasyLove and LoveSickBoy share the same taste for movies belonging to Concept 1, whereas GeekGal likes movies belonging to Concept 2.

And you? {{userConcept}}
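If you are wondering how your two ratings were mapped into the concept space in the first place, one common approach (a sketch of the general idea, not necessarily exactly what the demo above does) is to project the new rating vector onto each kept movie-to-concept vector and divide by the concept strength:

// Project a new user's rating vector into concept space:
// userConcepts[j] = (ratings · vj) / σj for each kept concept j
function projectUserToConcepts(ratings, V, sigma) {
    return V.map((vj, j) =>
        ratings.reduce((sum, rating, i) => sum + rating * vj[i], 0) / sigma[j]
    );
}

// Hypothetical usage: V holds the kept right singular vectors (one per concept,
// with one entry per movie) and sigma holds the kept singular values.
// const yourConcepts = projectUserToConcepts([9, 0, 2, 0], V, sigma);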

Solving linear regression with SVD

You have probably learnt about drawing "best-fit lines" (i.e. linear regression) in Math/Science class in Secondary/middle school. This is an important concept as it helps us build a predictive model that describes the relationship between a dependent variable and its independent variables (e.g., predicting housing prices based on proximity distances and land area, or stock prices from past prices and trading volume, etc.). As you can see, what we are about to cover has applications in Machine Learning.

Finding the best-fit line by eye was easy, as we could simply estimate a line that has the least average distance from every data point. However, it is impossible for a computer to do that (unless you make this into a Computer Vision problem). The computer can only calculate the best-fit line numerically (this is also known as solving the "linear least squares problem"). And as it turns out, we can solve this problem with, you guessed it, SVD!

The least-squares solution is an estimated solution of an overdetermined system of linear equations (i.e. the number of equations is more than the number of variables, and so A is "tall"). In this estimated solution, the sum of squared errors is minimized.

Here is an example of SVD being used to solve a linear least squares problem. Click/tap on the graph to add new data points and watch the program find the best-fit line!

How is this done? First, let's get a hint from looking at the points you've just drawn in the interactive example - have you noticed that we are actually drawing a line along the direction with the biggest variance among the data points (i.e. the direction with the biggest spread)?

Taking a look at what we learnt from "Dimensionality Reduction", we know that each singular value calculated from SVD reflects the variance of the points along its principal direction. So, let's form our data matrix (let's call it A) from the data points you drew, and then work towards applying SVD on this data matrix.

$$A = \begin{bmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_n & y_n \end{bmatrix}$$

Remember that we are able to use SVD to analyze the rotational and scaling components of a data matrix, but it does not account for "translation". So we need to remove the translational component from our data by centering its data points around their mean. This will give us a new data matrix A'.

$$A' = \begin{bmatrix} x_1 - \bar{x} & y_1 - \bar{y} \\ x_2 - \bar{x} & y_2 - \bar{y} \\ \vdots & \vdots \\ x_n - \bar{x} & y_n - \bar{y} \end{bmatrix}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

For clarity, here is a code snippet:


// Finding the mean data point
let meanX = 0, meanY = 0;
for(let i = 0; i < data.length ; i++) {
    meanX += data[i][0];
    meanY += data[i][1];
}
meanX /= data.length;
meanY /= data.length;

// Building the new data matrix A' where points are centered around the mean data point
const centeredData = [];
for (let i = 0; i < data.length; i++) {
    centeredData.push([data[i][0] - meanX, data[i][1] - meanY]);
}
            

Following that, we decompose this data matrix using SVD to get UΣVT.

$$A' = U\Sigma V^T$$
= you have not inserted enough points in the interactive example

We want to select the right singular vector in V that has the highest corresponding singular value in Σ (highlighted in red), because it is the direction vector / eigenvector along which the data points have the biggest variance!

If you remember our discussion on finding correlations using SVD, the right singular vectors (column vectors in V) represent the column-wise characteristics in the data matrix, while the left singular vectors (column vectors in U) represent the row-wise characteristics.

We want to look at "column-wise characteristics" because each column in the data matrix represents a unique feature (in our case, "x" or "y"), and we are looking for the feature along which our data points have the biggest variance.

The chosen right singular vector is a direction vector / eigenvector that describes the relationship between the data points with respect to the chosen unique feature. Thus, it informs us of the gradient of the best-fit line.

P.S. The column space of a data matrix is also known as its "feature space".

Now, we can simply define the best-fit line using the equation:

$$y - \bar{y} = \frac{v_y}{v_x}(x - \bar{x})$$
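Continuing the earlier code snippet, the last step might look something like this. Note that svd here is a placeholder for whichever SVD routine / numerical library you choose; I am assuming it returns the right singular vectors as the columns of V, sorted by descending singular value:

// Placeholder: obtain V (right singular vectors as columns, sorted by
// descending singular value) from your SVD routine of choice
const { V } = svd(centeredData);

// The first right singular vector points along the direction of biggest variance
const vx = V[0][0];
const vy = V[1][0];

// Best-fit line passing through the mean point:
// y - meanY = (vy / vx) * (x - meanX)
const gradient = vy / vx;
function bestFitY(x) {
    return meanY + gradient * (x - meanX);
}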

The steps we have described are also known as Principal Component Analysis (PCA).

Solving linear equations with SVD

Let's look at solving the following system of linear equations:

$$\begin{cases} a + 7b + 3c = 0 \\ 2a + 4b + c = 0 \\ 4a + 8b + 6c = 0 \end{cases}$$

This can be formulated in the following form:

$$Ax = 0 \quad\Rightarrow\quad \begin{bmatrix} 1 & 7 & 3 \\ 2 & 4 & 1 \\ 4 & 8 & 6 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$

This is equivalent to saying that we are looking for the nullspace of A.

The span of a set of vectors is defined as the collection of all possible linear combinations of those vectors.

The "nullspace" of a matrix is a set of vectors (i.e. a span) that describes all possible linear combinations of those vectors that, when transformed by the matrix, would give the zero vector.

Since A is full rank and square, and hence invertible:

$$x = A^{-1}\mathbf{0}$$
$$x = \mathbf{0}$$

That is easy! We see that our linear system only has a trivial solution, i.e., the zero vector is the only vector in the nullspace of A.

However, what if A is not full rank and hence not invertible? Consider the following linear system:

$$\begin{cases} a + b + 0.25c = 0 \\ 2a + 4b + c = 0 \\ 4a + 8b + 2c = 0 \end{cases}$$

where

$$A = \begin{bmatrix} 1 & 1 & 0.25 \\ 2 & 4 & 1 \\ 4 & 8 & 2 \end{bmatrix}$$

Since A is not full rank (the third column is 0.25 times the second, so the last 2 columns are linearly dependent), A is not invertible. How then do we calculate the nullspace of A?

The answer - we shall use SVD on A, and the nullspace of A shall be the span of the right singular vectors that have zero as their singular values!

But how did we arrive at this conclusion? Let's break this down together.

$$A = U\Sigma V^T$$

$$A = \begin{bmatrix} {{ffsNullU[0][0]}} & {{ffsNullU[0][1]}} & {{ffsNullU[0][2]}} \\ {{ffsNullU[1][0]}} & {{ffsNullU[1][1]}} & {{ffsNullU[1][2]}} \\ {{ffsNullU[2][0]}} & {{ffsNullU[2][1]}} & {{ffsNullU[2][2]}} \end{bmatrix} \begin{bmatrix} {{ffsNullQ[0][0]}} & 0 & 0 \\ 0 & {{ffsNullQ[0][1]}} & 0 \\ 0 & 0 & {{ffsNullQ[0][2]}} \end{bmatrix} \begin{bmatrix} {{ffsNullV[0][0]}} & {{ffsNullV[1][0]}} & {{ffsNullV[2][0]}} \\ {{ffsNullV[0][1]}} & {{ffsNullV[1][1]}} & {{ffsNullV[2][1]}} \\ {{ffsNullV[0][2]}} & {{ffsNullV[1][2]}} & {{ffsNullV[2][2]}} \end{bmatrix}$$

$$A = \begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \sigma_2 & \\ & & 0 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \\ v_3^T \end{bmatrix}$$

$$A \begin{bmatrix} v_1 & v_2 & v_3 \end{bmatrix} = \begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \sigma_2 & \\ & & 0 \end{bmatrix}$$

$$A \begin{bmatrix} v_1 & v_2 & v_3 \end{bmatrix} = \begin{bmatrix} \sigma_1 u_1 & \sigma_2 u_2 & 0 \cdot u_3 \end{bmatrix}$$

$$Av_3 = 0$$

We see that the nullspace of A = span { v3 }, because v3 is the only vector in the SVD that can become the zero vector when transformed by A. This shows that the right singular vectors, with corresponding singular values that are 0, form the nullspace of A.
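As a small sketch, picking out a nullspace basis from an SVD result could look like this in code (assuming V is given as an array of right singular (column) vectors and sigma as the matching singular values; a tolerance is needed because the "zero" singular values usually come out as tiny non-zero numbers in floating point):

// Right singular vectors whose singular values are (numerically) zero
// form a basis for the nullspace of A
function nullspaceBasis(V, sigma, tolerance = 1e-10) {
    return V.filter((_, i) => Math.abs(sigma[i]) < tolerance);
}

// e.g. for the 3x3 example above, this would return just [v3]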

Using the same example matrix A:

$$Ax = 0$$

$$U\Sigma V^T x = 0$$

$$\Sigma V^T x = 0 \quad \text{(multiplying both sides by } U^T\text{, since } U^T U = I\text{)}$$

$$\begin{bmatrix} \sigma_1 & & \\ & \sigma_2 & \\ & & 0 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \\ v_3^T \end{bmatrix} x = 0$$

$$\begin{bmatrix} \sigma_1 (v_1 \cdot x) \\ \sigma_2 (v_2 \cdot x) \\ 0 \cdot (v_3 \cdot x) \end{bmatrix} = 0$$

As $\sigma_1 \neq 0$ and $\sigma_2 \neq 0$, that means $v_1 \cdot x = 0$ and $v_2 \cdot x = 0$, which implies that x has to be orthogonal to both $v_1$ and $v_2$.

And as we know $v_3$ is the only remaining direction orthogonal to both $v_1$ and $v_2$, that means $x \in \text{span}\{v_3\}$.


Alright, now let us look at another similar problem.

Solve $Ax = c$, where $c \neq 0$.

We can solve for x using SVD.

$$Ax = U\Sigma V^T x$$

$$Ax = \begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \sigma_2 & \\ & & 0 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \\ v_3^T \end{bmatrix} x$$

$$Ax = \begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix} \begin{bmatrix} \sigma_1 (v_1 \cdot x) \\ \sigma_2 (v_2 \cdot x) \\ 0 \cdot (v_3 \cdot x) \end{bmatrix}$$

$$Ax = au_1 + bu_2, \quad \text{where } a, b \in \mathbb{R}$$

This shows that Ax always lies in $\text{span}\{u_1, u_2\}$.

And since we knew that

$$A \begin{bmatrix} v_1 & v_2 & v_3 \end{bmatrix} = \begin{bmatrix} \sigma_1 u_1 & \sigma_2 u_2 & 0 \cdot u_3 \end{bmatrix}$$

...which means...

$$Av_1 = \sigma_1 u_1 \quad \text{and} \quad Av_2 = \sigma_2 u_2$$

From this, we can infer that a solution x can be found in $\text{span}\{v_1, v_2\}$ (i.e., the span of the right singular vectors whose singular values are not zero), because any linear combination of $v_1$ and $v_2$, when transformed by A, will land on the plane whose basis is $\text{span}\{u_1, u_2\}$ - which is exactly where c must lie for a solution to exist.

The Moore-Penrose Pseudoinverse

Lastly, there is another way we can use SVD to solve linear systems, by deriving what we call the Moore-Penrose Pseudoinverse - A+.

Again, let us consider a linear system of the form:

$$Ax = b$$

We can solve for x using the form:

$$x = A^+ b = V\Sigma^+ U^T b$$

where $A^+ = V\Sigma^+ U^T$ is the pseudoinverse.

To calculate the pseudoinverse A⁺, we can obtain U and V via SVD, and obtain Σ⁺ by replacing each non-zero singular value σ in Σ with its reciprocal (i.e. 1/σ), then transposing the resulting matrix (the transpose only matters when Σ is not square).
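Here is a minimal sketch of that recipe in code, assuming U and V are given as arrays of column vectors and sigma as the array of singular values (how you obtain them from your SVD library of choice is up to you):

// A⁺ = V Σ⁺ Uᵀ, built as a sum of (1/σi) * vi * uiᵀ over the non-zero singular values
function pseudoinverse(U, sigma, V, tolerance = 1e-10) {
    const m = U[0].length; // number of rows of A
    const n = V[0].length; // number of columns of A
    const Aplus = Array.from({ length: n }, () => new Array(m).fill(0));
    for (let i = 0; i < sigma.length; i++) {
        if (Math.abs(sigma[i]) < tolerance) continue; // zero singular values stay zero in Σ⁺
        for (let r = 0; r < n; r++) {
            for (let c = 0; c < m; c++) {
                Aplus[r][c] += (V[i][r] * U[i][c]) / sigma[i];
            }
        }
    }
    return Aplus;
}

Solving for x is then just the matrix-vector product of Aplus with b.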

If it happens that A is invertible, you will find that $A^{-1} = A^+$.

However, if A is overdetermined (i.e., height > width), you'll get a least-squares solution, and if A is underdetermined (i.e., width > height) you'll get a minimum-norm solution, both helpful in estimating a solution for x.

When a system of linear equations is underdetermined (i.e., A's width is bigger than its height, meaning there are more variables than there are equations), we will find that there are infinitely many solutions.

The minimum-norm solution just refers to the solution that has the smallest Euclidean norm, i.e. the solution that is nearest to the origin.

Calculating the SVD of a matrix

After all this talk about the various interpretations and applications of SVD, we have not even considered how we might calculate the SVD of a matrix! Of course, if you are writing software, there are already math libraries that do this in whatever language you choose.

We will not be going into the details of the algorithms for computing the SVD of a given matrix. Perhaps that can be a separate blog post of its own.


Thank you for reading! It takes time to create content such as this. If you'd like to support free and open education, consider dropping a tip!