Principal Component Analysis part 1 – Mathematics for Machine Learning

We have been discussing mathematics for machine learning over quite a few posts now. We have covered linear algebra and multivariate calculus in detail. Another important mathematical concept used frequently in machine learning is Principal Component Analysis (PCA), which is used for dimensionality reduction of data. In Principal Component Analysis part 1, we will discuss data dimensionality, some statistical descriptions of data such as mean and variance, and how these descriptions change when the data is transformed.

Real-world data is often high-dimensional. Even in the house price example, many parameters affect the price: the number of bedrooms, the overall size of the house, the price of neighboring houses, nearby parks, train stations, and so on. Similarly, in an image dataset each data point might be a 640 px by 480 px image where each pixel has three values for the red, green, and blue channels, which gives 640 × 480 × 3 = 921,600 dimensions per image. In short, real-life data has more than one influential parameter to consider.

Working with high-dimensional data has some difficulties. The data becomes difficult to interpret, visualize, and even store. But high-dimensional data also has some helpful properties. Such data is often ‘overcomplete’: many dimensions can be explained by a combination of other dimensions, so some dimensions are redundant. Dimensions are often correlated, so the data has an inherent lower-dimensional structure. For example, if an image is made of four channels (red, green, blue, and gray), we can describe the gray channel as a combination of the other three channels.
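As a rough illustration of a redundant dimension, the sketch below builds a few four-channel pixels whose gray value is computed from red, green, and blue. The weights are the common BT.601 luma coefficients and are an assumption made for this example, since the text does not name specific ones. Any fixed combination would make the same point.

```python
# Assumed weights (ITU-R BT.601 luma); the article does not specify any.
WEIGHTS = (0.299, 0.587, 0.114)

def gray_from_rgb(r, g, b):
    """Gray value as a weighted combination of the three color channels."""
    wr, wg, wb = WEIGHTS
    return wr * r + wg * g + wb * b

# A few made-up RGB pixels, extended to four channels (r, g, b, gray).
rgb_pixels = [(255, 0, 0), (0, 255, 0), (10, 20, 30), (200, 200, 200)]
rgbg_pixels = [(r, g, b, gray_from_rgb(r, g, b)) for r, g, b in rgb_pixels]

# The fourth channel carries no new information: dropping it and
# recomputing it from the first three gives back the same pixels.
rebuilt = [(r, g, b, gray_from_rgb(r, g, b)) for r, g, b, _ in rgbg_pixels]
print(rebuilt == rgbg_pixels)
```

Although every pixel is stored with four numbers, three of them are enough to recover the fourth, so the data is effectively three-dimensional.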

Dimensionality reduction helps us take advantage of this structure and of the correlation between dimensions, ideally while losing little information from the data. The lower-dimensional representation of higher-dimensional data is often called a feature or a code.

Next in Principal Component Analysis part 1, we will discuss the required mathematical details to understand PCA.

While working with data, it is often not practical to carry the whole dataset around or to hand somebody the whole dataset to work with. Instead, we describe the dataset using some of its statistical properties. Such compact descriptions make the data easier to understand and work with. We will discuss a few of these properties: the average of the data points through the mean, the spread of the data points through the variance, and the relationship between dimensions through covariance and correlation.

Computing Mean of a Dataset:

The mean of a dataset is the average point in that dataset. It does not have to be present in the dataset itself. For example, suppose we have a set of images of the digit 8. The average of this image dataset is itself an image.

It blends the properties of all the images in the dataset, but it is not itself one of them. To compute it, each image is represented as a vector by stacking all of its pixels. These vectors are added together and the sum is divided by the total number of images to get the average image vector.

If we consider 4 images of the digit 8, the mean of the first image alone is the first image itself. Adding a second image and taking the average gives an image containing properties of both images. Similarly, adding the third and fourth images and taking the mean gives a blurrier image with properties of all four images. If we average all the images in the dataset, we get the final average image, which is not present in the dataset.
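The stacking and averaging steps can be sketched in a few lines of plain Python. The four 2x2 "images" below are made-up stand-ins for pictures of the digit 8, and the helper names `flatten` and `mean_vector` are hypothetical, chosen only for this illustration.

```python
def flatten(image):
    """Stack the rows of a 2D image into one long vector."""
    return [pixel for row in image for pixel in row]

def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]

# Four made-up 2x2 grayscale images (pixel intensities are arbitrary).
images = [
    [[0, 8], [8, 0]],
    [[0, 4], [8, 4]],
    [[4, 8], [4, 0]],
    [[0, 8], [8, 8]],
]

vectors = [flatten(img) for img in images]
average_image = mean_vector(vectors)
print(average_image)             # [1.0, 7.0, 7.0, 3.0]
print(average_image in vectors)  # False: the mean is not one of the images

# Running average after seeing the first 1, 2, 3, and 4 images.
for k in range(1, len(vectors) + 1):
    print(k, mean_vector(vectors[:k]))
```

The running average starts out equal to the first image and drifts toward a blend of all four as more images are added.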

To generalize this concept,

$$E[D] = \frac{1}{N} \sum_{n=1}^{N} x_n$$

where $E[D]$ is the expected value (mean) of dataset $D$, $x_n$ is the $n$-th data point, and $N$ is the total number of data points in $D$.

Suppose we roll five dice and get 1, 2, 4, 6, and 6. The mean is

$$E[D] = \frac{1 + 2 + 4 + 6 + 6}{5} = \frac{19}{5} = 3.8$$

We can see that 3.8 is not part of our dataset. We could never even get this value by rolling a die. This shows that the mean does not have to be an instance in the dataset. It is just the average of the data points present in the dataset.
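The dice example is easy to check in code. This is a minimal sketch in which the helper name `mean` is simply our own choice.

```python
def mean(data):
    """Average of the data points: their sum divided by their count."""
    return sum(data) / len(data)

rolls = [1, 2, 4, 6, 6]
mu = mean(rolls)

print(mu)           # 3.8
print(mu in rolls)  # False: the mean is not one of the rolled values
```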

Computing Variance of One Dimensional Dataset:

Take a look at the data points from two different datasets, drawn as circles and squares.

Data points from dataset D1 are at locations 1, 2, 4, and 5, and data points from D2 are at locations -1, 3, and 7. Both datasets have the same mean value of 3, but the data points of D1 are more concentrated around the mean than those of D2. To describe how concentrated the data points are around the mean, we use the variance. The variance measures the spread of the data points in the dataset.

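It is defined as the average squared distance of the data points from the mean value of the dataset:

$$\mathrm{Var}[D] = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2, \quad \mu = E[D]$$

For D1 this gives (4 + 1 + 1 + 4) / 4 = 2.5, and for D2 it gives (16 + 0 + 16) / 3 ≈ 10.67. So the average squared distance of D2 from the mean is bigger than the average squared distance of D1 from the mean. It means the data points of D2 are more spread out than the D1 data points.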
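The comparison between D1 and D2 can be reproduced with a short sketch that follows the definition directly (the average squared distance from the mean, dividing by the number of points). The function names are our own.

```python
def mean(data):
    return sum(data) / len(data)

def variance(data):
    """Average squared distance of the data points from their mean."""
    mu = mean(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

d1 = [1, 2, 4, 5]
d2 = [-1, 3, 7]

print(mean(d1), mean(d2))  # both means are 3.0
print(variance(d1))        # 2.5
print(variance(d2))        # about 10.67
```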

From this definition of variance we can conclude two things. First, the variance of a dataset cannot be negative, because it is an average of squared values. Second, taking the square root of the variance gives us the standard deviation. The standard deviation is expressed in the same units as the mean, whereas the variance is expressed in squared units. For this reason, we normally express the spread of data in terms of the standard deviation.
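For a quick check of the standard deviation, Python's standard library already provides the population versions of both quantities. The sketch below assumes the two one-dimensional datasets from the previous section, with points at 1, 2, 4, 5 and at -1, 3, 7.

```python
import statistics

d1 = [1.0, 2.0, 4.0, 5.0]
d2 = [-1.0, 3.0, 7.0]

for name, data in (("D1", d1), ("D2", d2)):
    var = statistics.pvariance(data)  # population variance, divides by N
    std = statistics.pstdev(data)     # square root of the population variance
    print(name, var, std)

std_d1 = statistics.pstdev(d1)  # about 1.58, in the same units as the data
std_d2 = statistics.pstdev(d2)  # about 3.27
```

Note that `statistics.variance` and `statistics.stdev` (without the `p`) divide by N - 1 instead, so they would give slightly larger values than the definition used here.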

Computing Variance of Higher Dimensional Dataset:

The variance as defined above is not enough on its own to describe higher-dimensional data. If we consider 2-dimensional data and compute the variance in the x and y directions separately, that does not tell us much about the shape of the data spread.

The figure below shows a dataset in two dimensions, with its variance in the x and y directions shown by horizontal and vertical bars.

Now consider other datasets and similar variance computations in the x and y directions.

The datasets are quite different, but their variances in the x and y directions are the same.

The four datasets below have very different shapes, yet all of them have the same variance in x and the same variance in y. If we look at the horizontal and vertical spread of the data separately, we cannot capture the correlation between x and y.
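A tiny numeric version of this observation: the three made-up 2D datasets below use the same x values and the same set of y values, only paired differently. Their per-axis variances are identical even though the point clouds look nothing alike. The dataset names are hypothetical labels for this sketch.

```python
def variance(values):
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

datasets = {
    "rising":  [(1, 1), (2, 2), (3, 3)],  # y grows with x
    "falling": [(1, 3), (2, 2), (3, 1)],  # y shrinks as x grows
    "mixed":   [(1, 1), (2, 3), (3, 2)],  # no clean trend
}

per_axis = {}
for name, points in datasets.items():
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    per_axis[name] = (variance(xs), variance(ys))
    print(name, per_axis[name])
```

All three lines print the same pair of numbers, so the variances in x and y alone cannot tell these datasets apart.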

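In the figure, we can see that as the x value of a data point increases, the y value tends to decrease. This shows that x and y are negatively correlated.

This relationship is captured by the covariance between x and y, which can be described as follows:

$$\mathrm{cov}[x, y] = E[(x - \mu_x)(y - \mu_y)]$$

where $\mu_x$ and $\mu_y$ are the means of the x and y values. Together with the variances of x and y, and noting that $\mathrm{cov}[x, y] = \mathrm{cov}[y, x]$, we can express these four quantities in a covariance matrix:

$$\begin{bmatrix} \mathrm{var}[x] & \mathrm{cov}[x, y] \\ \mathrm{cov}[y, x] & \mathrm{var}[y] \end{bmatrix}$$

With positive covariance, as the value of x increases, the value of y tends to increase as well. With negative covariance, as x increases, y tends to decrease. With zero covariance, x and y are uncorrelated: there is no linear relationship between them.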
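The sign of the covariance can be checked on three small made-up examples. This sketch computes the covariance as the average product of the deviations of x and y from their means. The data values are arbitrary.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Average product of the deviations of x and y from their means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4]

cov_pos = covariance(xs, [2, 4, 6, 8])   # y rises with x
cov_neg = covariance(xs, [8, 6, 4, 2])   # y falls as x rises
cov_zero = covariance(xs, [1, 3, 3, 1])  # y goes up, then back down

print(cov_pos)   # 2.5
print(cov_neg)   # -2.5
print(cov_zero)  # 0.0
```

In the last case y clearly depends on x, but not along a line, so the covariance is still zero. Zero covariance rules out a linear trend, not every kind of relationship.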

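Generalizing the concept to a dataset of N data points,

$$\mathrm{Var}[D] = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T$$

where $\mu$ is the mean of the dataset and each $x_i$ is a data point. If each data point is a vector with d dimensions, the resulting covariance matrix is a d×d matrix.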
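As a minimal sketch of the general definition, the covariance matrix can be computed as the average outer product of the centered data points. The code below uses plain lists so that each step stays visible. The helper names and the three sample points are assumptions for illustration.

```python
def mean_vector(points):
    """Element-wise mean of a list of equally sized vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def covariance_matrix(points):
    """Average outer product of the centered data points (divides by N)."""
    n = len(points)
    mu = mean_vector(points)
    centered = [[x - m for x, m in zip(p, mu)] for p in points]
    d = len(mu)
    return [[sum(c[r] * c[k] for c in centered) / n for k in range(d)]
            for r in range(d)]

points = [(1, 3), (2, 2), (3, 1)]  # made-up 2D data, y falls as x rises
cov = covariance_matrix(points)

for row in cov:
    print(row)
```

The diagonal entries are the variances of x and y, and the two off-diagonal entries are equal and negative here, which matches the falling trend in the data.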

Next in Principal Component Analysis part 1, let’s see what happens to mean and variance if we transform the dataset.

Effect of Linear Transformation on the Mean:

How does linearly transforming the dataset affect its mean value? That is, what happens if we stretch it or shift it?

Consider a dataset with data points at -1, 2, and 3. Its mean, or expected value, is 4/3 ≈ 1.33, shown as a star in the figure.

Now we shift the dataset by 2. We get the data points shown as red circles. What happens to the mean of the dataset? Well, it gets shifted by 2 too, to about 3.33.

Now, what happens if we stretch the dataset? Stretching the dataset by a factor of 2 means multiplying every data point by 2. The mean is then multiplied by 2 as well, giving about 2.67. In general, for a scale factor $\alpha$ and a shift $b$,

$$E[\alpha D + b] = \alpha E[D] + b$$
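Both effects on the mean can be verified numerically on the dataset with points at -1, 2, and 3. This is a small sketch with our own variable names.

```python
def mean(data):
    return sum(data) / len(data)

data = [-1, 2, 3]
shifted = [x + 2 for x in data]  # shift every point by 2
scaled = [2 * x for x in data]   # stretch every point by a factor of 2

print(mean(data))     # about 1.33
print(mean(shifted))  # about 3.33, the original mean plus 2
print(mean(scaled))   # about 2.67, the original mean times 2
```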

Effect of Linear Transformation on the Covariance:

Variance describes the spread of the data. What happens if we shift the dataset? Consider the following dataset, whose variance is indicated by a bar.

Now, we shift the dataset towards the right by 2, indicated by red squares.

What we see is that the relation between the data points themselves has not changed, which means the variance does not change. The variance of the data points before the shift and after the shift is the same.

What if we stretch or scale the same dataset? Consider the same dataset and scale each data point by 2.

Variance is the average squared distance of the data points from the mean. If we scale the data points by 2, their distance from the mean is also scaled by 2, so the squared distance is scaled by 4. The variance thus increases by a factor of 4.
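The same kind of check works for the variance. The sketch below assumes a small made-up dataset with points at -1, 2, and 3, since the dataset in the figure is not given in the text. It shifts the data by 2 and then scales it by 2.

```python
def variance(data):
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

data = [-1, 2, 3]                # assumed example data
shifted = [x + 2 for x in data]  # move every point right by 2
scaled = [2 * x for x in data]   # multiply every point by 2

var_original = variance(data)
var_shifted = variance(shifted)
var_scaled = variance(scaled)

print(var_original)  # about 2.89
print(var_shifted)   # unchanged by the shift
print(var_scaled)    # about 11.56, four times the original
```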

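In higher dimensions, the variance of a dataset is expressed as a covariance matrix. Consider a dataset D of data points $x_1, \dots, x_N$.

If we linearly transform every data point $x$ to $Ax + b$, for a matrix $A$ and an offset vector $b$, we get the covariance matrix of the transformed dataset as follows,

$$\mathrm{Var}[AD + b] = A \, \mathrm{Var}[D] \, A^T$$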
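The standard identity for this case says that transforming every point x to Ax + b turns the covariance matrix Σ into A Σ Aᵀ, with no contribution from b. The sketch below checks this numerically in plain Python. The data points, the matrix `A`, the offset `b`, and all helper names are made up for illustration.

```python
def mean_vector(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def covariance_matrix(points):
    """Average outer product of the centered data points (divides by N)."""
    n = len(points)
    mu = mean_vector(points)
    centered = [[x - m for x, m in zip(p, mu)] for p in points]
    d = len(mu)
    return [[sum(c[r] * c[k] for c in centered) / n for k in range(d)]
            for r in range(d)]

def matvec(a, v):
    return [sum(a_rc * v_c for a_rc, v_c in zip(row, v)) for row in a]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Hypothetical 2D data, transformation matrix, and offset vector.
points = [(1.0, 3.0), (2.0, 2.0), (4.0, 1.0), (5.0, 4.0)]
A = [[2.0, 0.0], [1.0, 3.0]]
b = [5.0, -1.0]

# Apply x -> Ax + b to every data point.
transformed = [[y + off for y, off in zip(matvec(A, p), b)] for p in points]

direct = covariance_matrix(transformed)  # covariance of the new data
predicted = matmul(matmul(A, covariance_matrix(points)), transpose(A))

print(direct)
print(predicted)
```

The two printed matrices agree up to floating-point rounding, and changing `b` leaves both of them untouched.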

We can conclude that in higher dimensions, too, shifting only affects the mean of the dataset, not the variance, whereas scaling the dataset affects both its mean and its variance.


Wrapping Up:

In Principal Component Analysis part 1, we discussed data dimensionality and some statistical properties of a dataset such as mean and variance. We also looked at how these properties change when we scale or shift the dataset. These mathematical foundations will help us implement the Principal Component Analysis algorithm for dimensionality reduction of the data. See you in the next part!