Principal Component Analysis (PCA): Theory, Mathematics, and Applications

Author(s): Praveen Bhavani

Originally published on Towards AI.

Principal Component Analysis (PCA) is one of the most widely used techniques for dimensionality reduction and feature extraction. PCA transforms correlated variables into a smaller set of uncorrelated variables called principal components, while preserving as much information (variance) as possible.

PCA is fundamentally a linear algebra and statistical method rooted in:

- Covariance structure analysis
- Orthogonal transformations
- Eigenvalue decomposition
- Variance maximization

Modern datasets often contain hundreds or thousands of correlated variables. In finance, quantitative trading, computer vision, genomics, and machine learning, high-dimensional data creates several challenges:

- Computational inefficiency
- Noise accumulation
- Multicollinearity
- Overfitting
- Difficulty in visualization and interpretation

Conceptual Foundation of PCA

Consider a dataset of n observations measured across p features, where those features are often correlated with one another, as height and weight tend to be in a population study, or as pixel intensities tend to be in neighboring regions of an image. Working directly in this high-dimensional, correlated space is statistically inefficient: many of the p dimensions carry redundant information, and the sheer number of variables can obscure the underlying structure we actually care about.

PCA addresses this by asking a deceptively simple question: is there a new coordinate system in which the data's variability becomes easier to see? Rather than describing each observation through the original features x₁, x₂, …, xₚ, PCA constructs a new set of axes, the principal components PC₁, PC₂, …, PCₚ, obtained by rotating the original coordinate space. These axes are chosen sequentially and according to a strict rule: each one must point in the direction of greatest remaining variance in the data, subject to being orthogonal to every component that came before it.
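The sequential rule described above can be illustrated numerically. What follows is a minimal sketch, assuming NumPy and a small synthetic dataset; the function name `pca_eig`, the seed, and the data are illustrative assumptions rather than anything from the article. It relies on the standard result that the eigenvectors of the sample covariance matrix, sorted by decreasing eigenvalue, satisfy the "greatest remaining variance, orthogonal to earlier components" rule.

```python
import numpy as np

def pca_eig(X):
    """Return (eigenvalues, components), sorted by decreasing variance.

    The columns of `components` are the principal axes.
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # re-sort to descending variance
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)

# Hypothetical data: three features, the first two driven by the same
# latent variable z (so they are strongly correlated), the third independent.
z = rng.normal(size=(500, 1))
X = np.hstack([
    z + 0.1 * rng.normal(size=(500, 1)),
    2 * z + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

eigvals, W = pca_eig(X)
explained = eigvals / eigvals.sum()   # share of total variance per component
scores = (X - X.mean(axis=0)) @ W     # data expressed in the new axes

print("explained variance ratio:", np.round(explained, 3))
print("axes orthonormal:", np.allclose(W.T @ W, np.eye(3)))
```

With data built this way, the first component should account for most of the variance, because two of the three features are largely copies of the same underlying signal.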
This sequential construction gives rise to a natural ordering. PC₁ is the single direction through the data that explains more variance than any other possible axis; it is, in a precise geometric sense, the line along which the data is most spread out. PC₂ then captures as much of the residual variance as possible, after the contribution of PC₁ has been removed, and it does so at a right angle to PC₁. PC₃ repeats this logic relative to the first two components, and so on down the chain.

The orthogonality constraint is not merely a geometric nicety: together with variance maximization, it is what guarantees that the principal components are uncorrelated with one another. Variation captured by PC₁ has no linear overlap with PC₂, which means each component contributes something genuinely new to the description of the data. This stands in direct contrast to the original features, which may be entangled by correlations that make it hard to attribute variance to any single variable cleanly.

The practical payoff is that, when the features are strongly correlated, the bulk of the dataset's variance is typically concentrated in the first few principal components. A dataset with fifty correlated features might have 90% of its variance sitting in just three or four PCs. Those components can then stand in for the full feature set, dramatically reducing dimensionality while preserving the structure that matters most for analysis, visualization, or downstream modeling.

Geometric Interpretation

The algebra of PCA (eigendecompositions, covariance matrices, orthogonal projections) can feel abstract until you see what it is actually doing to the data. Geometrically, PCA is a rotation of the coordinate axes about the data's mean. Nothing is stretched, nothing is discarded, and no information is destroyed. The coordinate axes simply pivot to a new orientation, one chosen to align with the natural shape of the data rather than with the arbitrary axes of the original measurement space.

To make this concrete, consider two financial variables measured daily over a year: a stock's return and its trading volume.
Plotted as a scatter of points, this data rarely forms a tidy horizontal or vertical band. Returns and volume tend to move together (high-volume days often coincide with large price swings), so the cloud of points stretches diagonally across the plane, oriented somewhere between the two original axes. The original coordinate system, with its horizontal "return" axis and vertical "volume" axis, cuts across the data at an oblique angle. It describes the cloud accurately, but inefficiently: both variables are needed to characterize even the dominant pattern.

PCA identifies that diagonal direction and makes it the first axis. PC₁ runs along the longest dimension of the cloud, the direction in which the data is most spread out. A single number along this axis tells you more about where a given day sits in the distribution than either the return or the volume measurement alone. The second axis, PC₂, is then placed at a right angle to PC₁, pointing across the narrow dimension of the cloud and capturing whatever residual variation remains after the dominant trend is accounted for.

The result of this rotation is threefold. First, it compresses information: the dominant structure of the data, which previously required two coordinates to describe, is now legible in one. Second, it removes redundancy: because the original variables were correlated, they were partly saying the same thing twice; the rotated axes separate that shared signal from the independent variation each variable carries. Third, it decorrelates the features: by construction, the projections of the data onto PC₁ and PC₂ have zero correlation, so a point's position along one component carries no linear information about its position along the other.

What PCA does not do is change the data itself. Every point occupies the same position in space before and after the transformation; only the rulers used to measure that position have changed.
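The three effects of the rotation can be checked on a toy version of the return and volume example. This is a sketch under stated assumptions: the data are synthetic (two standardized series sharing a common factor, not real market data), and all variable names are hypothetical. Note that the orthogonal matrix returned by an eigendecomposition may include a reflection as well as a rotation; lengths and distances are preserved either way.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 250  # roughly one trading year; every value below is synthetic

# Two standardized series that share a common factor, so they are correlated.
common = rng.normal(size=n)
ret = common + 0.3 * rng.normal(size=n)   # stand-in for daily return
vol = common + 0.3 * rng.normal(size=n)   # stand-in for daily volume
X = np.column_stack([ret, vol])

mean = X.mean(axis=0)
Xc = X - mean
eigvals, W = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, W = eigvals[::-1], W[:, ::-1]    # order columns as PC1, PC2

scores = Xc @ W                           # coordinates in the rotated axes

# 1. Orthogonal change of basis: each point keeps its distance from the mean.
same_lengths = np.allclose(np.linalg.norm(Xc, axis=1),
                           np.linalg.norm(scores, axis=1))

# 2. Decorrelation: the rotated coordinates have (numerically) zero correlation.
score_corr = np.corrcoef(scores, rowvar=False)[0, 1]

# 3. Keeping both components reconstructs the original data.
lossless = np.allclose(scores @ W.T + mean, X)

# 4. Keeping only PC1 discards exactly the variation along PC2.
X_k1 = scores[:, :1] @ W[:, :1].T + mean
mse_k1 = np.mean(np.sum((X - X_k1) ** 2, axis=1))

print("lengths preserved:", same_lengths)
print("correlation of PC scores:", score_corr)
print("lossless with all components:", lossless)
print("mean squared error with PC1 only:", mse_k1, "vs PC2 variance:", eigvals[1])
```

The reconstruction error with one component is, up to the n - 1 versus n normalization, the variance along PC₂, which is what makes dropping a component a quantifiable choice.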
Because the transformation is only a rotation, PCA is lossless when all components are retained, and choosing to keep only the first k components is a deliberate, interpretable act of compression rather than an accidental distortion.

The Data Matrix

The starting point for any rigorous treatment of PCA is a precise description of the data it operates on. We represent our dataset as a matrix X ∈ ℝⁿˣᵖ, where n is the number of observations and p is the number of measured variables. Each of […]