Thursday, 25 December 2025

Dimension Reduction from 2D to 1D Using Principal Component Analysis

Understanding Eigenvalues and Principal Component Analysis (PCA)

What Are Eigenvalues?
Imagine a square matrix as a "transformation machine" that takes vectors (arrows in space) as input and spits out transformed vectors. Most directions get twisted or bent, but there are special directions, called eigenvectors, that pass through unchanged except for being stretched or shrunk (or flipped). The eigenvalue (λ) is the scaling factor that tells you how much the eigenvector is stretched or shrunk: Av = λv.
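As a quick check of that definition, here is a minimal NumPy sketch (the 2×2 matrix is an arbitrary example of mine, not taken from the dataset used later in this post): multiplying an eigenvector by the matrix gives back the same vector, only rescaled by its eigenvalue.

import numpy as np

# An arbitrary symmetric 2x2 matrix acting as the "transformation machine"
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of vecs are the eigenvectors; vals are the matching eigenvalues
vals, vecs = np.linalg.eig(A)

for i in range(len(vals)):
    v = vecs[:, i]
    print("eigenvalue lambda =", vals[i])
    print("A @ v             =", A @ v)          # transformed eigenvector
    print("lambda * v        =", vals[i] * v)    # same vector, only rescaled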
Why do they matter?

  • In real applications like Google’s PageRank, the eigenvector associated with the largest eigenvalue is used to rank web pages by importance.

Sample output from the Python code at the end of this post:

Eigenvalues (sorted descending):
[1.73716614 0.05354182]

Total variance: 1.79070796
Variance explained by PC1: 1.73716614
Percentage by PC1: 96.99 %


PCA with Explicit Eigenvalues Highlighted
In Principal Component Analysis (PCA), eigenvalues play a central role:
  • We compute the eigen decomposition of the covariance matrix.
  • The eigenvectors give the directions of the new axes (principal components).
  • The eigenvalues tell us how much variance (spread) each principal component captures.
  • We sort by descending eigenvalues and keep the top ones — here, the largest eigenvalue corresponds to PC1, capturing ~97% of the total variance, so reducing to 1D loses almost nothing.
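As a cross-check of those numbers, here is a minimal sketch using scikit-learn (assuming it is installed; the five-point dataset is the same one used in the full code at the end of this post). PCA.explained_variance_ holds the eigenvalues of the covariance matrix, and explained_variance_ratio_ the fraction of variance per component:

import numpy as np
from sklearn.decomposition import PCA

data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9],
                 [1.9, 2.2],
                 [3.1, 3.0]])

pca = PCA(n_components=2)
pca.fit(data)

print("Eigenvalues:    ", pca.explained_variance_)        # approx [1.7372, 0.0535]
print("Variance ratios:", pca.explained_variance_ratio_)  # approx [0.9699, 0.0301]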
What Is Principal Component Analysis (PCA)?
PCA is a technique to simplify high-dimensional data while keeping as much information as possible. It does this by finding new axes (principal components) aligned with the directions of greatest variance. Here’s how PCA works step by step:
  1. Center the data — subtract the mean so the cloud is centered at the origin.
  2. Compute the covariance matrix — measures how features vary together.
  3. Perform eigen decomposition on the covariance matrix.
    • The eigenvectors become the new axes (principal components).
    • The eigenvalues tell you how much variance each axis captures.
  4. Sort by eigenvalues (largest first) and project the data onto the top k components.
The first principal component (PC1) is the direction of maximum spread. PC2 is the next, perpendicular to PC1, and so on.
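Those four steps fit in a few lines of NumPy. The sketch below is a compact illustration (the function name pca and its return values are my own choices, not a library API):

import numpy as np

def pca(X, k):
    # 1. Center the data
    mean = X.mean(axis=0)
    Xc = X - mean
    # 2. Covariance matrix of the features
    cov = np.cov(Xc, rowvar=False)
    # 3. Eigen decomposition (eigh is appropriate for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by descending eigenvalue and project onto the top k axes
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order[:k]]
    scores = Xc @ components          # shape: (n_samples, k)
    return scores, eigenvalues[order], components

Calling pca(data, k=1) on the five-point dataset used later in this post reproduces the 1D reduction shown in the plotting code at the end (up to a possible sign flip of the component).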
Why use PCA?
  • Dimensionality reduction: Turn 1000 features into 50 without losing much information.
  • Visualization: Plot high-dimensional data in 2D or 3D.
  • Noise removal: Small eigenvalues often correspond to noise; dropping them cleans the data.
  • Speed: Fewer dimensions make machine learning models faster and less prone to overfitting.
A classic example is facial recognition ("eigenfaces"). Thousands of pixel values per image are reduced to a handful of principal components that capture the main variations (lighting, expression, pose), allowing efficient storage and comparison.
The Key Connection: Eigenvalues Power PCA
Eigenvalues are the heart of PCA. They quantify "importance":
  • Large eigenvalue → that direction explains a lot of the data’s variability → keep it.
  • Small eigenvalue → little information → safe to discard.
In practice, you might keep enough components to explain 95% of the total variance (sum of all eigenvalues).
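For example, here is a small sketch of that 95% rule (the eigenvalues below are made-up numbers standing in for whatever your decomposition produced, already sorted in descending order):

import numpy as np

# Hypothetical eigenvalues, sorted from largest to smallest
eigenvalues = np.array([4.2, 2.1, 0.9, 0.5, 0.2, 0.1])

# Cumulative fraction of total variance explained by the first 1, 2, ... components
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
print(cumulative)                                  # [0.525 0.7875 0.9 0.9625 0.9875 1.0]

# Smallest k whose components explain at least 95% of the variance
k = int(np.searchsorted(cumulative, 0.95)) + 1
print("keep k =", k, "components")                 # keep k = 4 components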
Final Thoughts
Eigenvalues help us understand the core scaling behavior of linear transformations, while PCA uses them to intelligently compress and reveal structure in data. Together, they’re essential tools in data analysis, machine learning, image processing, and even physics. Next time you hear about "reducing dimensions" or "finding principal directions," you’ll know it’s eigenvalues doing the heavy lifting behind the scenes!

Python code from Grok

import numpy as np
import matplotlib.pyplot as plt

# Dataset
data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9],
                 [1.9, 2.2],
                 [3.1, 3.0]])

# Center the data
mean = np.mean(data, axis=0)
centered_data = data - mean

# Covariance matrix and eigen decomposition
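# (np.cov subtracts the mean internally, so passing the raw data here gives the same result as passing centered_data)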
cov_matrix = np.cov(data.T)
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Sort by largest eigenvalue
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Print eigenvalues and variance explained
print("Eigenvalues (sorted descending):")
print(eigenvalues)
print("\nTotal variance:", eigenvalues.sum())
print("Variance explained by PC1:", eigenvalues[0])
print("Percentage by PC1:", 100 * eigenvalues[0] / eigenvalues.sum(), "%")

# First principal component
pc1 = eigenvectors[:, 0]

# Project onto PC1 (1D scores)
projected_scores = centered_data @ pc1

# Reconstructed points in original 2D space from 1D projection
reconstructed_2d = np.outer(projected_scores, pc1) + mean

# Plot
fig, axs = plt.subplots(1, 2, figsize=(16, 7))

# Left panel: Original data with PC1 and PC2 directions (both eigenvectors)
axs[0].scatter(data[:, 0], data[:, 1], color='blue', s=100, label='Original data points')
axs[0].scatter(mean[0], mean[1], color='red', marker='X', s=250, label='Mean (center)')
scale = 2.5
axs[0].arrow(mean[0], mean[1], eigenvectors[0,0]*scale, eigenvectors[1,0]*scale,
             head_width=0.15, head_length=0.2, fc='green', ec='green', linewidth=4,
             label=f'PC1 (eigenvalue ≈ {eigenvalues[0]:.3f})')
axs[0].arrow(mean[0], mean[1], eigenvectors[0,1]*scale, eigenvectors[1,1]*scale,
             head_width=0.15, head_length=0.2, fc='orange', ec='orange', linewidth=4,
             label=f'PC2 (eigenvalue ≈ {eigenvalues[1]:.3f})')
axs[0].set_xlabel('X (original feature 1)')
axs[0].set_ylabel('Y (original feature 2)')
axs[0].set_title('Original 2D Data with Principal Components\n(Eigenvectors of Covariance Matrix)')
axs[0].grid(True)
axs[0].legend()
axs[0].axis('equal')

# Right panel: Projections shown in X-Y space
axs[1].scatter(data[:, 0], data[:, 1], color='lightblue', alpha=0.6, s=100, label='Original points')
axs[1].scatter(reconstructed_2d[:, 0], reconstructed_2d[:, 1], color='red', s=100, label='Projected points (1D)')

# Draw the principal axis line (PC1)
t = np.linspace(projected_scores.min() - 1, projected_scores.max() + 1, 100)
line_x = mean[0] + t * pc1[0]
line_y = mean[1] + t * pc1[1]
axs[1].plot(line_x, line_y, color='green', linewidth=4, label='1D Principal Axis (PC1)')

# Draw dashed lines from original to projected points
for i in range(len(data)):
    orig = data[i]
    proj = reconstructed_2d[i]
    axs[1].plot([orig[0], proj[0]], [orig[1], proj[1]], color='gray', linestyle='--', linewidth=1.5)

axs[1].set_xlabel('X (original feature 1)')
axs[1].set_ylabel('Y (original feature 2)')
axs[1].set_title('After PCA: Data Reduced to 1D\n(Large eigenvalue direction captures most variance)')
axs[1].grid(True)
axs[1].legend()
axs[1].axis('equal')

plt.tight_layout()
plt.show()
