Home Assignment 3 › ML Course › Clustering & PCA A3 Report Apr 2026
Aryan
PCA finds the directions of maximum variance. Always scale the features before PCA. Smallest k reaching 80% of the variance: k=1 for 4a (88%), k=2 for 4b (82%).

Task 6 — PCA & Dimensionality Reduction

6a

Question 6a: Why can the two-dimensional data stored in X be reconstructed so well using only one principal component?

(X = [[0.0, 0.4], [1.0, 2.0], [2.0, 3.2], [3.0, 5.3]])

Answer

The four data points are almost perfectly collinear: feature 2 is approximately a linear function of feature 1 (least-squares fit: X₂ ≈ 1.59·X₁ + 0.34). Geometrically, the points lie very close to a single straight line in 2D space.

PCA finds the directions of maximum variance, ordered from most to least. When data lies near a 1D line:

  • PC1 aligns with that line and captures almost all variance (R² ≈ 1.0 for this dataset).
  • PC2 is orthogonal to that line and captures only the tiny residual noise — essentially zero.

Projecting onto PC1 and reconstructing back to 2D therefore reproduces the original data with negligible error (MSE ≈ 0).
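The claim above can be checked numerically. The following is a minimal sketch, assuming X is passed to PCA unscaled (PCA centres it internally); the variable names are illustrative and not part of the assignment code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Data from Question 6a
X = np.array([[0.0, 0.4], [1.0, 2.0], [2.0, 3.2], [3.0, 5.3]])

pca = PCA(n_components=1)
scores = pca.fit_transform(X)          # 1D coordinates along PC1
X_rec = pca.inverse_transform(scores)  # back to 2D (the mean is added back)

evr_pc1 = float(pca.explained_variance_ratio_[0])
mse = float(np.mean((X - X_rec) ** 2))  # average squared error over all entries

print(f"PC1 explained variance ratio: {evr_pc1:.4f}")
print(f"Reconstruction MSE:           {mse:.4f}")
```

PC1 should account for well over 99% of the variance, and the reconstruction error should be small compared with the spread of the data.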

General principle: PCA succeeds at dimensionality reduction when features are highly correlated. If d features lie near a k-dimensional linear subspace (k < d), PCA's first k components capture the true structure and the remaining d−k components capture only noise. Here, 2D data lies near a 1D line, so k=1 suffices.
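To illustrate the general principle, here is a small sketch on synthetic data (an assumption for illustration only, unrelated to datasets 4a/4b): d = 5 features are generated from k = 2 latent coordinates plus a little noise, so the first two components should capture almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d, k = 200, 5, 2

latent = rng.normal(size=(n, k))   # true k-dimensional coordinates
W = rng.normal(size=(k, d))        # random linear map into d dimensions
X_syn = latent @ W + 0.01 * rng.normal(size=(n, d))  # plus small noise

# Cumulative explained variance over all d components
cumevr_syn = np.cumsum(PCA().fit(X_syn).explained_variance_ratio_)
print(np.round(cumevr_syn, 4))
```

The cumulative ratio should be close to 1 from the second entry onwards; the remaining d − k components only describe the added noise.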

6b

Question 6b: Use PCA to visualize datasets 4a and 4b. Use the smallest dimension that explains at least 80% of total variance.

  • Figures of reduced dimension data for datasets 4a and 4b, with the variance each PC explains in axis labels.
  • A table of R-squared and MSE for different dimensions in each dataset.
  • Which dimension should be selected to explain the most data while still being able to plot the result?

Answer

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale as sk_scale
from sklearn.metrics import mean_squared_error as mse_fn

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

datasets = [('4a', 'dataset-task-4a.csv'), ('4b', 'dataset-task-4b.csv')]
for ax, (dname, fname) in zip(axes, datasets):
    X_pca    = sk_scale(pd.read_csv(fname).values)   # centre + unit variance
    pca_full = PCA().fit(X_pca)
    evr      = pca_full.explained_variance_ratio_
    cumevr   = np.cumsum(evr)
    k_min    = int(np.argmax(cumevr >= 0.80)) + 1     # smallest k with cumulative variance >= 80%

    # R² / MSE table: reconstruct from the first k components and compare with X_pca
    print(f'Dataset {dname}')
    for k_iter in range(1, len(evr) + 1):
        pca_k = PCA(n_components=k_iter)
        sc    = pca_k.fit_transform(X_pca)
        r2    = float(cumevr[k_iter - 1])
        mse   = mse_fn(X_pca, pca_k.inverse_transform(sc))  # inverse_transform adds the mean back
        print(f'k={k_iter}  R²={r2:.4f}  MSE={mse:.4f}')

    # Visualise with k_min components
    sc_vis = PCA(n_components=k_min).fit_transform(X_pca)
    if k_min == 1:
        # 1D projection: draw the scores along a horizontal line
        ax.scatter(sc_vis[:, 0], np.zeros(len(sc_vis)), alpha=0.7, s=40)
        ax.set_xlabel(f'PC 1 ({evr[0]*100:.1f}% variance)')
        ax.set_yticks([])                              # the y axis carries no information here
    else:
        ax.scatter(sc_vis[:, 0], sc_vis[:, 1], alpha=0.7, s=40)
        ax.set_xlabel(f'PC 1 ({evr[0]*100:.1f}% variance)')
        ax.set_ylabel(f'PC 2 ({evr[1]*100:.1f}% variance)')
    ax.set_title(f'Dataset {dname}: {k_min}D PCA ({cumevr[k_min-1]*100:.1f}% var.)')

plt.tight_layout()
plt.show()

Why we scale first

sk_scale centres each feature at 0 and rescales it to unit variance. Without this step, PCA would be dominated by whichever feature has the largest variance, which often just reflects its units or numeric range rather than how informative it is. Scaling puts all features on an equal footing.
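The effect of scaling can be seen in a small sketch on synthetic data (an assumption for illustration, not the assignment datasets): two unrelated features, one of which has a much larger numeric range.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
small = rng.normal(0.0, 1.0, size=500)     # feature with a small numeric range
large = rng.normal(0.0, 1000.0, size=500)  # unrelated feature in much larger units
X_demo = np.column_stack([small, large])

evr_raw = PCA().fit(X_demo).explained_variance_ratio_            # no scaling
evr_scaled = PCA().fit(scale(X_demo)).explained_variance_ratio_  # centred, unit variance

print("raw:   ", np.round(evr_raw, 4))
print("scaled:", np.round(evr_scaled, 4))
```

Without scaling, PC1 is essentially the large-range feature and appears to explain nearly everything. After scaling, the two unrelated features share the variance roughly equally, which is the honest picture.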

Result

[Figure: Task 6 PCA]

R² and MSE tables

Dataset 4a (3 features):

k    R² (cumulative)    MSE
1    0.8813             0.1187    ← ≥80%
2    0.9753             0.0247
3    1.0000             0.0000    full

Dataset 4b (5 features):

k    R² (cumulative)    MSE
1    0.5163             0.4837
2    0.8210             0.1790    ← ≥80%
3    0.9210             0.0790
4    0.9902             0.0098
5    1.0000             0.0000    full
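In both tables the MSE equals 1 minus the cumulative R². This is expected: after standardising, every feature has variance 1, so the mean squared value of the data is 1 and the reconstruction error is exactly the share of variance left unexplained. A minimal check on synthetic data (hypothetical, not datasets 4a/4b):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
# Correlated synthetic features, standardised like the assignment data
X_chk = scale(rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4)))

cumevr_chk = np.cumsum(PCA().fit(X_chk).explained_variance_ratio_)
sums = []
for k in range(1, X_chk.shape[1] + 1):
    pca_k = PCA(n_components=k)
    X_rec = pca_k.inverse_transform(pca_k.fit_transform(X_chk))
    mse = mean_squared_error(X_chk, X_rec)   # averaged over all entries
    sums.append(cumevr_chk[k - 1] + mse)
    print(f"k={k}  R²={cumevr_chk[k - 1]:.4f}  MSE={mse:.4f}  sum={sums[-1]:.4f}")
```

The identity relies on unit-variance scaling; on unscaled data the MSE would instead equal the unexplained share multiplied by the average feature variance.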

Which dimension to select?

Dataset    Min k ≥ 80% var.    Recommended k for plotting    Reason
4a         k=1 (88.1%)         k=2 (97.5%)                   2D scatter is more informative than 1D; gains 9.4 percentage points of explained variance at no extra plotting complexity
4b         k=2 (82.1%)         k=2 (82.1%)                   Already ≥80% and directly plottable as a 2D scatter

Dataset 4a: k=1 already satisfies the 80% threshold (88.1%), but a single-axis plot is much less informative than a 2D scatter. Using k=2 captures 97.5% of the variance (9.4 percentage points more) and still yields a standard 2D visualisation.

Dataset 4b: k=2 is the smallest dimension that satisfies the threshold (82.1%), and it directly produces a 2D scatter plot. k=3 would raise the explained variance to 92.1% but can no longer be shown as a flat 2D scatter without further reduction. Therefore k=2 is the recommended choice for both datasets when the goal is to maximise explained variance while remaining plottable.
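The selection rule used above can be written as a short helper. This is only a sketch: choose_plot_dim is a hypothetical name, the 2D plotting limit is an assumption, and the cumulative ratios are copied from the tables in this report.

```python
import numpy as np

def choose_plot_dim(cumevr, threshold=0.80, plot_dim=2):
    """Return (k_min, k_plot, meets_threshold).

    k_min  : smallest k whose cumulative explained variance reaches threshold
    k_plot : largest dimension that can still be drawn (at most plot_dim)
    meets_threshold : whether k_plot components reach the threshold
    """
    cumevr = np.asarray(cumevr, dtype=float)
    k_min = int(np.argmax(cumevr >= threshold)) + 1
    k_plot = min(plot_dim, len(cumevr))
    return k_min, k_plot, bool(cumevr[k_plot - 1] >= threshold)

# Cumulative R² values taken from the tables in this report
cumevr_4a = [0.8813, 0.9753, 1.0000]
cumevr_4b = [0.5163, 0.8210, 0.9210, 0.9902, 1.0000]

print("4a:", choose_plot_dim(cumevr_4a))
print("4b:", choose_plot_dim(cumevr_4b))
```

For 4a the helper reports k_min = 1 but plots with k = 2; for 4b both values are 2. If the third return value were False, a 2D plot would not reach the threshold and the figure should say so.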