A3 Report · Task 6

Task 6 — PCA & Dimensionality Reduction

Question 6a: Why can the two-dimensional data stored in X be reconstructed so well using only one principal component?

(X = [[0.0, 0.4], [1.0, 2.0], [2.0, 3.2], [3.0, 5.3]])

Answer

The 4 data points are nearly perfectly collinear: feature 2 is approximately a linear function of feature 1 (X₂ ≈ 1.63·X₁ + 0.4). Geometrically, these points lie very close to a single straight line in 2D space.

PCA finds the directions of maximum variance, ordered from most to least. When data lies near a 1D line:

PC1 aligns with that line and captures almost all variance (R² ≈ 1.0 for this dataset).
PC2 is orthogonal to that line and captures only the tiny residual noise — essentially zero.

Projecting onto PC1 and reconstructing back to 2D therefore reproduces the original data with negligible error (MSE ≈ 0).

General principle: PCA succeeds at dimensionality reduction when features are highly correlated. If d features lie near a k-dimensional linear subspace (k < d), PCA's first k components capture the true structure and the remaining d−k components capture only noise. Here, 2D data lies near a 1D line, so k=1 suffices.

Question 6b: Use PCA to visualize datasets 4a and 4b. Use the smallest dimension that explains at least 80% of total variance.

Figures of reduced dimension data for datasets 4a and 4b, with the variance each PC explains in axis labels.
A table of R-squared and MSE for different dimensions in each dataset.
Which dimension should be selected to explain the most data while still being able to plot the result?

Answer

Code

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale as sk_scale
from sklearn.metrics import mean_squared_error as mse_fn

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

for ax, (dname, fname) in zip(axes,
        [('4a','dataset-task-4a.csv'),('4b','dataset-task-4b.csv')]):
    X_pca    = sk_scale(pd.read_csv(fname).values)   # centre + unit variance
    pca_full = PCA().fit(X_pca)
    evr      = pca_full.explained_variance_ratio_
    cumevr   = np.cumsum(evr)
    k_min    = int(np.argmax(cumevr >= 0.80)) + 1     # smallest k ≥ 80%

    # R² / MSE table
    for k_iter in range(1, len(evr)+1):
        pca_k = PCA(n_components=k_iter)
        sc    = pca_k.fit_transform(X_pca)
        r2    = float(np.sum(evr[:k_iter]))
        mse   = mse_fn(X_pca, sc @ pca_k.components_)
        print(f'k={k_iter}  R²={r2:.4f}  MSE={mse:.4f}')

    # Visualise with k_min components
    sc_vis = PCA(n_components=k_min).fit_transform(X_pca)
    if k_min == 1:
        ax.scatter(sc_vis[:,0], np.zeros(len(sc_vis)), alpha=0.7, s=40)
        ax.set_xlabel(f'PC 1 ({evr[0]*100:.1f}% variance)')
    else:
        ax.scatter(sc_vis[:,0], sc_vis[:,1], alpha=0.7, s=40)
        ax.set_xlabel(f'PC 1 ({evr[0]*100:.1f}% variance)')
        ax.set_ylabel(f'PC 2 ({evr[1]*100:.1f}% variance)')
    ax.set_title(f'Dataset {dname}: {k_min}D PCA ({cumevr[k_min-1]*100:.1f}% var.)')
plt.tight_layout();  plt.show()

Why we scale first

sk_scale centres each feature on 0 and scales to unit variance. Without this, PCA would be dominated by whichever feature has the largest numeric range — not necessarily the most informative one. Scaling ensures all features are considered equally by PCA.

Result

R² and MSE tables

Dataset 4a (3 features):

k	R² (cumulative)	MSE
1	0.8813	0.1187	← ≥80%
2	0.9753	0.0247
3	1.0000	0.0000	full

Dataset 4b (5 features):

k	R² (cumulative)	MSE
1	0.5163	0.4837
2	0.8210	0.1790	← ≥80%
3	0.9210	0.0790
4	0.9902	0.0098
5	1.0000	0.0000	full

Which dimension to select?

Dataset	Min k ≥ 80% var.	Recommended k for plotting	Reason
4a	k=1 (88.1%)	k=2 (97.5%)	2D scatter more informative than 1D; cost = +9.4% info at zero plot complexity
4b	k=2 (82.1%)	k=2 (82.1%)	Already ≥80% and directly plotable as 2D scatter

Dataset 4a: k=1 technically satisfies the 80% threshold (88.1%), but a single-axis plot is much less useful than a 2D scatter. Using k=2 captures 97.5% of variance while producing a standard 2D visualisation — a clear win for minimal extra cost.

Dataset 4b: k=2 exactly satisfies the threshold (82.1%) and directly produces a 2D scatter plot. Adding k=3 gives 92.1% but is no longer directly plotable without further reduction. Therefore k=2 is optimal for both datasets when the goal is to maximise explained variance while remaining plottable.