Home Assignment 3 › ML Course › Clustering & PCA A3 Report Apr 2026
Aryan
Homogeneity: each cluster = one class. Completeness: each class = one cluster. Both ~0.42 means clusters overlap. KMeans wins here.
Aryan

Task 2 — KMeans vs KMedoids

2

You can use dataset-task-2.csv or generate your own using make_blobs. Compare KMeans and KMedoids using homogeneity and completeness scores:

  • Train the models.
  • Make scatterplots of their performance. Also scatterplot the true labels for visual comparison.
  • Your plots.
  • If you made a dataset of your own, how you made it and why.
  • Which value of k you used for each model.
  • Which model performed better? Do you have any idea why?
  • Describe for both models a situation where it would be a better choice than the other.
Answer

Dataset

We used the provided dataset-task-2.csv (125 points, 2 features X0/X1, 4 ground-truth labels 0–3). Label distribution: three classes with 25 points each and one with 50 points — slightly imbalanced.

Code

url = 'https://raw.githubusercontent.com/MLCourse-LU/Datasets/main/dataset-task-2.csv'
df  = pd.read_csv(url, header=0)
X   = df.iloc[:, :-1].values   # features (X0, X1)
y   = df.iloc[:, -1].values    # ground-truth labels (0–3)
k   = 4

km_t2   = KMeans(n_clusters=k, random_state=42, n_init=10)
kmed_t2 = KMedoids(n_clusters=k, random_state=42)
km_t2.fit(X);  kmed_t2.fit(X)

km_hom   = homogeneity_score(y, km_t2.labels_)   # 0.417
km_comp  = completeness_score(y, km_t2.labels_)  # 0.419
kmed_hom = homogeneity_score(y, kmed_t2.labels_) # 0.373
kmed_comp= completeness_score(y, kmed_t2.labels_)# 0.378

How homogeneity & completeness work

Homogeneity measures whether each predicted cluster contains only members of a single ground-truth class (1 = perfectly homogeneous, 0 = totally mixed). Completeness measures whether all members of a ground-truth class are placed in the same predicted cluster (1 = fully captured). Together they form a picture analogous to precision/recall.

Results plot

Task 2 KMeans vs KMedoids

Scores

ModelHomogeneityCompleteness
KMeans0.4170.419
KMedoids0.3730.378

k = 4 was used for both models (matching the 4 ground-truth labels).

Which performed better and why

KMeans performed slightly better. KMeans places its centroid at the arithmetic mean of assigned points — an unconstrained position that adapts to each cluster's geometry. KMedoids constrains its centroid to an actual data point; in regions where clusters overlap, the chosen medoid may be slightly off-centre, slightly mis-aligning the Voronoi boundary and causing some points to be assigned incorrectly.

The class imbalance (one class has 50 vs 25 points) also affects KMedoids more — larger clusters pull medoids toward their densest sub-region, distorting smaller cluster boundaries.

Both scores (~0.42) are only moderate, indicating the 4 classes are not perfectly linearly separable — neither algorithm fully recovers the ground truth.

When each algorithm is the better choice

  • KMeans is better when clusters are roughly spherical, similarly sized, and free of outliers — it converges faster and its unconstrained centroid captures the true geometric centre of each Gaussian blob.
  • KMedoids is better when data contains significant outliers (the medoid is always an interior point, so it isn't pulled away), when the distance metric is non-Euclidean (e.g. Hamming distance on categorical data), or when interpretability matters — the medoid is a real data point and can be shown to stakeholders as a representative example.