A3 Report · Task 2

Task 2 — KMeans vs KMedoids

You can use dataset-task-2.csv or generate your own using make_blobs. Compare KMeans and KMedoids using homogeneity and completeness scores:

Train the models.
Make scatterplots of their performance. Also scatterplot the true labels for visual comparison.
Your plots.
If you made a dataset of your own, how you made it and why.
Which value of k you used for each model.
Which model performed better? Do you have any idea why?
Describe for both models a situation where it would be a better choice than the other.

Answer

Dataset

We used the provided dataset-task-2.csv (125 points, 2 features X0/X1, 4 ground-truth labels 0–3). Label distribution: three classes with 25 points each and one with 50 points — slightly imbalanced.

Code

url = 'https://raw.githubusercontent.com/MLCourse-LU/Datasets/main/dataset-task-2.csv'
df  = pd.read_csv(url, header=0)
X   = df.iloc[:, :-1].values   # features (X0, X1)
y   = df.iloc[:, -1].values    # ground-truth labels (0–3)
k   = 4

km_t2   = KMeans(n_clusters=k, random_state=42, n_init=10)
kmed_t2 = KMedoids(n_clusters=k, random_state=42)
km_t2.fit(X);  kmed_t2.fit(X)

km_hom   = homogeneity_score(y, km_t2.labels_)   # 0.417
km_comp  = completeness_score(y, km_t2.labels_)  # 0.419
kmed_hom = homogeneity_score(y, kmed_t2.labels_) # 0.373
kmed_comp= completeness_score(y, kmed_t2.labels_)# 0.378

How homogeneity & completeness work

Homogeneity measures whether each predicted cluster contains only members of a single ground-truth class (1 = perfectly homogeneous, 0 = totally mixed). Completeness measures whether all members of a ground-truth class are placed in the same predicted cluster (1 = fully captured). Together they form a picture analogous to precision/recall.

Results plot

Scores

Model	Homogeneity	Completeness
KMeans	0.417	0.419
KMedoids	0.373	0.378

k = 4 was used for both models (matching the 4 ground-truth labels).

Which performed better and why

KMeans performed slightly better. KMeans places its centroid at the arithmetic mean of assigned points — an unconstrained position that adapts to each cluster's geometry. KMedoids constrains its centroid to an actual data point; in regions where clusters overlap, the chosen medoid may be slightly off-centre, slightly mis-aligning the Voronoi boundary and causing some points to be assigned incorrectly.

The class imbalance (one class has 50 vs 25 points) also affects KMedoids more — larger clusters pull medoids toward their densest sub-region, distorting smaller cluster boundaries.

Both scores (~0.42) are only moderate, indicating the 4 classes are not perfectly linearly separable — neither algorithm fully recovers the ground truth.

When each algorithm is the better choice

KMeans is better when clusters are roughly spherical, similarly sized, and free of outliers — it converges faster and its unconstrained centroid captures the true geometric centre of each Gaussian blob.
KMedoids is better when data contains significant outliers (the medoid is always an interior point, so it isn't pulled away), when the distance metric is non-Euclidean (e.g. Hamming distance on categorical data), or when interpretability matters — the medoid is a real data point and can be shown to stakeholders as a representative example.