A3 Report · Task 3

Task 3 — Silhouette Comparison

Adapt the silhouette code fragment to compare k-means and k-medoids:

Load the data from dataset-task-3.csv.
Make a clustering using KMeans and another using KMedoids, both with k=4.
Adapt the code to compare clusterings from both models.

Submit: A silhouette + clustering figure for both KMeans and KMedoids. Report the scores. Analyze which works better and why.

Answer

Code

url_t3 = 'https://raw.githubusercontent.com/MLCourse-LU/Datasets/main/dataset-task-3.csv'
df_t3 = pd.read_csv(url_t3)
X_t3  = df_t3.values
k_t3  = 4

for model_name, clusterer in [
        ('KMeans',   KMeans(n_clusters=4, random_state=42, n_init=10)),
        ('KMedoids', KMedoids(n_clusters=4, random_state=42))]:

    labels = clusterer.fit_predict(X_t3)
    sil_avg = silhouette_score(X_t3, labels)
    sil_vals = silhouette_samples(X_t3, labels)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

    # Left: silhouette bars per cluster
    y_lower = 10
    for i in range(4):
        vals = np.sort(sil_vals[labels == i])
        y_upper = y_lower + len(vals)
        color = cm.nipy_spectral(float(i) / 4)
        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, vals,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ax1.text(-0.05, y_lower + 0.5*len(vals), str(i))
        y_lower = y_upper + 10
    ax1.axvline(x=sil_avg, color='red', linestyle='--', label=f'Avg={sil_avg:.3f}')
    ax1.set_xlabel('Silhouette coefficient');  ax1.legend()

    # Right: scatter of clusters + centroids
    colors_sc = cm.nipy_spectral(labels.astype(float) / 4)
    ax2.scatter(X_t3[:,0], X_t3[:,1], c=colors_sc, s=30, alpha=0.7, edgecolor='k')
    for i, c in enumerate(clusterer.cluster_centers_):
        ax2.scatter(c[0], c[1], marker=f'${i}$', s=50, edgecolor='k', zorder=6)

    plt.suptitle(f'{model_name} Silhouette (k=4) — avg = {sil_avg:.4f}')
    plt.tight_layout();  plt.show()

How silhouette analysis works

The silhouette coefficient for point i is:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

where a(i) = mean distance to points in same cluster (cohesion) and b(i) = mean distance to points in the nearest other cluster (separation). Score near +1 → well separated; near 0 → on boundary; negative → likely misassigned. The overall average score summarises global cluster quality.

The silhouette plot shows sorted bars for each cluster — wide, positive bars indicate tight, well-separated clusters. The red dashed line marks the average score.

KMeans result (avg silhouette = 0.432)

KMedoids result (avg silhouette = 0.283)

Scores

Model	Average Silhouette Score
KMeans	0.432
KMedoids	0.283

Analysis — why KMeans wins here

Dataset-task-3 contains approximately round, balanced, well-separated blobs — the ideal shape for KMeans. KMeans minimises within-cluster sum of squared distances, which is equivalent to finding the optimal partition for spherical Gaussian clusters. Its centroids sit at the true geometric centre of each cluster.

KMedoids adds the constraint that its centroid must be an actual data point. For clean spherical clusters, this constraint means the medoid is almost never at the true centre — it is slightly off, causing slightly suboptimal cluster boundaries. The robustness advantage of KMedoids (resistance to outliers) is irrelevant when the data is clean.

In short: KMeans is the right tool here because the data matches its assumptions exactly.