
K-Means Clustering: Basics and Implementation

K-Means is a popular unsupervised machine learning algorithm used for clustering similar data points into groups. It partitions the data into K clusters, where each data point belongs to the cluster with the nearest mean. The “K” in K-Means represents the number of clusters to be formed.

Key Concepts:

  1. Centroid:
  • Each cluster is defined by a central point called a centroid, which represents the mean of all data points in that cluster.
  2. Distance Metric:
  • Euclidean distance is commonly used to measure the distance between data points and centroids.
  3. Objective Function (Inertia):
  • K-Means aims to minimize the sum of squared distances between data points and their respective cluster centroids. This measure is often referred to as inertia.
  4. Initialization:
  • K-Means is sensitive to the initial placement of centroids. Different initialization strategies can impact the final clustering result.
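The distance and inertia concepts above can be made concrete with a tiny NumPy sketch. The points and centroids below are illustrative values, not taken from any real dataset:

```python
import numpy as np

# Toy data: 4 points in 2D and 2 candidate centroids (illustrative values)
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
centroids = np.array([[1.0, 1.5], [8.5, 8.5]])

# Euclidean distance from every point to every centroid (broadcasting)
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point is assigned to the cluster with the nearest centroid
labels = dists.argmin(axis=1)

# Inertia: sum of squared distances to each point's assigned centroid
inertia = (dists.min(axis=1) ** 2).sum()
print(labels, inertia)
```

Here the first two points fall in cluster 0 and the last two in cluster 1; K-Means searches for centroids that make this inertia as small as possible.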

K-Means Algorithm:

  1. Initialization:
  • Randomly choose K data points as initial centroids.
  2. Assignment:
  • Assign each data point to the nearest centroid, forming K clusters.
  3. Update Centroids:
  • Recalculate the centroids as the mean of all data points in each cluster.
  4. Repeat:
  • Repeat steps 2 and 3 until convergence (when centroids no longer change significantly).
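The four steps above can be sketched in plain NumPy. This is a minimal illustration of the algorithm, not production code: it assumes no cluster ever becomes empty during the update step (a real implementation should handle that case):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On well-separated data this loop typically converges in a handful of iterations; the scikit-learn implementation shown later adds smarter initialization and multiple restarts on top of the same core idea.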

Implementation in Python:

Using the scikit-learn library for K-Means clustering:

# Import necessary libraries
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt

# Assume df is your DataFrame with features (X)
# Specify the number of clusters (K)
K = 3

# Initialize the K-Means model
model = KMeans(n_clusters=K, random_state=42)

# Fit the model to the data
model.fit(df)

# Get cluster labels and centroids
cluster_labels = model.labels_
centroids = model.cluster_centers_

# Add cluster labels to the original DataFrame
df['Cluster'] = cluster_labels

# Visualize the clusters (for 2D data; assumes Feature1 and Feature2 are
# the first two columns of df, so they line up with centroids[:, 0] and [:, 1])
plt.scatter(df['Feature1'], df['Feature2'], c=df['Cluster'], cmap='viridis', alpha=0.5, edgecolors='k')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200)
plt.title('K-Means Clustering')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.show()

Interpretation of Results:

  • Cluster Labels: Each data point is assigned a cluster label (0 to K-1) based on its proximity to the centroid.
  • Centroids: The final centroids represent the mean values of the data points in each cluster.
  • Visualization: The scatter plot with cluster assignments helps visualize the separation of data points into clusters.

Tips:

  • Choosing K: The choice of K is crucial. It can be determined using techniques like the elbow method or silhouette analysis.
  • Initialization Strategies: Purely random initialization may not always result in the best clustering. K-Means++ (scikit-learn's default `init`) spreads the initial centroids apart, and running several initializations (the `n_init` parameter) further improves the odds of good convergence.
  • Scaling: Standardize or normalize features if they are on different scales to ensure equal importance.
  • Outliers: K-Means is sensitive to outliers. Consider preprocessing or using robust variants if outliers are present.
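The first three tips can be combined into one short sketch: scale the features, then compare inertia (elbow method) and silhouette scores across candidate values of K. The synthetic blobs below are purely illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 well-separated blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Scaling: put all features on a comparable scale before clustering
X_scaled = StandardScaler().fit_transform(X)

inertias = {}
silhouettes = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias[k] = km.inertia_       # elbow method: look for the "bend"
    silhouettes[k] = silhouette_score(X_scaled, labels)

# Silhouette analysis: higher is better
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Inertia always shrinks as K grows, so the elbow method looks for the point of diminishing returns rather than a minimum; the silhouette score, by contrast, peaks at the K that best balances cluster cohesion and separation.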

K-Means is widely used for various applications, such as customer segmentation, image compression, and anomaly detection. Understanding its strengths and limitations is essential for effective use.