K-Means is a popular unsupervised machine learning algorithm used for clustering similar data points into groups. It partitions the data into K clusters, where each data point belongs to the cluster with the nearest mean. The “K” in K-Means represents the number of clusters to be formed.
Key Concepts:
- Centroid: Each cluster is defined by a central point called a centroid, which is the mean of all data points in that cluster.
- Distance Metric: Euclidean distance is commonly used to measure the distance between data points and centroids.
- Objective Function (Inertia): K-Means minimizes the sum of squared distances between data points and their assigned cluster centroids; this quantity is often referred to as inertia.
- Initialization: K-Means is sensitive to the initial placement of centroids, so different initialization strategies can yield different final clusterings.
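The inertia objective above can be computed directly from a set of points and a cluster assignment. A minimal sketch with toy 2-D data and a hand-picked assignment (both are illustrative, not from the text):

```python
import numpy as np

# Toy data: six 2-D points, assigned by hand to two clusters for illustration
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Centroid of each cluster = mean of the points assigned to it
centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

# Inertia = sum of squared Euclidean distances from each point to its centroid
inertia = ((points - centroids[labels]) ** 2).sum()
print(centroids)  # centroid of each cluster
print(inertia)    # total within-cluster sum of squares
```

K-Means searches for the assignment and centroids that make this number as small as possible.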
K-Means Algorithm:
1. Initialization: Randomly choose K data points as the initial centroids.
2. Assignment: Assign each data point to the nearest centroid, forming K clusters.
3. Update Centroids: Recalculate each centroid as the mean of all data points in its cluster.
4. Repeat: Repeat steps 2 and 3 until convergence (when the centroids no longer change significantly).
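The steps above can be sketched in plain NumPy. This is an illustrative implementation, not production code (for instance, it does not handle the rare case of a cluster becoming empty):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means following the four steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster would need special handling; omitted here)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points, this converges in a few iterations to one centroid per group.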
Implementation in Python:
Using the scikit-learn library for K-Means clustering:
# Import necessary libraries
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
# Assume df is your DataFrame with features (X)
# Specify the number of clusters (K)
K = 3
# Initialize the K-Means model (n_init runs the algorithm with several
# centroid seeds and keeps the result with the lowest inertia)
model = KMeans(n_clusters=K, n_init=10, random_state=42)
# Fit the model to the data
model.fit(df)
# Get cluster labels and centroids
cluster_labels = model.labels_
centroids = model.cluster_centers_
# Add cluster labels to the original DataFrame
df['Cluster'] = cluster_labels
# Visualize the clusters (assumes exactly two feature columns,
# named 'Feature1' and 'Feature2')
plt.scatter(df['Feature1'], df['Feature2'], c=df['Cluster'], cmap='viridis', alpha=0.5, edgecolors='k')
# cluster_centers_ columns follow the feature-column order passed to fit()
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200)
plt.title('K-Means Clustering')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.show()
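A fitted model can also assign cluster labels to new, unseen points via `predict`. A self-contained sketch with toy data (not the DataFrame above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data with two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# predict() assigns each new point to its nearest learned centroid
new_points = np.array([[1.1, 1.0], [8.1, 8.0]])
print(model.predict(new_points))
```

This is how a trained clustering is applied to fresh data without refitting.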
Interpretation of Results:
- Cluster Labels: Each data point is assigned a cluster label (0 to K-1) corresponding to its nearest centroid.
- Centroids: The final centroids represent the mean values of the data points in each cluster.
- Visualization: The scatter plot with cluster assignments helps visualize the separation of data points into clusters.
Tips:
- Choosing K: The choice of K is crucial. It can be determined using techniques like the elbow method or silhouette analysis.
- Initialization Strategies: scikit-learn's KMeans uses k-means++ initialization by default, which spreads the initial centroids apart; purely random initialization (init='random') more often converges to poor local minima. Increasing n_init (the number of restarts) also improves robustness.
- Scaling: Standardize or normalize features if they are on different scales to ensure equal importance.
- Outliers: K-Means is sensitive to outliers. Consider preprocessing or using robust variants if outliers are present.
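The elbow method and scaling tips above can be sketched together: standardize the features, fit K-Means for a range of K values, and plot the inertia curve. The data here is synthetic (three hypothetical Gaussian blobs) purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic 2-D data: three Gaussian blobs (hypothetical example data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# Standardize so both features contribute equally to the distance
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for K = 1..8 and record the inertia for each
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)

# The "elbow" is the K where inertia stops dropping sharply (here, K = 3)
plt.plot(range(1, 9), inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
```

Inertia always decreases as K grows, so the point is not to minimize it but to find where further clusters stop paying for themselves.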
K-Means is widely used for various applications, such as customer segmentation, image compression, and anomaly detection. Understanding its strengths and limitations is essential for effective use.