K-Means Algorithm


Industrial and HR Application Examples:

Industrial Use Case:

K-Means is widely applied in manufacturing to optimize production processes and to classify products or machinery. It is a simple yet powerful clustering method for grouping data.

Manufacturing Example: Quality Control and Product Classification

Imagine an electronics components manufacturing plant producing various sensors where quality control is crucial, as the components’ performance and specifications may vary slightly. The goal is to group these components into categories that meet specific quality standards, making it easier to identify defective products.

How does K-Means work in this example?

Data Collection:

Gather data on each manufactured sensor, such as resistance, temperature sensitivity, electrical conductivity, size, and other characteristics. These features form the basis for grouping the products.

Applying the K-Means Algorithm:

Use the K-Means algorithm to create a predetermined number of clusters (e.g., 3 or 4) representing different quality categories. The number of clusters depends on allowable variations and quality expectations in the factory, such as “excellent quality,” “acceptable quality,” “average quality,” and “defective” product groups.
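This grouping step can be sketched in a few lines. The sensor measurements below are synthetic and the feature set (resistance, temperature sensitivity, conductivity, size) is only an illustrative assumption; the sketch uses the scikit-learn library rather than a hand-written implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic sensor measurements (hypothetical values for illustration):
# columns are resistance, temperature sensitivity, conductivity, size.
rng = np.random.default_rng(42)
sensors = rng.normal(loc=[100.0, 0.5, 10.0, 5.0],
                     scale=[5.0, 0.05, 1.0, 0.2],
                     size=(200, 4))

# Four quality categories, matching the example in the text.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(sensors)

print(labels.shape)                   # one cluster label per sensor
print(kmeans.cluster_centers_.shape)  # (4, 4): one centroid per category
```

In practice the features would first be scaled (e.g. with `StandardScaler`), since resistance values in the hundreds would otherwise dominate the distance calculation.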

Identifying Clusters and Anomalies:

After clustering, each product is assigned to a group based on its proximity to the nearest centroid. Defective or out-of-range products are those far from any cluster centroids, indicating they might need further inspection or correction.
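The "far from every centroid" rule can be sketched as follows. The centroids are fabricated for illustration (in practice they come from a K-Means fit), and the threshold of mean plus three standard deviations is an assumed cutoff, not part of the original example:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))           # in-spec products
points = np.vstack([points, [[8.0, 8.0]]])   # one clearly out-of-range product

# Assumed centroids, standing in for the result of a K-Means fit.
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])

# Distance from each product to its nearest centroid.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
nearest = dists.min(axis=1)

# Flag products far from every centroid for manual inspection.
threshold = nearest.mean() + 3 * nearest.std()
flagged = np.where(nearest > threshold)[0]
print(flagged)  # the injected out-of-range product's index should appear here
```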

Quality Control and Decision Making:

Products categorized as “defective” can be flagged for immediate review, allowing quick detection of manufacturing issues. Decisions can also be made based on “average” and “excellent” clusters, such as targeting different market segments or pricing strategies.

Production Optimization:

By analyzing inter-cluster differences, it becomes possible to identify production parameters causing quality variations. For instance, optimizing temperature control might increase the proportion of “excellent quality” products.
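One simple way to surface such parameter differences is to compare per-parameter means between clusters. The two production parameters below (soldering temperature and line speed) and their values are hypothetical:

```python
import numpy as np

# Hypothetical per-cluster data: rows are sensors, columns are
# production parameters (soldering temperature, line speed).
rng = np.random.default_rng(1)
excellent = rng.normal(loc=[250.0, 1.2], scale=[2.0, 0.1], size=(50, 2))
defective = rng.normal(loc=[242.0, 1.2], scale=[2.0, 0.1], size=(50, 2))

# Compare per-parameter means between the clusters: a large gap suggests
# that parameter is driving the quality difference.
gap = np.abs(excellent.mean(axis=0) - defective.mean(axis=0))
suspect = int(np.argmax(gap))
print(suspect)  # parameter 0 (temperature) shows the largest gap here
```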

HR Example:

Identifying Career Development Paths

K-Means can be used to identify career development paths based on employees’ skill levels and experience. For instance, a company can cluster employees using parameters like years of service, number of positions held, and various skill proficiencies. The results can help develop tailored career progression plans for each group.
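A sketch of such an employee clustering, assuming scikit-learn and invented feature ranges. Note the scaling step: without it, years of service would dominate the distance calculation over the 0-1 skill score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical employee features: years of service, positions held, skill score.
rng = np.random.default_rng(7)
employees = np.column_stack([
    rng.integers(0, 30, size=120),     # years of service
    rng.integers(1, 6, size=120),      # number of positions held
    rng.uniform(0.0, 1.0, size=120),   # skill proficiency (0-1)
]).astype(float)

# Scale features so no single feature dominates the Euclidean distance.
X = StandardScaler().fit_transform(employees)

groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for g in range(3):
    print(f"group {g}: {np.sum(groups == g)} employees")
```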

Defining Recruitment Target Groups

K-Means clustering can also be applied to data on potential new hires to identify different recruitment target groups. For example, if the recruitment database contains data on previous candidates (e.g., education, experience, skills), the candidates can be grouped accordingly. Each cluster can then have a specific recruitment strategy tailored to its profiles.

K-Means in Action:

Animation Example: In the attached video, you can see the K-Means algorithm in action, clustering 500 data points into 5 clusters. Here, the number of clusters (“K”) is predefined as 5, which can be adjusted as needed.

Explanation of the K-Means Algorithm:

K-Means is a commonly used unsupervised machine learning algorithm for clustering: it organizes input data into groups (clusters) so that similar data points end up together and the groups are as distinct from one another as possible.

The algorithm aims to minimize the distance between data points and their cluster centroids. The core idea is to create a predetermined number of clusters (“K”) and arrange data around these centroids.

Algorithm Steps:

1. Initialize K Centroids: Randomly select K (the desired number of clusters) centroids from the input data. The centroids represent the average position of the data points within each cluster.

2. Assign Data Points to the Nearest Centroid: Assign each data point to the cluster with the nearest centroid by calculating the Euclidean distance between data points and centroids.

3. Recalculate Centroids: After all data points are assigned to clusters, recalculate the centroids: each centroid becomes the average position of the data points in its cluster.

4. Iteration: Repeat steps 2 and 3 until the centroids stabilize (i.e., no significant changes occur in their positions). This is typically measured using a tolerance value.

5. Finalize: The algorithm stops when the centroids no longer change or the maximum number of iterations is reached.
Mathematical Background and Distance Measurement:
The K-Means algorithm is based on distance measurement. The most common method is the Euclidean distance.
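For a data point x and a centroid c, the Euclidean distance is d(x, c) = sqrt(sum_i (x_i - c_i)^2). In NumPy it can be computed either directly from this formula or with the built-in norm:

```python
import numpy as np

# Euclidean distance between a data point x and a centroid c:
# d(x, c) = sqrt(sum_i (x_i - c_i)^2)
x = np.array([3.0, 4.0])
c = np.array([0.0, 0.0])

d = np.sqrt(np.sum((x - c) ** 2))
print(d)                       # 5.0 (the classic 3-4-5 triangle)
print(np.linalg.norm(x - c))   # same value via NumPy's built-in norm
```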

Advantages of K-Means:
Simplicity: Relatively easy to implement and execute.
Efficiency: Suitable for large datasets and one of the fastest clustering algorithms.
Flexibility: Works well on many kinds of numerical data, provided the clusters are well separated.
Disadvantages of K-Means:
Predefining K: The user must specify the number of clusters, which is not always straightforward.
Sensitivity to Initial Centroids: Since centroids are chosen randomly, results can vary. Smart initialization methods, such as K-Means++, can mitigate this issue.
Limited with Non-Spherical Clusters: K-Means may not work well when the clusters are not spherical, since Euclidean distance cannot always characterize such shapes well.
Sensitive to Outliers: Outliers can significantly affect centroids and distort clustering results.
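The first disadvantage above, having to predefine K, is commonly addressed with the elbow method: run K-Means for several values of K and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. A sketch, assuming scikit-learn and synthetic data with three well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with 3 true clusters.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=m, scale=0.5, size=(100, 2))
                  for m in ([0, 0], [5, 5], [0, 5])])

# Inertia (within-cluster sum of squares) for K = 1..6: the "elbow",
# where the curve flattens, suggests a good choice of K.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
            for k in range(1, 7)]
for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

On this data the inertia drops steeply up to K = 3 and flattens afterwards, matching the three generated blobs.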
K-Means++ Initialization:
K-Means++ is an advanced initialization technique that reduces the chances of poor initial centroid selection, leading to better clustering results.
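The seeding idea can be sketched in a few lines: the first centroid is drawn uniformly at random, and each subsequent centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen, so the seeds spread out across the data. The helper below is an illustrative implementation, not scikit-learn's:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """K-Means++ seeding: first centroid uniform at random, each next
    centroid drawn with probability proportional to its squared distance
    from the nearest centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centroid.
        d2 = np.min(np.linalg.norm(
            X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [10, 10])])
init = kmeans_pp_init(X, 2, rng)
print(init)  # the two seeds tend to land in different blobs
```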

Use Cases:
Image Clustering: Identify similar pixels and group them.
Market Segmentation: Business analysts can create customer groups based on shopping habits.
Document Clustering: Group text documents based on similar topics.
Data Reduction: Serves as a vector-quantization step that summarizes data by its cluster centroids, often combined with dimensionality reduction such as principal component analysis (PCA).
Summary:
The K-Means algorithm is one of the most well-known and efficient clustering methods, particularly useful when clusters are well-separated, and the desired number of clusters is known. However, it may not perform well when clusters have irregular shapes or many outliers.

Code Snippet:
import numpy as np

class KMeansCustom:
    def __init__(self, n_clusters=5, max_iter=100, tol=1e-4):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.tol = tol
        self.centroids = None

    def fit(self, X):
        # Random initialization of the centroids
        np.random.seed(0)
        initial_indices = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[initial_indices]

        for i in range(self.max_iter):
            # Assign each point to the nearest centroid
            labels = self._assign_clusters(X)

            # Compute the new centroids; keep the old centroid if a cluster is empty
            new_centroids = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k) else self.centroids[k]
                for k in range(self.n_clusters)
            ])

            # Check for convergence
            if np.all(np.abs(new_centroids - self.centroids) < self.tol):
                break

            self.centroids = new_centroids

        return self

    def _assign_clusters(self, X):
        # Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)
        # Index of the nearest centroid for each point
        return np.argmin(distances, axis=1)

# Generate 500 random points in 2D space
np.random.seed(0)
points = np.random.rand(500, 2) * 100  # points lie in a 100x100 area

# Run the custom K-Means implementation
kmeans_custom = KMeansCustom(n_clusters=5)
kmeans_custom.fit(points)