DBSCAN


Industrial and HR Application Examples:

Industrial Example:

Monitoring Machine Conditions:

In an automobile factory, various machines are used in the production process, such as welding machines, painting robots, and assembly lines. Sensors are installed on each machine to measure parameters like temperature, vibration, pressure, and other indicators that help determine the machine's operational state.

How DBSCAN is Applied:
Data Collection:

Real-time data is collected from various machines in the factory (e.g., temperature, vibrations, operating hours, etc.).

Using DBSCAN:

The DBSCAN algorithm clusters (groups) the readings so that machines with similar operational parameters fall into the same group. A machine whose readings lie outside the dense region formed by other machines of the same type (e.g., unusually high temperature or vibration) is labeled "noise" by the algorithm, and that label can be used to generate an alert for the maintenance department.

Anomaly Detection:

If one machine's parameters differ significantly from others of the same type (e.g., a painting robot overheating), DBSCAN detects the outlier and signals potential failure.
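
Because DBSCAN applies a single neighborhood radius to every dimension, sensor readings in different units (for example degrees and millimetres per second) are usually rescaled before clustering. The following is a minimal sketch of one common option, z-score standardization. The function name `standardize` and the readings are hypothetical and only illustrate the idea; they are not taken from a real factory.

```python
import statistics


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for a constant column to avoid dividing by zero
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical readings for five painting robots: (temperature in °C, vibration in mm/s)
readings = [(61.0, 2.1), (63.0, 2.0), (62.0, 2.2), (60.0, 1.9), (95.0, 2.1)]
scaled = standardize(readings)
for raw, z in zip(readings, scaled):
    print(raw, "->", tuple(round(v, 2) for v in z))
```

After rescaling, the overheating robot stands far from the others on the temperature axis, so a single radius can separate it from its peers regardless of the original units.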

Repairs and Optimization:

The maintenance team is notified immediately about the problematic machine before it breaks down completely. This minimizes unplanned downtime, improves machine availability, and optimizes maintenance costs.

HR Example:

1. Identifying Recruitment Patterns:

To optimise recruitment processes, the DBSCAN algorithm can be used to analyse historical data on new hires. For instance, the algorithm might group hires with similar profiles or skill combinations; comparing those groups against later outcomes can show which profiles have been most successful for specific roles, thereby refining the recruitment process.

2. Predicting Employee Turnover:

Retaining employees is a key focus, and analysing turnover patterns can help. DBSCAN can identify trends associated with employee departures, such as lack of engagement, frequent absences, or low performance. This helps identify at-risk groups, allowing HR to take proactive steps to improve retention.

3. Clustering Based on Performance Evaluation:

HR databases often contain various performance metrics (e.g., number of projects, adherence to deadlines, etc.). Using DBSCAN, employees can be grouped by performance levels. The algorithm automatically detects performance patterns, highlighting exceptional or underperforming employees, which enables targeted development initiatives or reward systems.

Animated DBSCAN Clustering:

In these videos, we see the DBSCAN unsupervised clustering algorithm in action.

We observe how different clusters emerge point by point, without a predefined number of clusters. Additionally, outliers (points that do not belong to any cluster) are marked in a separate color, demonstrating anomaly detection.

In the first case, the code randomly generates 300 points on a 100x100 two-dimensional plane (with a unit distance of 1) and clusters them using an Epsilon parameter (neighborhood radius) of 7.3 units and a minimum of 4 neighbors.

In the second case, the code randomly generates points that form a sad face on a 500x500 two-dimensional plane (with a unit distance of 1) and clusters the points using an Epsilon parameter of 65 units and a minimum of 3 neighbors.

Explanation:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups data into clusters without requiring the number of clusters to be predefined. It is particularly effective when the data contains noise (outliers) and when clusters appear as dense regions separated by sparser areas.

How DBSCAN Works
The DBSCAN algorithm uses the following key concepts:

Data Point Density: DBSCAN examines the neighborhood (adjacent data points) and uses two fundamental parameters:

Eps (ε): The radius within which a data point is considered a neighbor of another data point.
MinPts: The minimum number of points within the ε-radius for a point to qualify as a core point.
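
To make the two parameters concrete, here is a minimal sketch of an ε-neighborhood query in plain Python. The helper name `eps_neighbors` and the sample points are illustrative assumptions. The point itself is counted among its neighbors, which is a common convention but not the only one.

```python
import math


def eps_neighbors(points, index, eps):
    """Return the indices of all points within eps of points[index] (itself included)."""
    center = points[index]
    return [i for i, p in enumerate(points) if math.dist(center, p) <= eps]


# Hypothetical sample: three points close together and one far away
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
eps, min_pts = 1.5, 3

neighbors = eps_neighbors(points, 0, eps)
print(neighbors)                   # indices within eps of point 0
print(len(neighbors) >= min_pts)   # whether point 0 reaches the MinPts threshold
```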

The three types of data points:

Core Points: Points with at least MinPts neighbors within the ε-radius.
Border Points: Points within the ε-radius of a core point but with fewer than MinPts neighbors.
Noise Points (Outliers): Points that are neither core points nor within the ε-radius of any core point, and therefore do not belong to any cluster.
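
The three point types can be sketched as a small classifier. This is an illustrative sketch, not a full clustering routine: the function name `classify_points` and the sample coordinates are assumptions, and neighbor counts include the point itself.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (counts include the point itself)."""
    n = len(points)
    neighborhoods = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighborhoods]
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighborhoods[i]):
            kinds.append("border")  # not dense itself, but within eps of a core point
        else:
            kinds.append("noise")
    return kinds


# Hypothetical sample: a unit square, one point hanging off its edge, one isolated point
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2.2, 1), (8, 8)]
print(classify_points(points, eps=1.5, min_pts=4))
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point at (2.2, 1) has only two points in its neighborhood, so it is not a core point, but it lies within ε of the core point (1, 1) and is therefore a border point rather than noise.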

Clustering Process:

DBSCAN selects an arbitrary unvisited data point, and if it has at least MinPts neighbors within the ε-radius (that is, it is a core point), it starts forming a cluster.
The cluster is expanded by adding all of that point's neighbors and then, repeatedly, the neighbors of every added point that is itself a core point.
Data points that do not end up in any cluster are considered noise points (outliers).

Applications
Density-Based Clustering: DBSCAN is highly effective for irregularly shaped clusters (not just spherical) and data containing noise or outliers. The number of clusters does not need to be specified in advance.

Key Application Areas:

Geographic Data Analysis: DBSCAN can be used for clustering spatial data, such as identifying cities, restaurants, or stores based on their location.

Anomaly Detection: Useful for identifying noise points or outliers in datasets, which can represent unusual events or erroneous data.

Image Processing: Applicable for grouping objects in images or identifying specific visual patterns.

Network Data Mining: In internet or transportation networks, DBSCAN is valuable for analyzing the density of routes, nodes, and the connections between them. For example, it can be used to identify densely connected pathways, critical intersections, or areas with significant activity within the network.

Advantages:

Finds clusters without predefined cluster numbers.
Effectively handles noise and outliers.
Detects clusters of various shapes (not limited to spherical ones).

Disadvantages:

Sensitive to parameter selection, especially Eps and MinPts.
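Can be slow with large datasets: a naive implementation compares every pair of points, so its running time grows quadratically with the number of points.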
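
One common heuristic for the parameter sensitivity is a sorted k-distance curve: compute each point's distance to its k-th nearest neighbor, sort the values, and look for the point where the curve rises sharply as a candidate Eps. The sketch below assumes k is chosen close to MinPts. The function name `k_distances` and the generated data are hypothetical.

```python
import math
import random


def k_distances(points, k):
    """For each point, the distance to its k-th nearest other point, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


random.seed(0)
# Hypothetical data: one tight blob of 40 points plus 5 scattered points
blob = [(random.gauss(50, 2), random.gauss(50, 2)) for _ in range(40)]
scattered = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(5)]
curve = k_distances(blob + scattered, k=3)

# Inspect the tail of the sorted curve: a sharp rise suggests a candidate eps
print([round(d, 1) for d in curve[-8:]])
```

This sketch is itself quadratic in the number of points, so it is only practical for small samples; on larger data it would typically be run on a random subset.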

Example

Suppose we are working with traffic data for a city and want to identify areas with high traffic density. Using DBSCAN, we can cluster the traffic data to pinpoint crowded, high-traffic areas (such as central intersections), while treating less-used roads or noisy data (e.g., faulty GPS coordinates) as outliers.
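
For GPS data like this, plain Euclidean distance on latitude and longitude values distorts real-world distances, so a great-circle distance is often used instead. The sketch below uses the haversine formula with an approximate mean Earth radius. The function name `haversine_km`, the coordinates, and the 0.5 km radius are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Hypothetical GPS fixes: two nearby points and one far away (e.g. a faulty reading)
fixes = [(47.4979, 19.0402), (47.4985, 19.0410), (0.0, 0.0)]
eps_km = 0.5
for i, p in enumerate(fixes):
    near = [j for j, q in enumerate(fixes) if haversine_km(p, q) <= eps_km]
    print(i, near)
```

With this distance function, Eps is expressed in kilometres, which makes the parameter easier to reason about than a radius in degrees.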

Overall, DBSCAN is a powerful tool for density-based clustering, especially in cases where the number of clusters is unknown and the data contains noise points or outliers.

Code Snippet:

import numpy as np


# Euclidean distance between two points
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))


# Neighbor search: indices of all points within eps of the given point
# (the point itself is included, because its distance to itself is 0)
def region_query(data, point_idx, eps):
    neighbors = []
    for i in range(len(data)):
        if euclidean_distance(data[point_idx], data[i]) <= eps:
            neighbors.append(i)
    return neighbors


# Expand a cluster starting from a core point
def expand_cluster(data, labels, point_idx, neighbors, cluster_id, eps, min_pts, visited):
    labels[point_idx] = cluster_id
    i = 0
    while i < len(neighbors):
        neighbor_idx = neighbors[i]
        if not visited[neighbor_idx]:
            visited[neighbor_idx] = True
            new_neighbors = region_query(data, neighbor_idx, eps)
            # A neighbor that is itself a core point extends the search
            if len(new_neighbors) >= min_pts:
                neighbors.extend(new_neighbors)
        # Unassigned points, and points earlier marked as noise, join the cluster
        if labels[neighbor_idx] == -1:
            labels[neighbor_idx] = cluster_id
        i += 1
    return labels


# DBSCAN algorithm: returns one label per point (-1 = noise, 1, 2, ... = cluster id)
def dbscan(data, eps, min_pts):
    labels = np.full(len(data), -1)
    visited = np.full(len(data), False)
    cluster_id = 0
    for i in range(len(data)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(data, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # Noise point (may later become a border point)
        else:
            cluster_id += 1
            labels = expand_cluster(data, labels, i, neighbors, cluster_id, eps, min_pts, visited)
    return labels


# Generate the points
np.random.seed(42)
data = np.random.rand(300, 2) * 100  # 300 points in two dimensions, between 0 and 100

# Set the parameters
eps = 7.3  # Radius
min_pts = 4  # Minimum number of neighbors

# Apply DBSCAN to the dataset
labels = dbscan(data, eps, min_pts)
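
The snippet above stops at the label array. As a possible next step, the sketch below summarizes such an array into a cluster count, per-cluster sizes, and a noise count. The helper `summarize_labels` and the sample labels are hypothetical; the only assumption carried over from the snippet is that -1 marks noise and positive integers mark clusters.

```python
from collections import Counter


def summarize_labels(labels, noise_label=-1):
    """Return (number of clusters, cluster sizes by id, number of noise points)."""
    counts = Counter(int(label) for label in labels)
    noise = counts.pop(noise_label, 0)
    return len(counts), dict(sorted(counts.items())), noise


# Hypothetical label array in the same format the snippet above produces
example_labels = [1, 1, 2, -1, 2, 2, 1, -1, 3, 3]
n_clusters, sizes, n_noise = summarize_labels(example_labels)
print(n_clusters, sizes, n_noise)
# → 3 {1: 3, 2: 3, 3: 2} 2
```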