Classification & Clustering (Decision Trees, K-Means)

Updated 2026-05-15

Classification and clustering are two fundamental techniques in data science and machine learning for analyzing and categorizing data. Classification assigns predefined labels to data points based on their features, while clustering groups similar data points together without any prior labeling.

Understanding these techniques is crucial for building predictive models that can make informed decisions based on data. This tutorial works through Decision Trees and K-Means Clustering in code, touches on Logistic Regression and K-Nearest Neighbors (KNN), and shows how to evaluate model performance using confusion matrices and ROC/AUC curves.

Introduction

Classification is a supervised learning technique where the goal is to predict categorical labels for new data points. Common algorithms include Decision Trees, Logistic Regression, and Support Vector Machines.

Clustering, on the other hand, is an unsupervised learning method used to group similar data points together based on their features. K-Means Clustering is one of the most widely used clustering techniques.

In this tutorial, we'll use Python's scikit-learn library, which provides simple and efficient tools for data mining and data analysis. By the end of this tutorial, you'll have a solid understanding of how to implement these techniques and evaluate their performance.

Core Content

Decision Trees

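Decision Trees are tree-like models where each internal node tests a feature, each branch corresponds to an outcome of that test (a decision rule), and each leaf node represents an outcome or class label. They are easy to interpret and, in principle, can handle both numerical and categorical data. Note that scikit-learn's implementation expects numeric input, so categorical features need to be encoded first.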

Example: Implementing a Decision Tree Classifier

Python

# script.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Predict on the testing set
predictions = clf.predict(X_test)
print(predictions)
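
Because interpretability is one of the main selling points of Decision Trees, it can help to look at the rules a fitted tree has learned. The following is a minimal sketch using scikit-learn's `export_text`; the `max_depth=2` setting and the name `shallow_tree` are illustrative assumptions chosen to keep the printed tree short, not part of the example above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# max_depth=2 is an illustrative choice that keeps the printed tree readable
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_tree.fit(X_train, y_train)

# export_text renders the learned splits as indented threshold rules
rules = export_text(shallow_tree, feature_names=list(data.feature_names))
print(rules)

# score() reports plain accuracy on the held-out test set
print("Test accuracy:", shallow_tree.score(X_test, y_test))
```

Each indented line in the printed text is a threshold test on one feature, and each `class:` line is a leaf. Limiting the depth usually trades a little accuracy for a tree that a person can read at a glance.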

K-Means Clustering

K-Means Clustering partitions data into K distinct, non-overlapping subsets (clusters). It aims to minimize the variance within each cluster, measured as the sum of squared distances from each point to the centroid of its cluster.
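
To make the objective concrete, here is a rough NumPy sketch of the classic assign-then-update iteration (often called Lloyd's algorithm). It is a teaching aid under simplifying assumptions (random initial centroids, a fixed number of iterations), not scikit-learn's actual implementation; the names `kmeans_sketch`, `nearest_centroid`, and `X_toy` are made up for this illustration.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Return, for each row of X, the index of the closest centroid."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(X, k, n_iters=10, seed=0):
    """Plain assign/update iterations; a sketch, not scikit-learn's KMeans."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (copied, so X is untouched)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid
        labels = nearest_centroid(X, centroids)
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[j] = members.mean(axis=0)
    labels = nearest_centroid(X, centroids)
    # Within-cluster sum of squared distances (the quantity K-Means minimizes)
    inertia = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, inertia

# Hypothetical toy data: two tight, well-separated groups
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids, inertia = kmeans_sketch(X_toy, k=2)
print(labels)
print(centroids)
print(f"inertia: {inertia:.3f}")
```

In practice, scikit-learn's `KMeans` adds a smarter initialization and a convergence check on top of this basic loop.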

Example: Implementing K-Means Clustering

Python

# script.py
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Create a K-Means Clustering model
kmeans = KMeans(n_clusters=4, random_state=0)

# Fit the model to the data
kmeans.fit(X)

# Get the cluster labels
labels = kmeans.labels_

# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('K-Means Clustering')
plt.show()
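
The example above sets `n_clusters=4` because the synthetic data was generated with four centers. When K is not known in advance, one common heuristic is to compare the within-cluster sum of squares (exposed by scikit-learn as `inertia_`) across several values of K and look for the point where improvements level off. The sketch below assumes the same synthetic data; the range of K values and `n_init=10` are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit one model per candidate K and record its inertia
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    print(f"K={k}: inertia={model.inertia_:.1f}")
```

Inertia generally keeps shrinking as K grows, so the lowest value is not the goal. The useful signal is the "elbow" where adding another cluster stops helping much.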

ROC/AUC Curve

The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The Area Under the Curve (AUC) provides an aggregate measure of performance across all classification thresholds.
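
Each point on a ROC curve is simply the (false positive rate, true positive rate) pair obtained at one threshold. As a small sketch of that idea, the snippet below computes both rates by hand for a few thresholds; the helper name `rates_at` and the chosen thresholds are assumptions made for illustration, and the labels and scores are small made-up values.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_scores = np.array([0.2, 0.4, 0.6, 0.8, 0.9])

def rates_at(threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    y_pred = (y_scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Lower thresholds flag more samples as positive, raising both rates
for t in (0.3, 0.5, 0.85):
    tpr, fpr = rates_at(t)
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

scikit-learn's `roc_curve` performs this sweep over the relevant thresholds for you and returns the resulting arrays of rates.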

Example: Plotting the ROC Curve and Calculating AUC

Python

# script.py
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# True labels and predicted probabilities
y_true = [0, 1, 1, 0, 1]
y_scores = [0.2, 0.4, 0.6, 0.8, 0.9]

# Compute ROC curve and AUC
fpr, tpr, _ = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
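
The scores above are hand-written. In a real workflow they would come from a fitted model's predicted probabilities. The following sketch does that with Logistic Regression on scikit-learn's built-in breast cancer dataset. The dataset, the scaling step, and `max_iter=1000` are assumptions chosen for this illustration; `roc_auc_score` computes the AUC directly without building the curve first.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A binary classification dataset bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling the features first helps the solver converge
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class
probs = model.predict_proba(X_test)[:, 1]
auc_value = roc_auc_score(y_test, probs)
print(f"AUC: {auc_value:.3f}")
```

Passing `y_test` and `probs` to `roc_curve` would give the points needed to plot this model's curve in the same way as the example above.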

Practical Example

Let's implement a complete example that combines several of the techniques discussed above. We'll use the Iris dataset to classify the species of iris flowers using a Decision Tree and evaluate its performance.

Python

# script.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Predict on the testing set
predictions = clf.predict(X_test)

# Evaluate the model: rows of the confusion matrix are true classes,
# columns are predicted classes
cm = confusion_matrix(y_test, predictions)
cr = classification_report(y_test, predictions)

print("Confusion Matrix:")
print(cm)
print("\nClassification Report:")
print(cr)

Output

Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
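
A single small test split on an easy dataset like Iris can make a model look better (or worse) than it really is. As a hedged sketch of a more robust check, the snippet below uses 5-fold cross-validation to compare the Decision Tree with a K-Nearest Neighbors classifier. The choice of `n_neighbors=5` and `cv=5` is an assumption for illustration, not a tuned setting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

# cross_val_score trains and scores each model on 5 different train/test splits
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")
```

Averaging over several folds gives a less noisy estimate than one split, and the standard deviation hints at how much the score depends on which samples land in the test set.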

Summary

Technique	Description
Decision Trees	Tree-like model for classification, easy to interpret.
Logistic Regression	Statistical method for binary classification problems.
K-Means Clustering	Unsupervised learning technique for grouping data into clusters.
Confusion Matrix	Table to evaluate the performance of a classification model.
ROC/AUC Curve	Graphical plot and metric to assess binary classifier performance.
K-Nearest Neighbors	Instance-based learning algorithm based on nearest neighbors in feature space.

What's Next?

In the next tutorial, we'll dive into TensorFlow & PyTorch Basics, where you'll learn how to build neural networks using these powerful deep learning frameworks. This will be a great transition from traditional machine learning techniques to more advanced models.

Stay tuned for more exciting content on data science and machine learning!
