Support Vector Machines (SVMs) are a powerful set of supervised learning methods used for classification, regression, and outlier detection. This tutorial will guide you through implementing SVMs from scratch, focusing on classification. By the end, you’ll understand the theory behind SVMs and how to code them without relying on external libraries. This tutorial assumes you have a good grasp of Python and linear algebra.
1. Introduction to Support Vector Machines
What is a Support Vector Machine?
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for classification or regression challenges. It works by finding the hyperplane that best divides a dataset into classes. In a two-dimensional space, this hyperplane is a line dividing a plane into two parts where each class lies on either side.
Why Use SVM?
- Effective in high-dimensional spaces: Particularly useful when the number of dimensions exceeds the number of samples.
- Memory efficient: Uses a subset of training points in the decision function (called support vectors).
- Versatile: Different kernel functions can be specified for the decision function. Common kernels include linear, polynomial, RBF (Gaussian), and sigmoid.
2. Mathematical Foundation of SVM
Hyperplanes and Support Vectors
A hyperplane in an $n$-dimensional space is a flat affine subspace of dimension $n-1$. For a 2D space, the hyperplane is a line. In SVM, we aim to find a hyperplane that maximizes the margin between the two classes. The points lying closest to the hyperplane are called support vectors.
Margin
The margin is the distance between the hyperplane and the closest data points from either class. Maximizing the margin helps improve the model’s generalization ability.
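For a separating hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ whose nearest points on either side satisfy $\mathbf{w} \cdot \mathbf{x} + b = \pm 1$, the margin width is

$$\text{margin} = \frac{2}{\|\mathbf{w}\|},$$

so maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|$, which motivates the optimization problem below.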
Mathematical Formulation
Given a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, the decision function for a linear SVM is:

$$f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$$

The goal is to find $\mathbf{w}$ and $b$ such that the margin is maximized. This can be formulated as a constrained optimization problem:

Minimize $\frac{1}{2}\|\mathbf{w}\|^2$

Subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1$ for all $i$.
The Dual Problem
Using Lagrange multipliers, the above problem can be converted into its dual form, which is easier to solve:
Maximize:

$$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$

Subject to:

$$\sum_{i=1}^{n} \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \geq 0 \ \text{for all } i$$

where the $\alpha_i$ are the Lagrange multipliers. In the soft-margin variant implemented below, the regularization parameter $C$ caps each multiplier, giving the box constraint $0 \leq \alpha_i \leq C$.
Kernel Trick
The kernel trick allows SVM to create non-linear decision boundaries by transforming the input space into a higher-dimensional space where a linear separation is possible. Common kernels include:
- Linear Kernel: $K(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1 \cdot \mathbf{x}_2$
- Polynomial Kernel: $K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2 + c)^d$
- Radial Basis Function (RBF) Kernel: $K(\mathbf{x}_1, \mathbf{x}_2) = \exp\left(-\gamma \|\mathbf{x}_1 - \mathbf{x}_2\|^2\right)$
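As a quick illustration of the kernel trick, the sketch below checks that the degree-2 polynomial kernel equals an ordinary dot product in an explicitly expanded feature space (the helper poly2_features and the sample points are purely illustrative):

import numpy as np

def poly2_features(x):
    # Explicit feature map whose dot product reproduces (x . z + 1)^2 for 2-D inputs
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

kernel_value = (np.dot(a, b) + 1) ** 2                          # kernel computed in the original space
explicit_value = np.dot(poly2_features(a), poly2_features(b))   # dot product in the expanded space
print(kernel_value, explicit_value)                             # both print 25.0

The kernel evaluates the same quantity without ever constructing the higher-dimensional vectors, which is what makes non-linear SVMs tractable.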
3. Implementing SVM from Scratch
Data Preprocessing
Before we start coding the SVM, let’s preprocess the data. We’ll use the Iris dataset for simplicity.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = load_iris()
X = iris.data[:100, :2] # We will use only the first two features and two classes for simplicity
y = iris.target[:100]
# Convert the labels to {-1, 1}
y = np.where(y == 0, -1, 1)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Kernel Functions
Let’s implement some kernel functions:
def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, degree=3, coef0=1):
    return (np.dot(x1, x2) + coef0) ** degree

def rbf_kernel(x1, x2, gamma=0.1):
    return np.exp(-gamma * np.linalg.norm(x1 - x2) ** 2)
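A quick sanity check of these functions on two arbitrary points (values chosen only for illustration):

x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 3.0])
print(linear_kernel(x1, x2))      # 8.0
print(polynomial_kernel(x1, x2))  # (8 + 1) ** 3 = 729.0
print(rbf_kernel(x1, x2))         # exp(-0.1 * 2) ≈ 0.819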
Optimization Problem
The optimization problem involves finding the values of the Lagrange multipliers $\alpha_i$ that maximize the dual objective. We will use the Sequential Minimal Optimization (SMO) algorithm to solve it.
Sequential Minimal Optimization (SMO)
SMO breaks the large quadratic programming problem into a series of tiny subproblems that optimize just two multipliers at a time, each of which can be solved analytically. This significantly simplifies the optimization process.
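Concretely, for a chosen pair of multipliers $(\alpha_i, \alpha_j)$ the simplified SMO variant implemented below uses the prediction errors $E_k = f(\mathbf{x}_k) - y_k$ and the update

$$\eta = 2K(\mathbf{x}_i, \mathbf{x}_j) - K(\mathbf{x}_i, \mathbf{x}_i) - K(\mathbf{x}_j, \mathbf{x}_j), \qquad \alpha_j \leftarrow \alpha_j - \frac{y_j (E_i - E_j)}{\eta},$$

after which $\alpha_j$ is clipped to the box bounds $[L, H]$ and $\alpha_i$ is adjusted so that the constraint $\sum_i \alpha_i y_i = 0$ still holds.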
class SVM:
    def __init__(self, kernel=linear_kernel, C=1.0, tol=1e-3, max_passes=5):
        self.kernel = kernel
        self.C = C
        self.tol = tol
        self.max_passes = max_passes

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.alpha = np.zeros(n_samples)
        self.b = 0
        self.X = X
        self.y = y
        passes = 0
        while passes < self.max_passes:
            num_changed_alphas = 0
            for i in range(n_samples):
                E_i = self._decision_function(X[i]) - y[i]
                if (y[i] * E_i < -self.tol and self.alpha[i] < self.C) or (y[i] * E_i > self.tol and self.alpha[i] > 0):
                    # Pick a second index j != i at random
                    j = np.random.randint(0, n_samples)
                    while j == i:
                        j = np.random.randint(0, n_samples)
                    E_j = self._decision_function(X[j]) - y[j]
                    alpha_i_old = self.alpha[i]
                    alpha_j_old = self.alpha[j]
                    # Compute the bounds L and H for alpha_j
                    if y[i] != y[j]:
                        L = max(0, self.alpha[j] - self.alpha[i])
                        H = min(self.C, self.C + self.alpha[j] - self.alpha[i])
                    else:
                        L = max(0, self.alpha[j] + self.alpha[i] - self.C)
                        H = min(self.C, self.alpha[j] + self.alpha[i])
                    if L == H:
                        continue
                    eta = 2 * self.kernel(X[i], X[j]) - self.kernel(X[i], X[i]) - self.kernel(X[j], X[j])
                    if eta >= 0:
                        continue
                    # Update alpha_j and clip it to [L, H]
                    self.alpha[j] -= y[j] * (E_i - E_j) / eta
                    self.alpha[j] = np.clip(self.alpha[j], L, H)
                    if abs(self.alpha[j] - alpha_j_old) < 1e-5:
                        continue
                    # Update alpha_i in the opposite direction
                    self.alpha[i] += y[i] * y[j] * (alpha_j_old - self.alpha[j])
                    # Update the bias term
                    b1 = self.b - E_i - y[i] * (self.alpha[i] - alpha_i_old) * self.kernel(X[i], X[i]) - y[j] * (self.alpha[j] - alpha_j_old) * self.kernel(X[i], X[j])
                    b2 = self.b - E_j - y[i] * (self.alpha[i] - alpha_i_old) * self.kernel(X[i], X[j]) - y[j] * (self.alpha[j] - alpha_j_old) * self.kernel(X[j], X[j])
                    if 0 < self.alpha[i] < self.C:
                        self.b = b1
                    elif 0 < self.alpha[j] < self.C:
                        self.b = b2
                    else:
                        self.b = (b1 + b2) / 2
                    num_changed_alphas += 1
            if num_changed_alphas == 0:
                passes += 1
            else:
                passes = 0

    def _decision_function(self, x):
        # f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, evaluated for a single sample x
        result = self.b
        for alpha_i, y_i, x_i in zip(self.alpha, self.y, self.X):
            if alpha_i > 0:
                result += alpha_i * y_i * self.kernel(x_i, x)
        return result

    def predict(self, X):
        # Apply the decision function row by row so any kernel works
        return np.sign(np.array([self._decision_function(x) for x in np.atleast_2d(X)]))
4. Training and Testing the SVM
Now, let’s train our SVM model and test it on the test data.
# Initialize the SVM with a linear kernel
svm = SVM(kernel=linear_kernel, C=1.0)
# Train the SVM
svm.fit(X_train, y_train)
# Predict the test data
y_pred = svm.predict(X_test)
# Calculate the accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy * 100:.2f}%')
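Since only the points with non-zero multipliers act as support vectors, you can also inspect how many of them the trained model relies on (the 1e-5 threshold here is an arbitrary numerical cutoff):

support_mask = svm.alpha > 1e-5
print(f'Support vectors: {np.sum(support_mask)} of {len(svm.alpha)} training points')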
5. Evaluation Metrics
While accuracy is a common metric, it’s important to consider other metrics, especially when dealing with imbalanced datasets.
Precision, Recall, and F1-Score
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1-Score: {f1:.2f}')
Confusion Matrix
A confusion matrix provides a detailed breakdown of prediction results.
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(conf_matrix)
6. Optimizations and Practical Tips
Handling Imbalanced Data
When dealing with imbalanced data, you can give the minority class a larger penalty by scaling $C$ with per-class weights, as in the modified implementation below.
class SVM:
    def __init__(self, kernel=linear_kernel, C=1.0, tol=1e-3, max_passes=5, class_weight=None):
        self.kernel = kernel
        self.C = C
        self.tol = tol
        self.max_passes = max_passes
        self.class_weight = class_weight

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.alpha = np.zeros(n_samples)
        self.b = 0
        self.X = X
        self.y = y
        # Per-sample penalty: C is scaled by the weight of each sample's class
        if self.class_weight:
            weight = np.vectorize(self.class_weight.get)(y)
        else:
            weight = np.ones(n_samples)
        passes = 0
        while passes < self.max_passes:
            num_changed_alphas = 0
            for i in range(n_samples):
                E_i = self._decision_function(X[i]) - y[i]
                if (y[i] * E_i < -self.tol and self.alpha[i] < self.C * weight[i]) or (y[i] * E_i > self.tol and self.alpha[i] > 0):
                    j = np.random.randint(0, n_samples)
                    while j == i:
                        j = np.random.randint(0, n_samples)
                    E_j = self._decision_function(X[j]) - y[j]
                    alpha_i_old = self.alpha[i]
                    alpha_j_old = self.alpha[j]
                    # Bounds now use the class-weighted penalty C * weight[j]
                    if y[i] != y[j]:
                        L = max(0, self.alpha[j] - self.alpha[i])
                        H = min(self.C * weight[j], self.C * weight[j] + self.alpha[j] - self.alpha[i])
                    else:
                        L = max(0, self.alpha[j] + self.alpha[i] - self.C * weight[j])
                        H = min(self.C * weight[j], self.alpha[j] + self.alpha[i])
                    if L == H:
                        continue
                    eta = 2 * self.kernel(X[i], X[j]) - self.kernel(X[i], X[i]) - self.kernel(X[j], X[j])
                    if eta >= 0:
                        continue
                    self.alpha[j] -= y[j] * (E_i - E_j) / eta
                    self.alpha[j] = np.clip(self.alpha[j], L, H)
                    if abs(self.alpha[j] - alpha_j_old) < 1e-5:
                        continue
                    self.alpha[i] += y[i] * y[j] * (alpha_j_old - self.alpha[j])
                    b1 = self.b - E_i - y[i] * (self.alpha[i] - alpha_i_old) * self.kernel(X[i], X[i]) - y[j] * (self.alpha[j] - alpha_j_old) * self.kernel(X[i], X[j])
                    b2 = self.b - E_j - y[i] * (self.alpha[i] - alpha_i_old) * self.kernel(X[i], X[j]) - y[j] * (self.alpha[j] - alpha_j_old) * self.kernel(X[j], X[j])
                    if 0 < self.alpha[i] < self.C * weight[i]:
                        self.b = b1
                    elif 0 < self.alpha[j] < self.C * weight[j]:
                        self.b = b2
                    else:
                        self.b = (b1 + b2) / 2
                    num_changed_alphas += 1
            if num_changed_alphas == 0:
                passes += 1
            else:
                passes = 0

    def _decision_function(self, x):
        # Same decision function as the earlier implementation
        result = self.b
        for alpha_i, y_i, x_i in zip(self.alpha, self.y, self.X):
            if alpha_i > 0:
                result += alpha_i * y_i * self.kernel(x_i, x)
        return result

    def predict(self, X):
        return np.sign(np.array([self._decision_function(x) for x in np.atleast_2d(X)]))
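For example, to penalize errors on the positive class twice as heavily as on the negative class, you could pass a weight dictionary like the purely illustrative one below:

# Illustrative weights: class +1 gets twice the penalty of class -1
weighted_svm = SVM(kernel=linear_kernel, C=1.0, class_weight={-1: 1.0, 1: 2.0})
weighted_svm.fit(X_train, y_train)
y_pred_weighted = weighted_svm.predict(X_test)
print(f'Weighted accuracy: {np.mean(y_pred_weighted == y_test) * 100:.2f}%')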
Feature Scaling
Feature scaling can significantly impact the performance of SVMs. Ensure that your data is standardized or normalized before training.
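The preprocessing step above already standardizes the features; if your data is better suited to a bounded range, min-max normalization is a common alternative (a minimal sketch using scikit-learn's MinMaxScaler, fit on the training split only):

from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to [0, 1] using statistics from the training set only
minmax = MinMaxScaler()
X_train_scaled = minmax.fit_transform(X_train)
X_test_scaled = minmax.transform(X_test)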
Parameter Tuning
Tuning parameters like $C$, the kernel type, and kernel-specific parameters (such as $\gamma$ or the polynomial degree) is crucial for achieving optimal performance. Use grid search or randomized search to find the best combination; the example below uses scikit-learn's SVC for convenience.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'poly', 'rbf'],
    'gamma': [0.001, 0.01, 0.1, 1],
    'degree': [2, 3, 4]
}
# Initialize the SVM model
svc = SVC()
# Initialize the grid search
grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy')
# Fit the grid search
grid_search.fit(X_train, y_train)
# Print the best parameters and score
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Score: {grid_search.best_score_}')
7. Conclusion
In this tutorial, we’ve implemented a Support Vector Machine (SVM) from scratch. We’ve covered the mathematical foundations, implemented the SMO algorithm, and explored how to handle real-world challenges like imbalanced data and feature scaling. By understanding the inner workings of SVMs, you can better leverage their power and apply them effectively in various machine learning tasks.
Further Reading
- Pattern Recognition and Machine Learning by Christopher M. Bishop
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Support Vector Machines by Ingo Steinwart and Andreas Christmann
Practice Problems
- Implement SVM with a different kernel (e.g., polynomial or RBF) from scratch.
- Apply your SVM implementation to a different dataset (e.g., the digits dataset).
- Experiment with different values of $C$ and observe how it affects the decision boundary and performance.
By following this tutorial and experimenting further, you’ll gain a deep understanding of SVMs and be well-equipped to use them in your machine learning projects.