Polynomial Regression: A Comprehensive Guide
Polynomial Regression is a form of regression analysis where the relationship between the independent variable $x$ and the dependent variable $y$ is modeled as an $n$-th degree polynomial. Unlike linear regression, which assumes a linear relationship between $x$ and $y$, polynomial regression can model non-linear relationships by fitting a curve to the data.
Polynomial regression can capture more complex relationships between the input features and the target, making it a valuable tool when data shows a curvilinear pattern.
Key Concepts
1. Polynomial Function
Polynomial regression fits the data using an equation of the form:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n + \epsilon$$

Where:
- $y$ is the dependent variable (target).
- $x$ is the independent variable (feature).
- $\beta_0$ is the intercept (constant term).
- $\beta_1, \beta_2, \ldots, \beta_n$ are the coefficients (weights).
- $\epsilon$ is the error term.

Here, the relationship between $x$ and $y$ is no longer a straight line but a polynomial curve of degree $n$.
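To make this form concrete, here is a minimal NumPy sketch that evaluates a degree-2 polynomial of this kind; the coefficient values are made up purely for illustration:

import numpy as np

# Made-up coefficients for y = beta_0 + beta_1*x + beta_2*x^2
beta_0, beta_1, beta_2 = 1.0, 0.5, 0.2

x = np.array([0.0, 1.0, 2.0, 3.0])

# Evaluate the polynomial term by term
y = beta_0 + beta_1 * x + beta_2 * x ** 2
print(y)  # [1.  1.7 2.8 4.3]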
2. Polynomial Degree
- Degree 1 (Linear Regression): When $n = 1$, polynomial regression reduces to linear regression.
- Higher Degrees: When $n > 1$, polynomial regression allows for a curved fit. As $n$ increases, the curve becomes more flexible and can fit the data more closely.
However, increasing the degree of the polynomial also increases the risk of overfitting. Overfitting occurs when the model becomes too complex and captures noise in the data, leading to poor generalization to unseen data.
3. Overfitting and Underfitting
- Overfitting: A higher-degree polynomial may fit the training data perfectly but fail to generalize to new, unseen data. This happens when the model captures not just the true underlying patterns but also the noise and fluctuations in the data.
- Underfitting: A lower-degree polynomial may fail to capture the underlying trends in the data, resulting in a model that cannot make accurate predictions, even on the training data.
The goal is to find a polynomial degree that provides the best balance between bias (underfitting) and variance (overfitting).
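As a small illustration of this trade-off, the following sketch fits polynomials of several degrees to synthetic noisy quadratic data (all values are invented for demonstration). The high-degree fit will typically drive the training error down while the test error grows:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data: a quadratic trend plus noise
X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.normal(0, 1.0, 40)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 2, 15):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    train_mse = mean_squared_error(y_train, model.predict(poly.transform(X_train)))
    test_mse = mean_squared_error(y_test, model.predict(poly.transform(X_test)))
    print(f"degree={degree}: train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")

Degree 1 underfits the quadratic trend, degree 2 matches it, and degree 15 usually fits the training noise at the expense of the test error.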
When to Use Polynomial Regression
Polynomial regression is particularly useful when:
- The data shows a curved relationship between the features and target variable.
- Linear regression does not fit well, and a more flexible model is needed to capture non-linear patterns.
- You want to model interactions between features or higher-order terms (see the short sketch below).
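For the interaction case, here is a brief sketch (the feature names and values are hypothetical) showing how a degree-2 expansion of two features adds squared and interaction terms; get_feature_names_out is available in recent scikit-learn versions:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical two-feature input, e.g. house size and age
X = np.array([[1500, 10],
              [2400, 3]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# The expansion includes each feature, its square, and the interaction size*age
print(poly.get_feature_names_out(["size", "age"]))
# ['size' 'age' 'size^2' 'size age' 'age^2']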
Example:
Suppose you are trying to predict the price of a house based on its age. The relationship between age and price might not be linear (e.g., the price may decrease rapidly for newer houses and level off for older houses). Polynomial regression could capture this curved relationship more effectively than a linear model.
Example: Polynomial Regression in Python
Let’s work through an example where we use polynomial regression to predict house prices based on the size of the house. The relationship is expected to be non-linear.
Code Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Sample data: House size (in sq ft) and corresponding prices
X = np.array([[1500], [1800], [2400], [3000], [3500], [4000]]) # Size of the house (feature)
y = np.array([400000, 450000, 550000, 600000, 650000, 700000]) # Price of the house (target)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Linear regression (for comparison)
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)
# Polynomial regression (degree 2)
poly_features = PolynomialFeatures(degree=2)
X_poly = poly_features.fit_transform(X_train) # Transform the feature matrix
# Fit the polynomial regression model
poly_regressor = LinearRegression()
poly_regressor.fit(X_poly, y_train)
# Predict using linear regression and polynomial regression
y_pred_linear = linear_regressor.predict(X_test)
y_pred_poly = poly_regressor.predict(poly_features.transform(X_test))
# Evaluate the models using Mean Squared Error
mse_linear = mean_squared_error(y_test, y_pred_linear)
mse_poly = mean_squared_error(y_test, y_pred_poly)
# Print the Mean Squared Error of both models
print(f"Mean Squared Error (Linear Regression): {mse_linear}")
print(f"Mean Squared Error (Polynomial Regression): {mse_poly}")
# Visualize the results
# Plot linear regression fit
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, linear_regressor.predict(X), color='red', label='Linear Fit')
# Plot polynomial regression fit
X_grid = np.linspace(X.min(), X.max(), 200).reshape(-1, 1) # Dense grid for a smooth curve
plt.plot(X_grid, poly_regressor.predict(poly_features.transform(X_grid)), color='green', label='Polynomial Fit')
plt.xlabel('House Size (sq ft)')
plt.ylabel('Price ($)')
plt.title('Polynomial Regression vs Linear Regression')
plt.legend()
plt.show()
Explanation:
- Data: The dataset consists of house sizes (independent variable) and their corresponding prices (dependent variable).
- Train-Test Split: We split the data into training and testing sets to evaluate model performance.
- Polynomial Transformation: The PolynomialFeatures class from scikit-learn is used to transform the input feature into polynomial form; in this example, we use a polynomial of degree 2 (a short sketch of what this transform produces follows this list).
- Model Training: We train both a linear regression model and a polynomial regression model (of degree 2).
- Prediction and Evaluation: The models make predictions on the test data, and the Mean Squared Error (MSE) is calculated to evaluate their performance.
- Visualization: We plot both the linear regression and polynomial regression models for comparison. The polynomial regression should better capture the curve in the data.
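For intuition about what the transform in the example produces, here is a small sketch using two of the house sizes from the data above:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1500], [1800]])  # two house sizes from the example

# A degree-2 transform yields rows of [1, x, x^2]
print(PolynomialFeatures(degree=2).fit_transform(X))
# Each row is [1, size, size**2], e.g. [1, 1500, 2250000]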
Output:
The plot shows the actual data points along with two fits:
- The red line represents the linear regression fit, which is a straight line.
- The green curve represents the polynomial regression fit, which follows the curvature in the data more closely.
Interpretation and Limitations
Interpretation:
- Linear Model: In simple linear regression, we fit a straight line through the data, which may underfit the data if the relationship between the feature and target is non-linear.
- Polynomial Model: Polynomial regression, with a degree greater than 1, introduces curvature to the model, allowing it to better fit data with non-linear patterns. However, as the degree increases, the model may become more sensitive to noise and result in overfitting.
Choosing the Degree:
- Underfitting: If the polynomial degree is too low, the model might not capture the complexity of the data, leading to underfitting.
- Overfitting: If the polynomial degree is too high, the model may fit the noise in the data, leading to overfitting. It is important to test different polynomial degrees and select the one that minimizes the generalization error, as in the cross-validation sketch below.
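One common way to pick the degree is k-fold cross-validation. The sketch below scores several candidate degrees with scikit-learn's cross_val_score; the synthetic data here stands in for a real dataset:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic noisy quadratic data (stand-in for a real dataset)
X = np.sort(rng.uniform(0, 5, 60)).reshape(-1, 1)
y = 2.0 - 1.5 * X.ravel() + 0.8 * X.ravel() ** 2 + rng.normal(0, 1.0, 60)

# Score each candidate degree with 5-fold cross-validation
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"degree={degree}: mean CV MSE={-scores.mean():.3f}")

The degree with the lowest mean cross-validation MSE is the natural choice; ties should go to the lower degree.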
Regularization Techniques
To prevent overfitting, you can use regularization techniques such as Ridge or Lasso Regression with polynomial regression. These techniques add a penalty to the model complexity (e.g., large polynomial coefficients), helping to improve generalization.
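As a sketch of this idea, Ridge regression can be combined with PolynomialFeatures in a scikit-learn pipeline; the degree and alpha values here are illustrative choices, not tuned settings:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

# Reusing the house-size data from the example above
X = np.array([[1500], [1800], [2400], [3000], [3500], [4000]])
y = np.array([400000, 450000, 550000, 600000, 650000, 700000])

# Scaling matters before Ridge: the penalty is sensitive to feature magnitude
model = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    Ridge(alpha=1.0),  # alpha controls the penalty strength (an illustrative value)
)
model.fit(X, y)
print(model.predict(np.array([[2000]])))

In practice, alpha (and the degree) would themselves be tuned, for example with the cross-validation approach shown earlier.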
Conclusion
Polynomial regression is a valuable tool for modeling non-linear relationships in data. By increasing the degree of the polynomial, you can capture more complex patterns in the data. However, careful attention must be paid to avoid overfitting, especially as the degree of the polynomial increases. Regularization and cross-validation are important techniques to help mitigate the risks of overfitting.