Data Splitting: Training, Validation, and Test Sets in Machine Learning
In machine learning, splitting the dataset into training, validation, and test sets is a critical step for assessing how well your model generalizes to new, unseen data. This process lets you measure the model's performance during development and check that it is not overfitting to the training data.
In this guide, we will discuss the importance of each data split, how to properly perform data splitting, and common practices used in machine learning workflows.
1. Why Is Data Splitting Important?
Data splitting ensures that the model is evaluated on data it has never seen during training. This helps to:
- Detect overfitting: Testing the model on data it hasn't seen before shows whether it is too complex and has simply memorized the training data.
- Unbiased evaluation: A held-out set provides an unbiased estimate of the model's performance on unseen data.
- Hyperparameter tuning: The validation set lets you tune the model's hyperparameters without touching the test set, which should be reserved for the final evaluation.
Without proper data splitting, you may end up with misleading performance metrics, leading to poor decisions about model improvements.
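To make the overfitting point concrete, one quick check is to compare accuracy on the training data with accuracy on held-out data. The sketch below is only illustrative: it uses synthetic data from make_classification and a deliberately deep decision tree so the gap between the two scores is easy to see.
Code Sketch (Detecting Overfitting with a Held-Out Set):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
# Hold out 30% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# A deep, unconstrained tree can memorize the training data
model = DecisionTreeClassifier(max_depth=None, random_state=42)
model.fit(X_train, y_train)
# A large gap between these two scores is a sign of overfitting
print("Training accuracy:", model.score(X_train, y_train))
print("Held-out accuracy:", model.score(X_test, y_test))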
2. Types of Data Splits
2.1. Training Set
The training set is the subset of data used to train the machine learning model. It is where the model learns to identify patterns in the data. During training, the model adjusts its internal parameters to minimize the error between the predicted and actual outputs.
- Size: Typically, the training set makes up the largest portion of the dataset, often around 60%-80% of the total data.
- Purpose: This set is used for learning the relationships in the data; the model's weights (parameters) are updated based on it (see the sketch below).
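As a small, illustrative sketch of this point (synthetic data, with logistic regression chosen only as an example), the parameters below are learned entirely from the training split passed to fit().
Code Sketch (Parameters Learned from the Training Set):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Synthetic data; the model only ever sees the training portion during fit()
X, y = make_classification(n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Fitting adjusts the model's internal parameters (weights) to reduce training error
model = LogisticRegression()
model.fit(X_train, y_train)
# The learned weights come entirely from X_train and y_train
print("Learned coefficients:", model.coef_)
print("Learned intercept:", model.intercept_)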
2.2. Validation Set
The validation set is used to evaluate the model's performance during the training phase, specifically for hyperparameter tuning. This set allows you to test the model's ability to generalize without using the test data. The validation set provides insight into how the model is performing and whether it is overfitting or underfitting.
- Size: Typically, the validation set accounts for around 10%-20% of the data.
- Purpose: It is used during training to tune hyperparameters (such as the learning rate or the number of layers) and to adjust the model based on its validation performance; a minimal sketch of this loop follows.
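Below is a minimal sketch of that tuning loop. The data is synthetic, and max_depth of a decision tree stands in for whatever hyperparameter you are tuning; both are assumptions for illustration, not a fixed recipe. Each candidate is trained on the training set, and the validation score decides which one to keep.
Code Sketch (Tuning a Hyperparameter on the Validation Set):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Synthetic data split into training and validation sets (the test set is left out of this sketch)
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)
# Try several values of one hyperparameter; the validation score decides which to keep
best_depth, best_score = None, -1.0
for depth in [1, 2, 3, 5, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)          # learn from the training set
    score = model.score(X_val, y_val)    # compare candidates on the validation set
    print(f"max_depth={depth}: validation accuracy = {score:.3f}")
    if score > best_score:
        best_depth, best_score = depth, score
print("Selected max_depth:", best_depth)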
2.3. Test Set
The test set is used to evaluate the model's final performance after training and hyperparameter tuning. It provides an estimate of how the model will perform on new, unseen data. The test set should never be used during the training process or to tune hyperparameters, as doing so would introduce bias into the model evaluation.
- Size: Typically, the test set accounts for 10%-30% of the dataset.
- Purpose: It is used for the final evaluation after the model has been trained and validated; this is the data that gives you an unbiased assessment of the model's generalization ability (see the sketch below).
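The sketch below shows the test set being touched exactly once, after training and validation-based tuning are finished. The data is synthetic and max_depth=3 is assumed to be the value already chosen on the validation set; both are illustrative assumptions.
Code Sketch (One-Time Final Evaluation on the Test Set):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Synthetic data split roughly 60/20/20 into train / validation / test
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)
# Assume max_depth=3 was already chosen using the validation set
final_model = DecisionTreeClassifier(max_depth=3, random_state=0)
final_model.fit(X_train, y_train)
# The test set is used once, for the final, unbiased performance estimate
print("Final test accuracy:", final_model.score(X_test, y_test))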
3. Data Splitting Strategies
3.1. Holdout Method
The holdout method is the most basic form of data splitting: the dataset is divided once into training, validation, and test sets. It is simple and effective, but with a small dataset it can waste data and produce unstable performance estimates.
Process:
- Split the dataset into training, validation, and test sets.
- Train the model on the training set.
- Use the validation set for model tuning (hyperparameter selection).
- Finally, evaluate the model on the test set.
Code Example (Holdout Method):
from sklearn.model_selection import train_test_split
import pandas as pd
# Sample data
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60],
    'Salary': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000],
    'Purchased': [0, 1, 1, 0, 1, 0, 1, 0]  # Target variable
})
# Split the data: 60% training, 40% temporary (later divided into validation and test)
X = df[['Age', 'Salary']] # Features
y = df['Purchased'] # Target
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
# Split the remaining data into validation and test sets (50% validation, 50% test of the 40% temp set)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print("Training set:")
print(X_train, y_train)
print("\nValidation set:")
print(X_val, y_val)
print("\nTest set:")
print(X_test, y_test)
3.2. k-Fold Cross-Validation
In k-fold cross-validation, the dataset is split into k subsets (or "folds"). The model is trained and validated k times, each time using a different fold for validation and the remaining folds for training. This method provides a more reliable estimate of the model's performance, especially when the dataset is small.
Process:
- Split the dataset into k folds.
- Train the model on k-1 folds and validate it on the remaining fold.
- Repeat this process k times, ensuring each fold gets used as the validation set exactly once.
- Average the model performance across all k iterations to get the final performance estimate.
Code Example (k-Fold Cross-Validation):
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# Sample data (10 rows, 5 per class: the stratified folds used by default for classifiers
# require at least 5 members of each class for 5-fold cross-validation)
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Salary': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000],
    'Purchased': [0, 1, 1, 0, 1, 0, 1, 0, 1, 0]  # Target variable (5 of each class)
})
# Features and target variable
X = df[['Age', 'Salary']]
y = df['Purchased']
# Initialize a model (e.g., RandomForest)
model = RandomForestClassifier()
# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)
# Display the cross-validation scores and the average score
print("Cross-validation scores:", cv_scores)
print("Average cross-validation score:", cv_scores.mean())
3.3. Stratified Split
In some cases, particularly with imbalanced datasets (where one class significantly outnumbers the other), a stratified split ensures that the distribution of the target variable is similar across the training, validation, and test sets. This can be especially important for classification problems.
Process:
- Ensure that each data split (training, validation, test) has a proportional representation of each class.
Code Example (Stratified Split):
from sklearn.model_selection import train_test_split
import pandas as pd
# Sample data with an imbalanced target variable (10 rows; each class needs at least
# two members in every split for the nested stratified split below to work)
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Salary': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000],
    'Purchased': [0, 1, 1, 0, 1, 1, 0, 1, 0, 1]  # Imbalanced target (4 zeros, 6 ones)
})
X = df[['Age', 'Salary']]
y = df['Purchased']
# Perform a stratified split to maintain the class distribution
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
# Split the remaining data into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
print("Training set:")
print(y_train.value_counts())
print("\nValidation set:")
print(y_val.value_counts())
print("\nTest set:")
print(y_test.value_counts())
4. Best Practices for Data Splitting
- Avoid Data Leakage: Never use the test data for training or validation. The test set should be reserved for final evaluation.
- Stratified Splitting for Imbalanced Data: If your target variable is imbalanced, always use stratified splitting to ensure that the training, validation, and test sets have a similar distribution of classes.
- Randomization: Shuffle the dataset before splitting to avoid any ordering biases, especially if the data is sorted.
- Consider Cross-Validation: For small datasets, consider using k-fold cross-validation to get a more robust estimate of model performance.
- Split Ratios: A common starting point is 80% for training, 10% for validation, and 10% for testing, though the right ratio depends on dataset size (see the sketch after this list).
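As a sketch of the last two points (shuffling and the 80/10/10 ratio; the synthetic data here is purely illustrative), two chained train_test_split calls with shuffle=True produce the three sets in those proportions.
Code Sketch (Shuffled 80/10/10 Split):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=5, random_state=7)
# First split off 20%; shuffle=True (the default) guards against ordering bias
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=7)
# Split that 20% in half: 10% validation and 10% test of the original data
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, shuffle=True, random_state=7)
print("Train / validation / test sizes:", len(X_train), len(X_val), len(X_test))  # 800 / 100 / 100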
5. Conclusion
Proper data splitting is essential for building reliable machine learning models that generalize well to unseen data. By splitting the data into training, validation, and test sets, you can ensure that your model is evaluated correctly, and that hyperparameters are tuned without introducing bias. Always keep the test set aside for final evaluation, and use techniques like k-fold cross-validation or stratified splitting to improve the robustness of your model evaluation.