Train, Test And Validate Datasets In Machine Learning

The aim of this article is to help you understand the difference between testing, training and validating machine learning datasets.

Training Dataset

It is the dataset that we use to train an ML model. The model sees and learns from the training dataset.

Validation Dataset

The validation set is used to evaluate a particular model. This data is used by machine learning engineers to fine-tune the model’s hyperparameters. As a result, the model encounters this data on occasion, but never “learns” from it. The validation set findings are used to update the hyperparameters. Thus, the validation set influences a model indirectly.

Test Dataset

The test machine learning dataset serves as the gold standard for evaluating the model. It is only utilized when a model has been properly trained (using the validation and train sets). In most cases, the test set is utilized to compare rival models. In general, the test set is well-curated. It provides properly sampled data spanning the numerous classes that the model might face in the real world.

machine learning datasets

Importance Of Splitting

Supervised machine learning algorithm is about creating precise models which predict the target variable consistently with the inputs given to the model.

Now, carrying on with our learning related to machine learning datasets, there are multiple ways to measure the precision of your model. It depends on the kind of problem you are trying to solve. For regression, we may look at RMSE, absolute error, etc. For classification, we may look at precision, recall, etc.

We usually need unbiased evaluation to measure these properly, assess and validate the predictive performance. This means we cannot evaluate the predictive performance of the model with the same data that is used for training, hence we need fresh data that hasn’t been fed to the model. This can be accomplished by splitting the dataset.

How To Split Dataset Into Validation, Test And Train

We can simply use SKlearn’s module model_selection.train_test_split twice.

  • First, let’s split the data into train set and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  • Second, split the train dataset again into train and validation

X_train, X_val, y_train, y_val  = train_test_split(X_train, y_train, test_size=0.25, random_state=42) (0.25 x 0.8 = 0.2)

Another Way To Split Dataset

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

This will produce a 60%, 20%, 20% for training, test & validation, sets.

Summary

In this article you learned how to utilize SKlearn’s train test split(). You also learned that using data that hasn’t been used for model fitting is the best way to get an impartial estimate of the prediction performance of a machine learning model. Therefore, you must divide your dataset into different subsets.


References