Pre-processing the Data in Machine Learning

Data preparation, or pre-processing, is the process of organizing raw data so that it can be used in a Machine Learning system. It is the first stage in building an ML model, and perhaps the most important one.

When we start an ML project, we rarely come across data that is already clean and prepared. Before performing any operation on the data, it must be cleaned and put into a usable format. Data pre-processing is the task that handles this.

Why is pre-processing needed in Machine Learning?

  • Real-world data often contains noise and null values, and may be in a format that ML models cannot use directly.
  • Pre-processing cleans the data and makes it suitable for an ML model, which improves the model’s accuracy and reliability.
  • The steps are:
    1. Collect data
    2. Import the necessary libraries and packages
    3. Import the desired data
    4. Find and handle the missing or null values
    5. Encode the categorical quantities
    6. Split the whole data into train and test sets
    7. Scale the features

1] Collect data

  • The first requirement for building an ML model is a dataset, because a model learns entirely from data. A dataset is a collection of data arranged in a particular format for a specific problem.
  • For example, the dataset needed to build a model for a business purpose is different from the dataset needed for a model about kidney patients.
  • In that sense, every dataset is different from the others.

2] Import the necessary libraries

  • To perform data pre-processing in Python, we need to import some predefined libraries.
  • Each library serves a specific purpose. For the pre-processing phase, we will use the following three libraries:
    1. NumPy: A Python library for mathematical operations on arrays and matrices. It is the fundamental package for numerical computing in Python.
    2. Matplotlib: Python’s 2D plotting library. We import its sub-library pyplot, which is used to draw charts and other visualisations.
    3. Pandas: One of the most widely used Python libraries for importing and managing datasets.
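
As a minimal sketch, the three imports usually look like the following. The aliases np, plt and pd are community conventions rather than requirements, the Agg backend is an assumption made so that the snippet runs without a display, and the small array is made-up data used only to confirm that each library loads.

```python
import numpy as np                 # numerical arrays and maths

import matplotlib
matplotlib.use("Agg")              # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt    # plotting interface

import pandas as pd                # tabular data

# Made-up values, used only to exercise each library once.
values = np.array([1.0, 2.0, 3.0, 4.0])
series = pd.Series(values, name="demo")

fig, ax = plt.subplots()
ax.plot(series.index, series.values)
ax.set_title("demo")
plt.close(fig)                     # release the figure once we are done with it

mean_value = values.mean()
print(mean_value)                  # 2.5
```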

3] Import the desired data

  • Now it is time to import the dataset we have collected for our ML project.
  • Before importing a dataset, we must set the folder that contains it as the current working directory, or pass the file’s full path.
  • To import the data, we will use the pandas read_csv() function, which reads a CSV file into a DataFrame on which we can then perform various operations.
  • This function can read a CSV file from a local path or from a URL.
  • You can download CSV datasets from these sources:
    1. Kaggle
    2. UCI ML Repository
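
The following sketch shows one way read_csv() might be used. To keep it self-contained, it reads from an in-memory string instead of a downloaded file; the column names (Location, Age, Salary, Purchased) and all the values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a downloaded CSV file.
csv_text = """Location,Age,Salary,Purchased
Delhi,44,72000,No
Mumbai,27,48000,Yes
Pune,30,,No
Delhi,,61000,Yes
Mumbai,35,58000,Yes
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)            # (number of rows, number of columns)
print(df.head())           # first rows of the table
print(df.isnull().sum())   # count of missing values per column
```

With a real file, the call would take a path or URL instead, for example pd.read_csv("data.csv"), where data.csv is a placeholder file name.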

4] Find and handle the missing or null values

  • The next step in pre-processing is to deal with missing data in the dataset.
  • If some values in our dataset are missing, they can cause serious problems for the model.
  • For this reason, handling missing values is essential.
  • Strategies for handling missing values include:
    1. By deleting the row or column: This is the most common way to deal with null values. We simply remove the row or column that contains the null or empty values. Because this discards data, it works best when only a few values are missing.
    2. By calculating the mean: We compute the mean of the column (or row) that contains the missing value and put it in place of the missing or empty entries. This applies to numeric data.
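
Both strategies can be sketched with pandas alone. The Age and Salary columns below are hypothetical, and dropna() and fillna() are just one of several ways to do this.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing Salary.
df = pd.DataFrame({
    "Age": [44.0, 27.0, 30.0, np.nan, 35.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0, 58000.0],
})

# Strategy 1: drop every row that contains at least one missing value.
dropped = df.dropna(axis=0)

# Strategy 2: replace each missing value with the mean of its column.
# DataFrame.mean() ignores NaN entries by default.
filled = df.fillna(df.mean())

print(len(dropped))              # 3 rows remain
print(filled.loc[3, "Age"])      # 34.0, the mean of 44, 27, 30 and 35
print(filled.loc[2, "Salary"])   # 59750.0
```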

5] Encode the categorical quantities

  • Categorical data is data whose values fall into a fixed set of categories.
  • For example, our dataset has two categorical variables: Location and Purchased.
  • Because an ML model works entirely on mathematics and numbers, a categorical variable in the dataset can cause problems when building the model.
  • For this reason, these categorical values must be converted to integers. In this step, we will use the LabelEncoder class from scikit-learn.
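
A minimal sketch of this step with scikit-learn's LabelEncoder follows. The Location and Purchased values are hypothetical, and the _code column names are chosen here only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns.
df = pd.DataFrame({
    "Location": ["Delhi", "Mumbai", "Pune", "Delhi", "Mumbai"],
    "Purchased": ["No", "Yes", "No", "Yes", "Yes"],
})

# One encoder per column, so that each mapping can be inverted later.
location_encoder = LabelEncoder()
df["Location_code"] = location_encoder.fit_transform(df["Location"])

purchased_encoder = LabelEncoder()
df["Purchased_code"] = purchased_encoder.fit_transform(df["Purchased"])

# Classes are sorted, so codes follow alphabetical order.
print(location_encoder.classes_.tolist())   # ['Delhi', 'Mumbai', 'Pune']
print(df["Location_code"].tolist())         # [0, 1, 2, 0, 1]
print(df["Purchased_code"].tolist())        # [0, 1, 0, 1, 1]
```

Note that integer codes imply an order (Delhi < Mumbai < Pune) that a model may treat as meaningful. The scikit-learn documentation describes LabelEncoder as intended for target labels such as Purchased, so for an unordered feature such as Location a one-hot encoding may be the safer choice.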

6] Split the whole data into train and test sets

  • At this point, we split our dataset into a training set and a test set.
  • If a model reaches a high accuracy on the data it was trained on but is then given new, unseen data, its performance may drop.
  • For this reason, we aim to build an ML model that performs well on both the training set and the test set.
  • The training set is the portion of the data used to train the model; its outputs are known.
  • The test set is the portion of the data used to evaluate the model: the model predicts the outcomes for the test set, and these predictions are compared with the known values.
  • A common practice is to split the dataset in an 80:20 or 70:30 ratio, where 80 or 70 percent of the data is used for training and the remaining 20 or 30 percent for testing.
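
An 80:20 split might be sketched with scikit-learn's train_test_split as follows. The feature matrix X and labels y are made-up data, and random_state=0 is an arbitrary choice that only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, plus a binary label.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 keeps 20% of the samples for testing (an 80:20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
```

Setting test_size=0.3 instead would give the 70:30 split.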

7] Scale the features

  • Feature scaling is the final step of pre-processing.
  • It is a technique for bringing the dataset’s independent variables into a common, specified range.
  • In feature scaling, we put all of our features on the same scale so that none of them dominates the others.
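
Two common ways to scale features are standardisation and min-max scaling. The sketch below shows both with scikit-learn, using made-up age and salary values. Fitting the scaler on the training set only, and then reusing it on the test set, is a widely used convention assumed here rather than something the steps above prescribe.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up features: column 0 is an age, column 1 is a salary.
X_train = np.array([[25.0, 40000.0], [35.0, 60000.0], [45.0, 80000.0]])
X_test = np.array([[30.0, 50000.0]])

# Standardisation: each column is rescaled to mean 0 and unit variance.
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)   # learn mean and std from training data
X_test_std = std_scaler.transform(X_test)         # reuse the training statistics

# Min-max scaling: each column is mapped onto the range [0, 1].
mm_scaler = MinMaxScaler()
X_train_mm = mm_scaler.fit_transform(X_train)     # rows become 0, 0.5 and 1 in both columns
X_test_mm = mm_scaler.transform(X_test)           # the test row becomes 0.25 in both columns

print(X_train_std)
print(X_train_mm)
print(X_test_mm)
```

After either transformation, age and salary are on comparable scales, so the much larger salary values no longer dominate.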
