Thakkar Vedang

Data-Science Series - Data Pre-processing Tasks Using Python

Because today’s datasets are so rich, adding more features to the model makes it more complex, and the model may end up overfitting the data. Some features are mostly noise, which can harm the model.


The model may generalise better if those irrelevant features are removed. The scikit-learn documentation describes various feature selection strategies. We will compare the performance of several of these approaches on the same data set.


The ‘Iris’ dataset from the sklearn.datasets package is used for the data-reduction experiments.
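Loading it takes one call (a minimal sketch; the variable names are mine):

from sklearn.datasets import load_iris

# Load the Iris dataset: 150 samples, 4 numeric features, 3 classes
X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4)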



There are four distinct features in the data. We add extra noisy features to the data set to see how well the different feature selection approaches recover the informative ones.
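One way to do this (a sketch; the exact noise used in the original experiment isn’t shown) is to append uniformly random columns:

import numpy as np

rng = np.random.RandomState(42)

# Append 10 random, uninformative columns to the 4 real features
noise = rng.uniform(size=(X.shape[0], 10))
X = np.hstack((X, noise))
print(X.shape)  # (150, 14)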



There are now 14 features in the dataset. Before applying any feature selection approach we must split the data, because features should be chosen using only the training set, not the entire data set. To evaluate both the feature selection and the model, we set aside a portion of the data as a test set; that way the test data stays hidden while we select features and train the model.
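A sketch of the split with scikit-learn; the 70/30 ratio and the random seed are assumptions:

from sklearn.model_selection import train_test_split

# Hold out a test set BEFORE selecting features, so the test data
# never influences which features are chosen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)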


Variance Threshold

The Variance Threshold approach to feature selection is a simple baseline strategy: all features whose variance does not reach a given threshold are removed. By default it eliminates only zero-variance features, and because our dataset has no zero-variance feature, the results are unaffected.
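A minimal sketch of applying it with scikit-learn:

from sklearn.feature_selection import VarianceThreshold

# The default threshold of 0.0 removes only constant (zero-variance)
# columns; none of our 14 columns are constant, so all of them survive
selector = VarianceThreshold()
X_train_vt = selector.fit_transform(X_train)
print(X_train_vt.shape)  # (105, 14) -- nothing was removed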




Univariate Feature Selection

  • In univariate feature selection, the best features are chosen using univariate statistical tests.

  • Each feature is compared against the target variable to check whether there is a statistically significant association between them.

  • We disregard the other features while analysing the link between one feature and the target variable; that is why the method is called “univariate”.

  • Each feature receives its own test score.

  • Finally, all of the test scores are compared, and the features with the highest scores are kept, as in the sketch after this list.
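A minimal sketch of this with scikit-learn’s SelectKBest; the ANOVA F-test (f_classif) as the scoring function and the choice of k=4 are my assumptions:

from sklearn.feature_selection import SelectKBest, f_classif

# Score every feature independently against the target with the ANOVA
# F-test and keep the k highest-scoring ones
selector = SelectKBest(f_classif, k=4)
selector.fit(X_train, y_train)

X_train_uni = selector.transform(X_train)
X_test_uni = selector.transform(X_test)  # same selection applied to the test set
print(selector.scores_)                  # one F-score per feature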





Recursive Feature Elimination (RFE)

Recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features, given an external estimator that assigns weights to features (e.g., the coefficients of a linear model). The estimator is first trained on the full set of features, and the importance of each feature is read from either the coef_ or the feature_importances_ attribute; the least important features are then pruned, and the process repeats until the desired number of features remains.
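A hedged sketch of RFE on our split data; the logistic-regression estimator and the target of four features are assumed choices, not necessarily what the original run used:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Fit the estimator, drop the weakest feature (smallest |coef_|),
# and repeat until only n_features_to_select features remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X_train, y_train)

print(rfe.support_)  # boolean mask of the kept features
print(rfe.ranking_)  # 1 = kept; larger numbers were eliminated earlier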



Differences Between Before and After Using Feature Selection

Before: [classification report of the model trained on all 14 features]

After: [classification report of the model trained on the selected features only]




There are clear differences in precision, recall, F1-score and accuracy between the two outputs, which shows how feature selection can improve the model’s performance.
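For reference, the two reports can be produced along these lines (the logistic-regression classifier and the reuse of the univariately selected features are my assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression(max_iter=1000)

# Before: train and evaluate with all 14 features
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# After: train and evaluate with only the selected features
clf.fit(X_train_uni, y_train)
print(classification_report(y_test, clf.predict(X_test_uni)))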


Principal Component Analysis (PCA)


One way to speed up the fitting of a machine learning algorithm is to change the optimization technique; a more common way is to reduce the dimensionality of the input with Principal Component Analysis (PCA).


In many machine learning applications it helps to be able to see your data, and data in two or three dimensions is easy to visualise. The Iris dataset used in this post is four-dimensional, so we’ll use PCA to compress it from four dimensions down to two or three, plot it, and perhaps understand it better.


So, let’s now run PCA on the Iris dataset for visualization.



PCA to 2D Projection:

The original data has four columns (sepal length, sepal width, petal length, and petal width). The code in this part projects the four-dimensional data down to two dimensions; the new components represent the two main axes of variation.
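A sketch of the projection; standardising the features before PCA is a common choice that I am assuming here:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Standardise the four original features, then project onto 2 components
X_std = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2)
components = pca.fit_transform(X_std)

# Share of the total variance captured by each principal component
print(pca.explained_variance_ratio_)

# Scatter plot of the 2D projection, coloured by class
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    plt.scatter(components[mask, 0], components[mask, 1], label=name)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()

A 3D projection works the same way with n_components=3 and a 3D scatter plot.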



In this blog post, I compared and contrasted the results of various feature selection approaches on the same data.


Compared with training on all of the features, the model performs better when it is trained only on the features that remain after feature selection.


After feature selection, PCA was used to visualise the dataframe in 2D and 3D using the reduced components.
