Data Wrangling – Data Analysis with Python

In this blog, we will be discussing data wrangling, also called data preprocessing. In the previous blog on data analysis with Python, we talked about understanding data, the foundation of everything related to machine learning and artificial intelligence. We discussed how to make sense of data, import it, access it from databases, and which Python packages are useful for these purposes. Today we will talk about preparing data for further analysis.

Data wrangling, or data preprocessing, is the process of converting data from its raw form into a format suitable for further processing. The process is often called data cleaning as well.

Let’s take a look at methods that form the process of data preprocessing.

Missing Values:

When a record has no data value for a feature, we say that the value is missing. Missing values in a dataset are often shown as 0, a question mark, or a blank space. Sometimes NaN (Not a Number) also appears in place of a missing value.

The following three approaches are used to handle missing values, regardless of the programming language. Each scenario calls for a different approach, so it should be chosen accordingly.

The first approach is to recover the missing values from the original source: the person who collected the data goes back and tries to obtain the values that are missing from the dataset.

The second approach drops the records with missing values. Dropping data can mean dropping the whole feature or only the records with missing values. If the feature has only a few missing values, the better option is to remove just those records. In general, we go for the approach that has the least impact on the data as a whole.

The third approach is to replace the missing values with substitutes. This way no data is lost, but there is a risk of introducing inaccurate values, since the substitutes are educated guesses based on reasoning and calculation, not the original values. One common technique is to fill the missing entries with the average of the entire feature column. If the column contains non-numerical or categorical data, we can instead use the value that appears most frequently, known as the mode.

In the car dataset we were using in the previous blog, missing values in the normalized-losses column can be replaced with the column's average, whereas in the fuel-type attribute the mode of the column can fill in the missing values.

Other knowledge about the dataset can also help in filling missing values. For example, if the data collectors know that normalized losses are higher for older cars than for new ones, they can use a car's age to choose a replacement value.

Lastly, sometimes missing values are simply left as they are. For one reason or another, we may prefer to keep the records even if a feature is missing.

Now, let’s see how we can drop and replace values in our dataset using Python.


dropna() is a built-in method in the Pandas library that removes records with missing data values. With dropna(), a whole feature can be removed from the dataset, or only the specific records that have missing values. The argument axis=0 removes rows, while axis=1 removes the whole column. The argument inplace=True applies the changes to the data frame directly.

In our car dataset, the price attribute has some missing values. Since price is the value we want to predict, we cannot simply drop the whole column. Instead, we drop the rows that have missing price values.
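
A minimal sketch of how this might look, assuming the dataset is already loaded into a DataFrame named df (as in the previous blog) and that missing entries appear as question marks:

```python
import numpy as np
import pandas as pd

# Convert the "?" placeholders into NaN so Pandas recognizes them as missing.
df.replace("?", np.nan, inplace=True)

# Drop only the rows where "price" is missing; axis=0 removes rows,
# and inplace=True applies the change to df directly.
df.dropna(subset=["price"], axis=0, inplace=True)
```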

replace() is another built-in Pandas method for substituting data values in the records. In the following code, we replace missing values in the normalized-losses attribute with the average of all normalized-losses values.
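
A sketch of that step, assuming df is the car DataFrame from above with "?" already converted to NaN:

```python
import numpy as np

# Compute the column average (NaN values are skipped automatically).
mean_loss = df["normalized-losses"].astype("float").mean()

# Replace the missing values with the computed average.
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean_loss)
```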


The Pandas website is a good place to learn more about these methods in detail.

Data Formatting:

Data is often collected from different sources, which means it arrives in different formats. Data formatting means converting the data into a single, common format so that meaningful analysis can be done. As part of data wrangling, formatting makes data consistent and easy to understand.

For example, in the car dataset there is a city-mpg column, which gives fuel consumption in miles per gallon. We may want to convert these values to a metric unit such as litres per 100 km. To do this, we divide 235 by the city-mpg value for every record.

This can be done in Python as follows.
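
A rough sketch, assuming df holds the car dataset and the city-mpg column is numeric:

```python
# Convert miles per gallon to litres per 100 km using the 235 / mpg rule,
# then rename the column to reflect the new unit.
df["city-mpg"] = 235 / df["city-mpg"]
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)
```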

It is also possible that, while importing data, values are loaded with unexpected datatypes. It is good practice to check the datatype of each feature and change it to the appropriate type wherever needed. This prevents inaccurate analysis later on and unexpected behaviour of the overall data model.

The dtypes attribute of a Pandas data frame shows the data type of each feature in the dataset. The astype() method converts a column from one datatype to another, for example from int to float in the case of the price attribute. We discussed these in the previous blog.
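
A quick sketch, again assuming df is the car DataFrame:

```python
# Inspect the datatype of every column; dtypes is an attribute, not a method.
print(df.dtypes)

# Convert the "price" column to float so numerical analysis behaves as expected.
df["price"] = df["price"].astype("float")
```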

Data Normalization:

Normalizing data means bringing data values into a consistent range. Normalization makes the statistical analysis of data much easier, ensures that every feature has an equal impact on the model, and also helps with computation. Suppose we have two features, age and income. Values for age range from 0 to 100, while income values range from 0 to 20,000, so the two attributes are spread over very different ranges. When we train a model such as linear regression, the income feature will automatically have a larger influence, even though it is not necessarily a more important predictor; its larger values alone will sway the model. To avoid this, we can normalize the values of all features to a common range, for example scaling both age and income to between 0 and 1, so that the two features affect the model equally.

There are many techniques to normalize data. Let’s discuss a few of them.

The Simple Feature Scaling technique divides each feature value by the maximum value of that feature, so the resulting values range between 0 and 1.

The Min-Max method takes each value, subtracts the minimum feature value from it, and then divides by the feature's total range (maximum minus minimum). Again, the resulting values lie between 0 and 1.

The Z-score, or standard score, subtracts the feature's average from each value and then divides by the feature's standard deviation.

The average is denoted by mu (µ) and the standard deviation by sigma (σ). The resulting values typically fall between -3 and 3.

Let’s apply these normalization methods to the length feature in our car dataset. In Python, it can be done as follows.
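
A sketch of all three techniques on the length column; in practice you would pick just one, and the new column names here are purely illustrative so the methods can be compared side by side:

```python
# Simple feature scaling: divide by the maximum.
df["length_scaled"] = df["length"] / df["length"].max()

# Min-max scaling: subtract the minimum, divide by the range.
df["length_minmax"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

# Z-score: subtract the mean, divide by the standard deviation.
df["length_zscore"] = (df["length"] - df["length"].mean()) / df["length"].std()
```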


Data Binning:

Next in data wrangling, let's talk about data binning. Binning groups data values into 'bins' or categories. For example, task difficulty can be binned as easy, medium, and hard, and age values can be binned into ranges such as 0-15, 16-18, and so on. Binning data like this helps in understanding it better and can sometimes improve the accuracy of the models built on it.

In our car dataset, the price ranges from 5188 to 45400 with 201 unique values. We can apply binning to group the cars into low, medium, and high price categories.

In Python, we perform this binning by taking 4 equally spaced numbers as dividers, since we need 3 bins of equal width.

The NumPy function ‘linspace‘ returns 4 numbers, equally spaced over the range of price values.

The Pandas function ‘cut‘ then segments the price data into these 3 bins.
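
Putting the two together, a rough sketch (assuming df and a numeric price column; the bin labels and the "price-binned" column name are just illustrative):

```python
import numpy as np
import pandas as pd

# Four equally spaced dividers give three bins of equal width.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# cut() assigns each price to one of the three bins.
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
```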


We can also plot the price data values to visualize the distribution of price over the three bins.
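
One way to do this, assuming the "price-binned" column from the sketch above and that matplotlib is installed:

```python
import matplotlib.pyplot as plt

# Bar chart showing how many cars fall into each price bin.
df["price-binned"].value_counts().plot(kind="bar")
plt.xlabel("Price bins")
plt.ylabel("Number of cars")
plt.title("Price distribution over the three bins")
plt.show()
```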


Converting Categorical Variables to Quantitative Variables:

Most statistical models take numerical values as inputs rather than strings or objects. Let's see how to convert string and object data values into quantitative values in Python.

Take the ‘fuel-type’ feature in our car dataset, which has two unique values, gas and diesel. We create a new feature column for each unique value of the feature we want to encode, so here we make two new columns, gas and diesel. For each record, the column that matches its fuel-type is set to 1 and the other is set to 0. This technique is also known as ‘one-hot encoding’.

The Pandas ‘get_dummies()’ method converts a categorical feature into dummy variables.
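
A minimal sketch, assuming df contains the fuel-type column:

```python
import pandas as pd

# One-hot encode "fuel-type": creates "gas" and "diesel" indicator columns.
dummies = pd.get_dummies(df["fuel-type"])

# Attach the new columns and drop the original categorical one.
df = pd.concat([df, dummies], axis=1)
df.drop("fuel-type", axis=1, inplace=True)
```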

Wrap-Up:

In this blog on data wrangling, we discussed data preprocessing. The methods involved in data wrangling or data cleaning, such as handling missing values, data formatting, and data normalization, help in the further analysis and processing of data. In a nutshell, we are now familiar with a prerequisite of data analysis that will help us use data effectively.

What do you think? Tell us in the comments! If you like what you read, share it! 🙂
