Model development in Machine Learning using Python

In this blog, we are going to talk about model development for a machine learning problem. For this purpose, we will use a car dataset, the same one we used in our previous blogs on data analysis with Python, and develop a model to predict the price of a car. We will go through different modeling methods such as linear regression and multiple linear regression (polynomial regression comes in the next part), as well as model visualization and evaluation.

What is a Model in Machine Learning?

A model in ML can be seen as a mathematical equation used to predict a value, given one or more other values. A model is also known as an estimator.

It relates one or more independent features, or variables, to a dependent variable.

For example, we can take a car’s highway miles per gallon as the independent variable and have the model predict the dependent variable: the car’s price.

The more relevant data we have as input to the model, the more accurate the model will be. For this purpose, we can also input multiple relevant features.

Let’s try to understand this with an example. We know from statistics that pink cars sell for significantly lower prices. If we do not take color as an independent variable, our model will predict the same price for cars that actually sell for less. So, the more accurate and useful data we have, the more accurate the model’s predictions will be.

Now, let’s talk about different model development methods.

Simple Linear Regression:

Simple linear regression (SLR) uses a single independent variable to predict the dependent variable.

SLR helps us understand the relationship between two variables: the independent predictor variable x and the dependent target variable y. We will try to build a linear relationship between the two:

y_hat = b_0 + b_1 * x

Here, b_0 is the intercept and b_1 is the slope. Training the model on the data yields these two parameters.

We can get the values of highway miles per gallon from the user manual and then predict the price of the car from this data. We assume there is a linear relationship between these variables when building a model to predict car prices.

To build this straight line, we take data points from the dataset (shown as red dots in the scatter plot). We then use this data to train, or fit, the model. The result of training is the set of parameters.

When using Python for model development, we store these data points in NumPy arrays or Pandas data frames. The target (to-be-predicted) values are stored in an array, and the independent values are stored in a data frame or an array.

Each sample corresponds to a different row in the data frame or array.
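As a minimal sketch, with made-up numbers standing in for the car dataset, the target and features can be stored like this:

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for the car dataset.
# Target (to-be-predicted) values go in a 1-D array:
y = np.array([13495.0, 16500.0, 13950.0])

# Independent values go in a data frame: one column per feature,
# one row per sample:
X = pd.DataFrame({"highway-mpg": [27, 26, 30]})
```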

Uncertainty in a model:

Many factors influence a car’s price, such as its make or its age. Such uncertainty can be taken into account by adding a small random value to each data point on the line. This is the noise in the model.

In the distribution above, the y-axis shows the value added, and the x-axis shows the probability that the value will be added. Sometimes the value is positive, sometimes negative; occasionally it is large, but most of the time a small value, near zero, is added.
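This can be sketched by adding Gaussian noise to a line; the intercept, slope, and noise scale below are made-up values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

b0, b1 = 38000.0, -800.0       # hypothetical intercept and slope
mpg = np.linspace(15, 45, 50)  # independent variable

# Gaussian noise: usually small and near zero, occasionally larger,
# positive about as often as negative
noise = rng.normal(loc=0.0, scale=1500.0, size=mpg.size)

price = b0 + b1 * mpg + noise  # noisy data points scattered around the line
```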

To summarize, the model development follows these steps:

1. A set of data points is taken as training data.
2. The model is trained, or fit, on this training data to obtain the parameters.
3. These parameters are then used in the model to make predictions.
4. The model is then used to predict unseen values.

Usually, the predicted value differs from the actual value. For example, 20 miles per gallon may not appear in the training data, while 10 miles per gallon does; comparing the prediction at 10 miles per gallon with the actual value, we see that the two differ. If the linear assumption is believed to be true, this difference is due to noise, but there can be other reasons.

Following are the coding steps to fit a linear regression model in Python.

1. Import the linear regression model from Sklearn and create a linear regression object using its constructor.
2. Define the target and predictor variables.
3. Use the fit method to fit the model and obtain the parameters b_0 and b_1.

The inputs to fit are the features and targets. Prediction is done with the predict method; the output array has the same number of samples as the input.

The intercept b_0 and the slope b_1 are attributes of the object lm.
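The three steps look like this in code; the numbers are synthetic, perfectly linear data standing in for the car dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Step 2: define predictor and target (synthetic, exactly linear data:
# price = 42000 - 1000 * highway-mpg)
X = np.array([[10], [15], [20], [25], [30]])       # highway-mpg; 2-D: one column per feature
y = np.array([32000, 27000, 22000, 17000, 12000])  # price

# Step 1: create the linear regression object using the constructor
lm = LinearRegression()

# Step 3: fit the model to obtain b_0 and b_1
lm.fit(X, y)

print(lm.intercept_)  # b_0 (about 42000 for this data)
print(lm.coef_)       # b_1 (about -1000 for this data)

# predict returns one value per input sample
yhat = lm.predict(X)
```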

The relationship between price and highway miles per gallon is given by the equation:

price = b_0 + b_1 * highway-mpg

Multiple Linear Regression:

Next in model development in machine learning is multiple linear regression. This model gives the relationship between one continuous target variable and two or more independent predictor variables. For four predictor variables, the relationship is as follows:

y_hat = b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + b_4 * x_4

Say we have two predictor variables. We can visualize them in a 2-D plane: each pair of values of the predictor variables x_1 and x_2 is mapped to a point, and the predicted value y_hat is plotted in the vertical direction, with height equal to the value of y_hat.

The following coding steps train a multiple linear regression model.

1. Store the predictor variables in some variable z.
2. Fit the model using the fit method, passing the feature variables and the target.
3. Use the predict method to get predictions.

In this case, the input is an array or a data frame with four columns, where the number of rows corresponds to the number of samples. The output is an array with the same number of elements as the number of training samples.

The intercept and coefficients are attributes of the linear model object.
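A sketch of these steps with hypothetical rows standing in for the car dataset; the four predictor columns below (horsepower, curb-weight, engine-size, highway-mpg) are an assumed choice, and any four numeric features work the same way:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical rows standing in for the car dataset
df = pd.DataFrame({
    "horsepower":  [111, 154, 102, 115, 110, 140],
    "curb-weight": [2548, 2823, 2337, 2824, 2507, 2844],
    "engine-size": [130, 152, 109, 136, 131, 131],
    "highway-mpg": [27, 26, 30, 22, 25, 25],
    "price":       [13495, 16500, 13950, 17450, 15250, 17710],
})

# Step 1: store the predictor variables in z
z = df[["horsepower", "curb-weight", "engine-size", "highway-mpg"]]

# Step 2: fit the model (scikit-learn's method is fit)
lm = LinearRegression()
lm.fit(z, df["price"])

# Step 3: predict, one output element per row of z
yhat = lm.predict(z)

print(lm.intercept_)  # b_0
print(lm.coef_)       # [b_1, b_2, b_3, b_4]
```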

Model Evaluation:

Now, let’s visualize our model to evaluate it better. An important part of model development in machine learning is evaluation. Regression plots tell us about the relationship between the variables: their correlation and strength. The horizontal axis is the independent variable and the vertical axis is the dependent variable. Each point represents a target value, and the fitted line represents the predicted values.

Below is the implementation of a regression plot using a Seaborn library.
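A minimal sketch with synthetic data in place of the car dataset; Seaborn’s regplot draws both the scatter points and the fitted line:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the car dataset
rng = np.random.default_rng(0)
mpg = rng.uniform(15, 50, 100)
price = 45000 - 800 * mpg + rng.normal(0, 2000, 100)
df = pd.DataFrame({"highway-mpg": mpg, "price": price})

# Regression plot: scatter of the data plus the fitted line
ax = sns.regplot(x="highway-mpg", y="price", data=df)
ax.set_ylim(0)
# plt.show() to display the figure
```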

The residual plot shows the difference between the actual value and the predicted value, that is, the error. We plot that value on the vertical axis and the independent variable on the horizontal axis, and the process is repeated for all predicted values.

The plot shows values evenly distributed around the x-axis, with zero mean, equal variance, and no curvature. Such a residual plot shows that our linear model was developed appropriately.

The residual plot above, by contrast, shows a curvature: the error values change with x. They are positive in the beginning, then negative, and in the end the difference is large. This residual plot shows that our assumption of a linear relationship is incorrect; the plot indicates a non-linear relation.

A residual plot can be implemented using the Seaborn library as follows.
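A sketch with synthetic data; residplot fits a regression internally and plots the residuals on the vertical axis:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the car dataset
rng = np.random.default_rng(1)
mpg = rng.uniform(15, 50, 150)
price = 45000 - 800 * mpg + rng.normal(0, 2500, 150)
df = pd.DataFrame({"highway-mpg": mpg, "price": price})

# Residuals on the y-axis, independent variable on the x-axis
ax = sns.residplot(x="highway-mpg", y="price", data=df)
# plt.show() to display the figure
```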

A distribution plot shows the counts of the predicted values against the counts of the actual values. These plots are usually more helpful when there is more than one independent variable.

On the plot above, we examine the y-axis: we first count the predicted values equal to one and plot them, then count and plot the predicted values equal to two, and repeat the process for all predicted values. The same process is then repeated for the target values, which in this case are almost all equal to two.

The predicted and target values are continuous, while a histogram is for discrete values. Seaborn therefore converts them into a smooth distribution (a kernel density estimate).

Using Seaborn, a distribution plot is made as follows.

So, this is it for linear regression model development in machine learning. We learned how to implement simple and multiple linear regression models using Python, and how to visualize them for better evaluation.