Used Car Price Predictor (EDA and Model Building)

 

Exploratory Data Analysis (EDA) and Model Building for Used Car Price Prediction

With the dataset cleaned and preprocessed, it's time to take the next step—Exploratory Data Analysis (EDA) and model building. In this blog, we will begin by exploring the dataset to uncover valuable insights and identify relationships between key features, such as year, mileage, fuel type, and brand, and their impact on car prices.

Through visualizations and statistical analysis, we'll identify trends and correlations that will guide the construction of a machine learning model. After the EDA, we'll move on to building a predictive model that can accurately estimate the price of a used car based on its attributes.

Exploratory Data Analysis

Let's create a boxplot to visualize the distribution of used car prices for each company in the dataset:

import matplotlib.pyplot as plt
import seaborn as sns

plt.subplots(figsize=(15, 7))
ax = sns.boxplot(x='company', y='Price', data=cars)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha='right')  # tilt long company names
plt.show()


Price Variability: Some companies show a wide range of car prices, while others
have a much narrower spread.
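To put numbers on that spread, a quick groupby summary works well. The snippet below is a minimal sketch using a few hypothetical rows in place of the real `cars` DataFrame from the previous post:

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned `cars` DataFrame
cars = pd.DataFrame({
    'company': ['Maruti', 'Maruti', 'BMW', 'BMW', 'Hyundai'],
    'Price': [200000, 300000, 1500000, 2500000, 400000],
})

# min/median/max per company quantify the spread the boxplot shows visually
spread = cars.groupby('company')['Price'].agg(['min', 'median', 'max'])
print(spread)
```

On the real dataset, companies with a large max-to-min gap are the ones whose boxes and whiskers stretch furthest in the plot.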

Let's create a stripplot to visualize the relationship between the year of the car and
its price in the dataset:

plt.subplots(figsize=(20, 10))
ax = sns.stripplot(x='year', y='Price', data=cars)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha='right')
plt.show()

Price Trends: You can observe how prices tend to change over the years, helping to
identify whether newer cars generally have higher prices.
Data Clusters: You might see certain years with more concentrated data points or other
years with sparse distribution.
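Both observations can be checked numerically. This sketch uses made-up rows in place of the full `cars` DataFrame:

```python
import pandas as pd

# Made-up rows in place of the full `cars` DataFrame
cars = pd.DataFrame({
    'year': [2015, 2015, 2015, 2018, 2019],
    'Price': [250000, 300000, 280000, 500000, 650000],
})

# value_counts exposes the clusters (years with many listings), and the
# yearly median confirms whether newer cars fetch higher prices
counts = cars['year'].value_counts()
medians = cars.groupby('year')['Price'].median()
print(counts)
print(medians)
```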

Let's create a scatter plot to examine the relationship between kilometers driven
(kms_driven) and price (Price) of the cars in the dataset:

sns.relplot(x='kms_driven',y='Price',data=cars,height=7,aspect=1.5)
plt.show()

Price and Mileage Relationship: It is likely that cars with fewer kilometers driven will
have higher prices, as they are perceived to be less used.
Spread: A large spread of data points might indicate a less clear or complex relationship,
suggesting further feature engineering may be needed.
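A Pearson correlation coefficient condenses the scatter plot into a single number. The illustrative figures below stand in for the real `cars` DataFrame:

```python
import pandas as pd

# Illustrative numbers; the real `cars` DataFrame comes from the previous post
cars = pd.DataFrame({
    'kms_driven': [45000, 30000, 70000, 15000, 90000],
    'Price': [300000, 420000, 250000, 550000, 180000],
})

# A value near -1 means higher mileage tracks closely with lower prices;
# a value near 0 would support the "less clear relationship" reading above
corr = cars['kms_driven'].corr(cars['Price'])
print(round(corr, 3))
```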

Let's create a boxplot to examine the distribution of car prices based on fuel type in
the dataset:

plt.subplots(figsize=(14, 7))
sns.boxplot(x='fuel_type', y='Price', data=cars)
plt.show()

Price Distribution by Fuel Type: The boxplot reveals how prices are distributed across
different fuel types. If one fuel type has a wider range or higher median prices, it
could indicate that certain fuel types are more expensive on average.
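The exact medians behind the boxplot are easy to pull out. Here is a sketch with toy data standing in for the `cars` DataFrame:

```python
import pandas as pd

# Toy data standing in for the `cars` DataFrame
cars = pd.DataFrame({
    'fuel_type': ['Petrol', 'Diesel', 'Petrol', 'Diesel', 'LPG'],
    'Price': [250000, 600000, 300000, 550000, 150000],
})

# Median price per fuel type complements the boxplot with exact central values
medians = cars.groupby('fuel_type')['Price'].median().sort_values(ascending=False)
print(medians)
```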

Model Building
X = cars.drop(columns = 'Price')
Y = cars['Price']

We prepared the dataset for model training by separating the features and the target
variable. The features (independent variables), which include attributes such as company,
year, kms_driven, and fuel_type, were stored in the variable X. The target variable
(dependent variable), which is the car's price, was assigned to the variable Y.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)  # fixed seed so the split is reproducible

To prepare for model training, we split the dataset into training and testing sets using
the train_test_split function from Scikit-learn. The data was divided so that 80% of the
data was used for training the model (X_train and y_train), while the remaining 20% was
set aside for testing and evaluating the model’s performance (X_test and y_test). This
ensures that we can assess the model’s ability to generalize to unseen data.

Importing the dependencies:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error
from xgboost import XGBRegressor

ohe = OneHotEncoder()
ohe.fit(X[['name', 'company', 'fuel_type']])

column_trans = make_column_transformer(
    (OneHotEncoder(categories=ohe.categories_), ['name', 'company', 'fuel_type']),
    remainder='passthrough'
)

To handle categorical variables effectively, we used OneHotEncoder from
Scikit-learn. The OneHotEncoder was applied to the name, company, and fuel_type
columns in the dataset, which contain categorical data. This encoding transforms
the categorical variables into a format that can be used by machine learning models
by creating a binary column for each unique category.

Next, we used make_column_transformer to apply the encoding only to the specified
columns (name, company, and fuel_type) while leaving the remaining columns
untouched. This ensures that the categorical features are properly transformed,
while the numeric features pass through unchanged, making the data ready for model
training.
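A tiny, hypothetical column makes the encoder's output concrete:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A tiny, hypothetical column to show what the encoder produces
fuel = pd.DataFrame({'fuel_type': ['Petrol', 'Diesel', 'Petrol']})

ohe = OneHotEncoder()
encoded = ohe.fit_transform(fuel).toarray()  # default output is sparse

print(ohe.categories_)  # one sorted array of categories per input column
print(encoded)          # one binary column per category
```

Each row gets a 1 in the column matching its category and 0 elsewhere, which is exactly the representation the regressors below can consume.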

Instantiating the model objects:
scaler = StandardScaler()
lr = LinearRegression()
la = Lasso()
ri = Ridge()
xgb = XGBRegressor()
rfr = RandomForestRegressor()
ada = AdaBoostRegressor()
gbr = GradientBoostingRegressor()
dt = DecisionTreeRegressor()

Creating the dictionary of the above objects:

Regressors = {
    'Linear Regression' : lr,
    'Lasso' : la,
    'Ridge': ri,
    'XGBRegressor': xgb,
    'RandomForestRegressor': rfr,
    'AdaBoostRegressor': ada,
    'Gradient Boosting Regressor': gbr,
    'Decision Tree Regressor': dt
}

Applying a for loop to run each of the algorithms above and find the best model:
for name, model in Regressors.items():
    pipe = make_pipeline(column_trans, model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print("For", name)
    print("R2 Score:", r2_score(y_test, y_pred))
    print("MSE:", mean_squared_error(y_test, y_pred))
    print("=" * 50)

And these are the results:
For Linear Regression:           R2 Score 0.7687   MSE 3.3970e+10
For Lasso:                       R2 Score 0.7495   MSE 3.6792e+10
For Ridge:                       R2 Score 0.1523   MSE 1.2451e+11
For XGBRegressor:                R2 Score 0.7600   MSE 3.5254e+10
For RandomForestRegressor:       R2 Score 0.8296   MSE 2.5024e+10
For AdaBoostRegressor:           R2 Score 0.2881   MSE 1.0456e+11
For Gradient Boosting Regressor: R2 Score 0.7866   MSE 3.1341e+10
For Decision Tree Regressor:     R2 Score 0.6396   MSE 5.2929e+10
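The loop only prints the metrics; a small extension could collect them and pick the winner programmatically. The sketch below hard-codes a few of the R² scores from the run above purely for illustration:

```python
# Hard-coded R2 scores from the run above, for illustration only
scores = {
    'Linear Regression': 0.7687,
    'Ridge': 0.1523,
    'RandomForestRegressor': 0.8296,
    'Gradient Boosting Regressor': 0.7866,
}

# Pick the model name with the highest R2 score
best_name = max(scores, key=scores.get)
print(best_name)
```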

As the Random Forest Regressor has the highest R² score, that model was selected and
a pipeline was built:

pipe = make_pipeline(column_trans, rfr)
pipe.fit(X_train, y_train)

The pipeline was serialized with the pickle module so it can be loaded in the web
application to make predictions:

import pickle
pickle.dump(pipe,open("Model.pkl",'wb'))
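To sanity-check the dump, the saved file can be reloaded and used for a prediction, exactly as the web application will do. This is a minimal sketch: it uses a stand-in LinearRegression pipeline and made-up rows, since the real `pipe` holds the fitted column transformer and Random Forest:

```python
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Stand-in pipeline with made-up rows; the real `pipe` holds the fitted
# column transformer and Random Forest from the steps above
X = pd.DataFrame({'year': [2015, 2018, 2020], 'kms_driven': [60000, 30000, 10000]})
y = pd.Series([300000, 500000, 700000])
pipe = make_pipeline(LinearRegression()).fit(X, y)

with open('Model.pkl', 'wb') as f:
    pickle.dump(pipe, f)

# What the web application will do: reload the pipeline and predict
with open('Model.pkl', 'rb') as f:
    loaded = pickle.load(f)

pred = loaded.predict(pd.DataFrame({'year': [2019], 'kms_driven': [20000]}))
print(pred)
```

Because the whole pipeline is pickled, the web application only needs the raw user inputs; the encoding and the model run inside the loaded object.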

Conclusion

In this blog, we successfully completed the Exploratory Data Analysis (EDA) and built a
predictive model for used car price estimation. By visualizing the relationships between
features such as the car's company, year, fuel type, and kilometers driven, we gained
valuable insights into how these factors influence car prices.

Through EDA, we identified trends, price distributions, and potential outliers, which helped
in preparing the data for model building. We then trained a variety of regression models,
including Linear Regression, Lasso, Ridge, Random Forest, XGBoost, and others, to predict car
prices. After evaluating model performance, the Random Forest Regressor was chosen as the
best-performing model, achieving the highest R² score and the lowest Mean Squared Error (MSE).

Finally, we wrapped the model into a pipeline and saved it using the pickle module, making it
ready for deployment in a web application where it can predict used car prices based on user
inputs. The next step is to deploy this model as a functional web application, where users
can input car features and receive price predictions.

