Used Car Price Predictor (EDA and Model Building)

 

Exploratory Data Analysis (EDA) and Model Building for Used Car Price Prediction

With the dataset cleaned and preprocessed, it's time to take the next step—Exploratory Data Analysis (EDA) and model building. In this blog, we will begin by exploring the dataset to uncover valuable insights and identify relationships between key features, such as year, mileage, fuel type, and brand, and their impact on car prices.

Through visualizations and statistical analysis, we'll identify trends and correlations that will guide the construction of a machine learning model. After the EDA, we'll move on to building a predictive model that can accurately estimate the price of a used car based on its attributes.

Exploratory Data Analysis

Let's create a boxplot to visualize the distribution of used car prices for each company in the dataset:

import matplotlib.pyplot as plt
import seaborn as sns

plt.subplots(figsize=(15, 7))
ax = sns.boxplot(x='company', y='Price', data=cars)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha='right')  # tilt long company names
plt.show()


Price Variability: Some companies show a wide range of car prices, while others
have a much narrower spread.
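To put numbers on that spread, a quick groupby summary works well. The snippet below is a minimal sketch using a few hypothetical rows in place of the real `cars` DataFrame from the previous post:

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned `cars` DataFrame
cars = pd.DataFrame({
    'company': ['Maruti', 'Maruti', 'BMW', 'BMW', 'Hyundai'],
    'Price': [200000, 300000, 1500000, 2500000, 400000],
})

# min/median/max per company quantify the spread the boxplot shows visually
spread = cars.groupby('company')['Price'].agg(['min', 'median', 'max'])
print(spread)
```

On the real dataset, companies with a large max-to-min gap are the ones whose boxes and whiskers stretch furthest in the plot.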

Let's create a stripplot to visualize the relationship between the year of the car and
its price in the dataset:

plt.subplots(figsize=(20, 10))
ax = sns.stripplot(x='year', y='Price', data=cars)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha='right')
plt.show()

Price Trends: You can observe how prices tend to change over the years, helping to
identify whether newer cars generally have higher prices.
Data Clusters: You might see certain years with more concentrated data points or other
years with sparse distribution.
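Both observations can be checked numerically. This sketch uses made-up rows in place of the full `cars` DataFrame:

```python
import pandas as pd

# Made-up rows in place of the full `cars` DataFrame
cars = pd.DataFrame({
    'year': [2015, 2015, 2015, 2018, 2019],
    'Price': [250000, 300000, 280000, 500000, 650000],
})

# value_counts exposes the clusters (years with many listings), and the
# yearly median confirms whether newer cars fetch higher prices
counts = cars['year'].value_counts()
medians = cars.groupby('year')['Price'].median()
print(counts)
print(medians)
```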

Let's create a scatter plot to examine the relationship between kilometers driven
(kms_driven) and price (Price) of the cars in the dataset:

sns.relplot(x='kms_driven',y='Price',data=cars,height=7,aspect=1.5)
plt.show()

Price and Mileage Relationship: It is likely that cars with fewer kilometers driven will
have higher prices, as they are perceived to be less used.
Spread: A large spread of data points might indicate a less clear or complex relationship,
suggesting further feature engineering may be needed.
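A Pearson correlation coefficient condenses the scatter plot into a single number. The illustrative figures below stand in for the real `cars` DataFrame:

```python
import pandas as pd

# Illustrative numbers; the real `cars` DataFrame comes from the previous post
cars = pd.DataFrame({
    'kms_driven': [45000, 30000, 70000, 15000, 90000],
    'Price': [300000, 420000, 250000, 550000, 180000],
})

# A value near -1 means higher mileage tracks closely with lower prices;
# a value near 0 would support the "less clear relationship" reading above
corr = cars['kms_driven'].corr(cars['Price'])
print(round(corr, 3))
```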

Let's create a boxplot to examine the distribution of car prices based on fuel type in
the dataset:

plt.subplots(figsize=(14, 7))
sns.boxplot(x='fuel_type', y='Price', data=cars)
plt.show()

Price Distribution by Fuel Type: The boxplot reveals how prices are distributed across
different fuel types. If one fuel type has a wider range or higher median prices, it
could indicate that certain fuel types are more expensive on average.
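The exact medians behind the boxplot are easy to pull out. Here is a sketch with toy data standing in for the `cars` DataFrame:

```python
import pandas as pd

# Toy data standing in for the `cars` DataFrame
cars = pd.DataFrame({
    'fuel_type': ['Petrol', 'Diesel', 'Petrol', 'Diesel', 'LPG'],
    'Price': [250000, 600000, 300000, 550000, 150000],
})

# Median price per fuel type complements the boxplot with exact central values
medians = cars.groupby('fuel_type')['Price'].median().sort_values(ascending=False)
print(medians)
```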

Model Building
X = cars.drop(columns = 'Price')
Y = cars['Price']

We prepared the dataset for model training by separating the features and the target
variable. The features (independent variables), which include attributes such as company,
year, kms_driven, and fuel_type, were stored in the variable X. The target variable
(dependent variable), which is the car's price, was assigned to the variable Y.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)  # fixed seed so the split is reproducible

To prepare for model training, we split the dataset into training and testing sets using
the train_test_split function from Scikit-learn. The data was divided so that 80% of the
data was used for training the model (X_train and y_train), while the remaining 20% was
set aside for testing and evaluating the model’s performance (X_test and y_test). This
ensures that we can assess the model’s ability to generalize to unseen data.

Importing the dependencies:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error
from xgboost import XGBRegressor

ohe = OneHotEncoder()
ohe.fit(X[['name', 'company', 'fuel_type']])

column_trans = make_column_transformer(
    (OneHotEncoder(categories=ohe.categories_), ['name', 'company', 'fuel_type']),
    remainder='passthrough'
)

To handle categorical variables effectively, we used OneHotEncoder from
Scikit-learn. The OneHotEncoder was applied to the name, company, and fuel_type
columns in the dataset, which contain categorical data. This encoding transforms
the categorical variables into a format that can be used by machine learning models
by creating a binary column for each unique category.

Next, we used make_column_transformer to apply the encoding only to the specified
columns (name, company, and fuel_type) while leaving the remaining columns
untouched. This ensures that the categorical features are properly transformed,
while the numeric features pass through unchanged, making the data ready for model
training.
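A tiny, hypothetical column makes the encoder's output concrete:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A tiny, hypothetical column to show what the encoder produces
fuel = pd.DataFrame({'fuel_type': ['Petrol', 'Diesel', 'Petrol']})

ohe = OneHotEncoder()
encoded = ohe.fit_transform(fuel).toarray()  # default output is sparse

print(ohe.categories_)  # one sorted array of categories per input column
print(encoded)          # one binary column per category
```

Each row gets a 1 in the column matching its category and 0 elsewhere, which is exactly the representation the regressors below can consume.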

Instantiating the model objects:
scaler = StandardScaler()
lr = LinearRegression()
la = Lasso()
ri = Ridge()
xgb = XGBRegressor()
rfr = RandomForestRegressor()
ada = AdaBoostRegressor()
gbr = GradientBoostingRegressor()
dt = DecisionTreeRegressor()

Creating the dictionary of the above objects:

Regressors = {
    'Linear Regression' : lr,
    'Lasso' : la,
    'Ridge': ri,
    'XGBRegressor': xgb,
    'RandomForestRegressor': rfr,
    'AdaBoostRegressor': ada,
    'Gradient Boosting Regressor': gbr,
    'Decision Tree Regressor': dt
}

Applying a for loop to run each of the algorithms above and find the best model:
for name, model in Regressors.items():
    pipe = make_pipeline(column_trans, model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print("For", name)
    print("R2 Score:", r2_score(y_test, y_pred))
    print("MSE:", mean_squared_error(y_test, y_pred))
    print("=" * 50)

And these are the results:
For Linear Regression:           R2 Score 0.7687   MSE 3.3970e+10
For Lasso:                       R2 Score 0.7495   MSE 3.6792e+10
For Ridge:                       R2 Score 0.1523   MSE 1.2451e+11
For XGBRegressor:                R2 Score 0.7600   MSE 3.5254e+10
For RandomForestRegressor:       R2 Score 0.8296   MSE 2.5024e+10
For AdaBoostRegressor:           R2 Score 0.2881   MSE 1.0456e+11
For Gradient Boosting Regressor: R2 Score 0.7866   MSE 3.1341e+10
For Decision Tree Regressor:     R2 Score 0.6396   MSE 5.2929e+10
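The loop only prints the metrics; a small extension could collect them and pick the winner programmatically. The sketch below hard-codes a few of the R² scores from the run above purely for illustration:

```python
# Hard-coded R2 scores from the run above, for illustration only
scores = {
    'Linear Regression': 0.7687,
    'Ridge': 0.1523,
    'RandomForestRegressor': 0.8296,
    'Gradient Boosting Regressor': 0.7866,
}

# Pick the model name with the highest R2 score
best_name = max(scores, key=scores.get)
print(best_name)
```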

As the Random Forest Regressor has the highest R² score, that model was selected and
a pipeline was built:

pipe = make_pipeline(column_trans, rfr)
pipe.fit(X_train, y_train)

The pipeline was serialized with the pickle module so it can be loaded in the web
application to make predictions:

import pickle
pickle.dump(pipe,open("Model.pkl",'wb'))
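To sanity-check the dump, the saved file can be reloaded and used for a prediction, exactly as the web application will do. This is a minimal sketch: it uses a stand-in LinearRegression pipeline and made-up rows, since the real `pipe` holds the fitted column transformer and Random Forest:

```python
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Stand-in pipeline with made-up rows; the real `pipe` holds the fitted
# column transformer and Random Forest from the steps above
X = pd.DataFrame({'year': [2015, 2018, 2020], 'kms_driven': [60000, 30000, 10000]})
y = pd.Series([300000, 500000, 700000])
pipe = make_pipeline(LinearRegression()).fit(X, y)

with open('Model.pkl', 'wb') as f:
    pickle.dump(pipe, f)

# What the web application will do: reload the pipeline and predict
with open('Model.pkl', 'rb') as f:
    loaded = pickle.load(f)

pred = loaded.predict(pd.DataFrame({'year': [2019], 'kms_driven': [20000]}))
print(pred)
```

Because the whole pipeline is pickled, the web application only needs the raw user inputs; the encoding and the model run inside the loaded object.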

Conclusion

In this blog, we successfully completed the Exploratory Data Analysis (EDA) and built a
predictive model for used car price estimation. By visualizing the relationships between
features such as the car's company, year, fuel type, and kilometers driven, we gained
valuable insights into how these factors influence car prices.

Through EDA, we identified trends, price distributions, and potential outliers, which helped
in preparing the data for model building. We then trained a variety of regression models,
including Linear Regression, Lasso, Ridge, Random Forest, XGBoost, and others, to predict car
prices. After evaluating model performance, the Random Forest Regressor was chosen as the
best-performing model, achieving the highest R² score and the lowest Mean Squared Error (MSE).

Finally, we wrapped the model into a pipeline and saved it using the pickle module, making it
ready for deployment in a web application where it can predict used car prices based on user
inputs. The next step is to deploy this model as a functional web application, where users
can input car features and receive price predictions.

