Used Car Price Predictor (EDA and Model Building)
Exploratory Data Analysis (EDA) and Model Building for Used Car Price Prediction
With the dataset cleaned and preprocessed, it's time to take the next step: Exploratory Data Analysis (EDA) and model building. In this blog, we will begin by exploring the dataset to uncover valuable insights and identify relationships between key features, such as year, mileage, fuel type, and brand, and their impact on car prices.
Through visualizations and statistical analysis, we'll identify trends and correlations that will guide the construction of a machine learning model. After the EDA, we'll move on to building a predictive model that can accurately estimate the price of a used car based on its attributes.
Exploratory Data Analysis
Let's create a boxplot to visualize the distribution of used car prices for each company in the dataset:
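The plot itself is not reproduced here, but a minimal sketch of how such a boxplot can be drawn looks like the following. The DataFrame name car and its toy values are hypothetical stand-ins for the cleaned dataset:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Toy stand-in for the cleaned dataset (hypothetical values)
car = pd.DataFrame({
    "company": ["Maruti", "Maruti", "Hyundai", "Hyundai", "Ford", "Ford"],
    "Price": [250000, 300000, 400000, 450000, 350000, 380000],
})

# One box of prices per company
car.boxplot(column="Price", by="company", rot=45, figsize=(10, 5))
plt.suptitle("")          # drop pandas' automatic super-title
plt.ylabel("Price")
plt.tight_layout()
plt.savefig("price_by_company.png")
```

In the real notebook the same call is simply made on the full cleaned DataFrame.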
Next, we examined the relationship between the kilometers driven (kms_driven) and price (Price) of the cars in the dataset. With the exploration done, the features name, company, year, kms_driven, and fuel_type were stored in the variable X, and the target variable Price in Y.
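As a concrete sketch of that feature/target split (again with a hypothetical mini-dataset standing in for the real one):

```python
import pandas as pd

# Hypothetical two-row stand-in for the cleaned data
car = pd.DataFrame({
    "name": ["Maruti Suzuki Swift", "Hyundai i20"],
    "company": ["Maruti", "Hyundai"],
    "year": [2015, 2018],
    "kms_driven": [45000, 30000],
    "fuel_type": ["Petrol", "Diesel"],
    "Price": [300000, 550000],
})

# Features in X, target in Y
X = car[["name", "company", "year", "kms_driven", "fuel_type"]]
Y = car["Price"]
```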
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
To prepare for model training, we split the dataset into training and testing sets using the train_test_split function from Scikit-learn. The data was divided so that 80% of the data was used for training the model (X_train and y_train), while the remaining 20% was set aside for testing and evaluating the model’s performance (X_test and y_test). This ensures that we can assess the model’s ability to generalize to unseen data.
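One detail worth noting: the call above does not fix random_state, so every run produces a different split. Passing a seed, which is an addition not present in the original code, makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_demo = np.arange(10)

# Same seed -> identical split on every run
X_tr1, X_te1, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
```

This matters when comparing models, since otherwise each model could be scored on a different test set across notebook re-runs.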
Importing the dependencies:

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error
ohe = OneHotEncoder()
ohe.fit(X[['name', 'company', 'fuel_type']])
column_trans = make_column_transformer(
    (OneHotEncoder(categories=ohe.categories_), ['name', 'company', 'fuel_type']),
    remainder='passthrough'
)

To handle categorical variables effectively, we used OneHotEncoder from
Scikit-learn. The OneHotEncoder was applied to the name, company, and fuel_type
columns in the dataset, which contain categorical data. This encoding transforms
the categorical variables into a format that can be used by machine learning models
by creating binary columns for each unique category.
Next, we used make_column_transformer to apply the encoding only to the specified
columns (name, company, and fuel_type) while leaving the remaining columns
untouched. This ensures that the categorical features are properly transformed,
while the numeric features remain as they are, making the data ready for model
training.
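To make the "binary columns" idea concrete, here is a toy example with hypothetical fuel types, not the real dataset:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame({"fuel_type": ["Petrol", "Diesel", "Petrol"]})

enc = OneHotEncoder()
encoded = enc.fit_transform(demo).toarray()

# Categories are sorted alphabetically: Diesel first, then Petrol,
# so each row becomes a pair of 0/1 indicator columns
print(enc.categories_)
print(encoded)  # [[0. 1.], [1. 0.], [0. 1.]]
```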
Instantiating the models:

scaler = StandardScaler()
lr = LinearRegression()
la = Lasso()
ri = Ridge()
xgb = XGBRegressor()
rfr = RandomForestRegressor()
ada = AdaBoostRegressor()
gbr = GradientBoostingRegressor()
dt = DecisionTreeRegressor()

Creating a dictionary of the above objects:
Regressors = {
    'Linear Regression': lr,
    'Lasso': la,
    'Ridge': ri,
    'XGBRegressor': xgb,
    'RandomForestRegressor': rfr,
    'AdaBoostRegressor': ada,
    'Gradient Boosting Regressor': gbr,
    'Decision Tree Regressor': dt
}
Looping over the dictionary to fit each algorithm and find the best model:

for name, model in Regressors.items():
    pipe = make_pipeline(column_trans, model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print("For ", name)
    print("R2 Score ", r2_score(y_test, y_pred))
    print("MSE ", mean_squared_error(y_test, y_pred))
    print("==================================================")
And these are the results:

For Linear Regression
R2 Score 0.7687202450532246
MSE 33970100109.577896
==================================================
For Lasso
R2 Score 0.7495111031029826
MSE 36791516429.47577
==================================================
For Ridge
R2 Score 0.15231305241746917
MSE 124507268167.88583
==================================================
For XGBRegressor
R2 Score 0.7599780559539795
MSE 35254142332.93068
==================================================
For RandomForestRegressor
R2 Score 0.8296277674457289
MSE 25024074403.281742
==================================================
For AdaBoostRegressor
R2 Score 0.28812190514685254
MSE 104559822599.00856
==================================================
For Gradient Boosting Regressor
R2 Score 0.7866198557566625
MSE 31341026208.759438
==================================================
For Decision Tree Regressor
R2 Score 0.6396406352374719
MSE 52929162343.77761
==================================================

As the Random Forest Regressor has the highest R² score, that model was selected and
a pipeline was built around it:

pipe = make_pipeline(column_trans, rfr)
pipe.fit(X_train, y_train)

The pipeline was then serialized with the pickle module so it can be loaded in the web
application to make predictions:
import pickle

with open("Model.pkl", "wb") as f:
    pickle.dump(pipe, f)
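On the web-application side, the saved file is loaded back with pickle.load and used for prediction. Here is a self-contained round-trip sketch; the stand-in linear model and the numbers are hypothetical, not the real pipeline:

```python
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression

# Train and dump a tiny stand-in model (hypothetical data)
X_demo = pd.DataFrame({"year": [2015, 2018, 2020],
                       "kms_driven": [60000, 30000, 10000]})
y_demo = [300000, 500000, 700000]

with open("Model.pkl", "wb") as f:
    pickle.dump(LinearRegression().fit(X_demo, y_demo), f)

# What the web application does: load once, then predict per request
with open("Model.pkl", "rb") as f:
    model = pickle.load(f)

pred = model.predict(pd.DataFrame({"year": [2019], "kms_driven": [20000]}))
```

Note that unpickling runs arbitrary code from the file, so the web app should only load model files it created itself.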
Conclusion

In this blog, we successfully completed the Exploratory Data Analysis (EDA) and built a
predictive model for used car price estimation. By visualizing the relationships between
various features such as the car's company, year, fuel type, and kilometers driven, we gained
valuable insights into how these factors influence car prices.
Through EDA, we identified trends, price distributions, and potential outliers, which helped
in preparing the data for model building. We then used a variety of regression models,
including Linear Regression, Lasso, Ridge, Random Forest, XGBoost, and others, to predict car
prices. After evaluating model performance, the Random Forest Regressor was chosen as the
best-performing model, achieving the highest R² score and the lowest Mean Squared Error (MSE).
Finally, we wrapped the model into a pipeline and saved it using the pickle module, making it
ready for deployment in a web application where it can predict used car prices based on user
inputs. The next steps would involve deploying this model as a functional web application,
where users can input car features and receive accurate price predictions.



