Car Price Predictor (Data Preparation)

Predicting Used Car Prices: A Machine Learning Approach

The used car market is vast and dynamic, with prices influenced by various factors such as brand, model, year, mileage, fuel type, and location. Accurately estimating the price of a used car can be a challenging task for both buyers and sellers who are looking to make informed decisions.

In this blog, I walk you through the process of cleaning and preparing a dataset to predict used car prices using machine learning. By working through various data preprocessing steps—such as handling missing values, converting data types, and dealing with outliers—I ensure that the dataset is ready for Exploratory Data Analysis (EDA). This will allow us to identify patterns and relationships that will help build an effective predictive model.

Loading and Exploring the Data

We start by loading the dataset and understanding its structure:

import pandas as pd

import numpy as np

cars = pd.read_csv('quikr_car.csv')

Data Overview

The first step is to inspect the dataset:

cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        892 non-null    object
 1   company     892 non-null    object
 2   year        892 non-null    object
 3   Price       892 non-null    object
 4   kms_driven  840 non-null    object
 5   fuel_type   837 non-null    object
dtypes: object(6)
memory usage: 41.9+ KB

Issues in the dataset:
Data type mismatches:
The year, Price, and kms_driven columns should be numeric but are stored as objects. 
They need to be converted properly.
Missing values:
kms_driven has 840 non-null values, meaning 52 values are missing.
fuel_type has 837 non-null values, meaning 55 values are missing.
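
Before cleaning, it can help to look at what the problem entries actually are. The following is a minimal sketch on a made-up toy frame (the `sample` data and the names `missing_share` and `bad_years` are assumptions for illustration, not part of the real CSV); it reports the share of missing values per column and lists the 'year' entries that are present but not numeric.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
sample = pd.DataFrame({
    "year": ["2015", "2012", "sale", None],
    "Price": ["80,000", "Ask For Price", "4,25,000", "1,50,000"],
})

# Fraction of missing values in each column
missing_share = sample.isna().mean()

# 'year' entries that exist but are not made up of digits only
year_text = sample["year"].dropna()
bad_years = year_text[~year_text.str.isnumeric()]

print(missing_share)
print(bad_years.tolist())  # ['sale']
```

Running the same two checks on `cars` would show which values the cleaning steps need to handle.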

cars.describe()

	name	company	year	Price	kms_driven	fuel_type
count	892	892	892	892	840	837
unique	525	48	61	274	258	3
top	Honda City	Maruti	2015	Ask For Price	45,000 kms	Petrol
freq	13	235	117	35	30	440

cars.isna().sum()

name           0
company        0
year           0
Price          0
kms_driven    52
fuel_type     55
dtype: int64

Data Cleaning

Handling the 'year' column

cars = cars[cars['year'].str.isnumeric()]

This keeps only the rows where the 'year' column contains purely numeric values, 
removing any rows where the year is text, contains special characters, or is 
otherwise an incorrect entry.

Convert the 'year' column to integers using:

cars['year'] = cars['year'].astype(int)
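
The two 'year' steps can be tried on a small made-up frame. This is a sketch rather than the exact code used above: the `sample` data is hypothetical, and the added `.copy()` is an assumption on my part that avoids the SettingWithCopyWarning some pandas versions emit when assigning to a filtered frame.

```python
import pandas as pd

# Hypothetical toy data; "sale" stands in for a junk entry
sample = pd.DataFrame({"year": ["2015", "2012", "sale", "2009"]})

# Keep digit-only years; .copy() makes the result an independent frame
sample = sample[sample["year"].str.isnumeric()].copy()

# Convert the remaining strings to integers
sample["year"] = sample["year"].astype(int)

print(sample["year"].tolist())  # [2015, 2012, 2009]
```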

Handling the 'Price' column

cars = cars[cars['Price']!='Ask For Price']

This removes all rows where the 'Price' column contains the text "Ask For Price", 
ensuring that only numeric price values remain in the dataset.

Convert 'Price' to a numeric format by removing the comma separators and 
converting the type:

cars['Price']=cars['Price'].str.replace(',','').astype(int)
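
Here is the same 'Price' cleanup on a few made-up values, as a rough sketch. The `sample` frame is hypothetical, and passing `regex=False` is my addition so that the comma is always treated as a literal character.

```python
import pandas as pd

# Hypothetical toy data with one placeholder entry
sample = pd.DataFrame({"Price": ["80,000", "Ask For Price", "4,25,000"]})

# Drop the placeholder rows, then strip commas and convert to int
sample = sample[sample["Price"] != "Ask For Price"].copy()
sample["Price"] = sample["Price"].str.replace(",", "", regex=False).astype(int)

print(sample["Price"].tolist())  # [80000, 425000]
```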

Handling the 'kms_driven' column

cars['kms_driven'] = cars['kms_driven'].str.split(" ").str.get(0)

This operation modifies the 'kms_driven' column by:

Extracting only the numeric portion from values like "40,000 kms", keeping just "40,000".

Removing the unit "kms", making it easier to convert to a numeric type.

cars['kms_driven'] = cars['kms_driven'].str.replace(',', '')

This operation:

Removes commas from the 'kms_driven' column, converting values like "40,000" to "40000".

Makes it easier to convert the column to a numeric type for further analysis.

cars = cars[cars['kms_driven'].str.isnumeric()]

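This operation:
Keeps only rows where 'kms_driven' contains purely numeric values.
Removes rows with non-numeric entries (e.g., "Unknown" or empty strings).
Note that on object-dtype columns .str.isnumeric() returns NaN rather than False 
for a missing value, and pandas typically refuses to filter with a mask that 
contains NaN. The line works here presumably because the earlier filters had 
already dropped the rows with a missing 'kms_driven'.

Convert the 'kms_driven' column to a float using: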

cars['kms_driven']=cars['kms_driven'].astype(float)
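
The four 'kms_driven' steps can be sketched end to end on made-up data. This is an illustrative variant, not the exact code above: the `sample` frame is hypothetical, and I fill missing values with an empty string before calling `.str.isnumeric()` so that the mask is plain True/False even when a value is missing.

```python
import pandas as pd

# Hypothetical toy data: a normal value, stray text, and a missing value
sample = pd.DataFrame({"kms_driven": ["45,000 kms", "Petrol", None, "1,200 kms"]})

# Take the part before the first space, then drop the commas
kms = sample["kms_driven"].str.split(" ").str.get(0)
sample["kms_driven"] = kms.str.replace(",", "", regex=False)

# An empty string is not numeric, so missing values become False in the mask
mask = sample["kms_driven"].fillna("").str.isnumeric()
sample = sample[mask].copy()

sample["kms_driven"] = sample["kms_driven"].astype(float)
print(sample["kms_driven"].tolist())  # [45000.0, 1200.0]
```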

Handling the 'fuel_type' column

cars = cars[~cars['fuel_type'].isna()]

This operation:

Removes rows where 'fuel_type' is missing (NaN values).
Ensures that all remaining rows have a valid fuel type.
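
For reference, pandas also offers `dropna(subset=...)`, which should give the same result as the boolean mask above. A small sketch on a made-up frame (the `sample` data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing fuel type
sample = pd.DataFrame({
    "name": ["A", "B", "C"],
    "fuel_type": ["Petrol", None, "Diesel"],
})

# Mask-based filter, as used in the post
by_mask = sample[~sample["fuel_type"].isna()]

# Equivalent filter using dropna on a single column
by_dropna = sample.dropna(subset=["fuel_type"])

print(by_mask.equals(by_dropna))  # True
```

Either form works; `dropna` is convenient when several columns need to be checked at once.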

Handling the 'name' column

cars['name'] = cars['name'].str.split(" ").str.slice(0,3).str.join(" ")

This operation:
Splits the 'name' column (car model names) by spaces.
Slices the first three words from the split name (e.g., from "Toyota Corolla 
2020 XLI" to "Toyota Corolla 2020").
Joins the first three words back into a single string.
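
To see the effect on names of different lengths, here is a short sketch using made-up names (the `names` series is an assumption for illustration). Names with fewer than three words pass through unchanged.

```python
import pandas as pd

# Hypothetical car names of varying length
names = pd.Series([
    "Toyota Corolla 2020 XLI",
    "Honda City",
    "Maruti Suzuki Swift Dzire VDI",
])

# Split on spaces, keep at most the first three words, join them back
short = names.str.split(" ").str.slice(0, 3).str.join(" ")

print(short.tolist())
# ['Toyota Corolla 2020', 'Honda City', 'Maruti Suzuki Swift']
```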

General data handling

cars.describe()

year	Price	kms_driven
count	816.000000	8.160000e+02	816.000000
mean	2012.444853	4.117176e+05	46275.531863
std	4.002992	4.751844e+05	34297.428044
min	1995.000000	3.000000e+04	0.000000
25%	2010.000000	1.750000e+05	27000.000000
50%	2013.000000	2.999990e+05	41000.000000
75%	2015.000000	4.912500e+05	56818.500000
max	2019.000000	8.500003e+06	400000.000000

cars = cars[cars['Price']<6e6]

This operation:
Keeps only rows where the 'Price' is less than 6 million (6,000,000).
Removes cars with a price of 6 million or more.

Effect on Dataset:
Outliers in the 'Price' column are filtered out, such as the maximum of about 
8.5 million shown by describe().
This keeps the analysis focused on more reasonably priced cars and avoids 
skewing the model with outliers.
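
Before dropping rows with a fixed cutoff, it is worth checking how many rows the cutoff affects. A minimal sketch on made-up prices (the `sample` data and the names `before` and `outliers` are hypothetical):

```python
import pandas as pd

# Hypothetical prices, with one extreme value
sample = pd.DataFrame({"Price": [175_000, 300_000, 490_000, 8_500_003]})

before = len(sample)

# Rows that the cutoff would remove, kept aside for inspection
outliers = sample[sample["Price"] >= 6e6]

# Apply the same filter as in the post
sample = sample[sample["Price"] < 6e6]

print(f"removed {before - len(sample)} of {before} rows")  # removed 1 of 4 rows
```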

cars = cars.reset_index(drop=True)

This operation:
Resets the index of the DataFrame, starting from 0 and incrementing 
sequentially.
Drops the old index (doesn't add it as a new column), ensuring the index is 
now a clean range from 0 to the number of rows minus 1.
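
The effect is easiest to see on a small made-up frame, where filtering leaves a gap in the index (the `sample` data is hypothetical):

```python
import pandas as pd

# Hypothetical prices; the middle row will be filtered out
sample = pd.DataFrame({"Price": [175_000, 8_500_003, 300_000]})

filtered = sample[sample["Price"] < 6e6]
print(filtered.index.tolist())  # [0, 2]

# drop=True discards the old index instead of keeping it as a column
clean = filtered.reset_index(drop=True)
print(clean.index.tolist())  # [0, 1]
```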

Conclusion

In this blog, we cleaned and preprocessed the dataset so that it is ready for 
Exploratory Data Analysis (EDA). We converted the year, Price, and kms_driven 
columns to numeric types, dropped rows with non-numeric values or a missing fuel 
type, filtered out unrealistically high prices, and reset the index so the dataset 
is clean and consistent.

With the data now preprocessed, we can proceed to EDA in the next blog. There we 
will explore the relationships between different features, visualize data 
distributions, and uncover patterns that can help in building an accurate 
predictive model for used car prices.
