Car Price Predictor (Data Preparation)

 Predicting Used Car Prices: A Machine Learning Approach

The used car market is vast and dynamic, with prices influenced by various factors such as brand, model, year, mileage, fuel type, and location. Accurately estimating the price of a used car can be a challenging task for both buyers and sellers who are looking to make informed decisions.

In this post, I walk through cleaning and preparing a dataset for predicting used car prices with machine learning. The preprocessing steps (handling missing values, converting data types, and dealing with outliers) leave the dataset ready for Exploratory Data Analysis (EDA), where we can identify the patterns and relationships that will help build an effective predictive model.

Loading and Exploring the Data

We start by loading the dataset and understanding its structure:

import pandas as pd
import numpy as np
cars = pd.read_csv('quikr_car.csv')

Data Overview

The first step is to inspect the dataset:

cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   name        892 non-null    object
 1   company     892 non-null    object
 2   year        892 non-null    object
 3   Price       892 non-null    object
 4   kms_driven  840 non-null    object
 5   fuel_type   837 non-null    object
dtypes: object(6)
memory usage: 41.9+ KB

Issues in the dataset:

  1. Data type mismatches:

  • The year, Price, and kms_driven columns should be numeric but are stored as objects. They need to be converted properly.

  2. Missing values:

  • kms_driven has 840 non-null values, meaning 52 values are missing.
  • fuel_type has 837 non-null values, meaning 55 values are missing.

cars.describe()

              name  company  year          Price  kms_driven  fuel_type
count          892      892   892            892         840        837
unique         525       48    61            274         258          3
top     Honda City   Maruti  2015  Ask For Price  45,000 kms     Petrol
freq            13      235   117             35          30        440

cars.isna().sum()

name           0
company        0
year           0
Price          0
kms_driven    52
fuel_type     55
dtype: int64
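Raw counts are easier to judge as a share of the rows: 52 of 892 is roughly 5.8%, and 55 of 892 is roughly 6.2%. Here is a minimal sketch of computing that share per column. The `toy` frame and its values are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({
    "kms_driven": ["45,000 kms", np.nan, "10,000 kms", np.nan],
    "fuel_type": ["Petrol", "Diesel", np.nan, "Petrol"],
})

# isna() gives a boolean frame; the column mean of booleans is the
# fraction of missing entries, so multiplying by 100 yields a percentage.
missing_share = toy.isna().mean() * 100

print(missing_share.round(1).to_dict())
```

On the real data, the same expression applied to `cars` would report the percentage of missing values for each of the six columns.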

Data Cleaning

Handling the 'year' column
cars = cars[cars['year'].str.isnumeric()]

This keeps only the rows where the 'year' column consists entirely of digits,
removing any rows whose year field holds text, special characters, or other
incorrect entries.

Convert the 'year' column to an integer using:
cars['year'] = cars['year'].astype(int)
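As an alternative to filtering with str.isnumeric(), the same cleanup can be sketched with pd.to_numeric and errors="coerce". This is a swapped-in technique, not the approach used in this post, and it is not strictly equivalent: it also accepts strings such as "2015.0". The toy values below are invented for illustration.

```python
import pandas as pd

# Hypothetical toy stand-in for the 'year' column.
toy = pd.DataFrame({"year": ["2015", "2009", "sale", None, "2018"]})

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
year_num = pd.to_numeric(toy["year"], errors="coerce")

# Keep only the rows that parsed, working on an explicit copy so the
# assignment below modifies an independent frame.
parsed = year_num.notna()
toy_clean = toy[parsed].copy()
toy_clean["year"] = year_num[parsed].astype(int)

print(toy_clean["year"].tolist())
```

One practical difference is that missing values are handled by the same mask as malformed ones, so no separate check for NaN is needed.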

Handling the 'Price' column
cars = cars[cars['Price']!='Ask For Price']

This removes all rows where the 'Price' column contains the text "Ask For Price",
ensuring that only numeric price values remain in the dataset.

Convert 'Price' to a numeric format by removing the thousands separators
(commas) and casting the result to an integer:

cars['Price']=cars['Price'].str.replace(',','').astype(int)
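Assigning to a column of a filtered DataFrame can trigger pandas' SettingWithCopyWarning in some versions. A small sketch of the two 'Price' steps on invented toy values, with an explicit .copy() after the filter to sidestep that warning:

```python
import pandas as pd

# Hypothetical toy values; the real column holds many more entries.
toy = pd.DataFrame({"Price": ["80,000", "Ask For Price", "4,25,000", "1,30,000"]})

# .copy() makes the filtered frame independent of `toy`, so the
# column assignment below acts on its own data.
priced = toy[toy["Price"] != "Ask For Price"].copy()

# regex=False treats the comma as a literal character, not a pattern.
priced["Price"] = priced["Price"].str.replace(",", "", regex=False).astype(int)

print(priced["Price"].tolist())
```

Note that astype(int) still fails if any other non-numeric text remains in the column, which makes it a useful check that the filter caught everything.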

Handling the 'kms_driven' column
cars['kms_driven'] = cars['kms_driven'].str.split(" ").str.get(0)

This operation modifies the 'kms_driven' column by:
  • Extracting only the numeric portion from values like "45,000 kms",
keeping just "45,000".
  • Removing the unit "kms", making it easier to convert to a numeric type.
cars['kms_driven'] = cars['kms_driven'].str.replace(',', '')

This operation:
  • Removes commas from the 'kms_driven' column, converting values like "40,000"
to "40000".
  • Makes it easier to convert the column to a numeric type for further analysis.
cars = cars[cars['kms_driven'].str.isnumeric()]

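This operation:
  • Keeps only rows where 'kms_driven' contains purely numeric values.
  • Removes rows with non-numeric entries (e.g., "Unknown" or empty strings).
  • Expects no missing values to remain in the column at this point: depending
on the pandas version and dtype, NaN in the mask is either treated as False or
raises an error.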

Convert the 'kms_driven' column to a float using:
cars['kms_driven']=cars['kms_driven'].astype(float)
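The four 'kms_driven' steps can also be chained. The sketch below runs them on invented toy values and fills missing entries with an empty string before calling str.isnumeric(), which is an added assumption so that the mask contains only True or False even when NaN is present.

```python
import pandas as pd

# Hypothetical toy values, including one stray word and one missing entry.
toy = pd.DataFrame({"kms_driven": ["45,000 kms", "1,200 kms", "Petrol", None]})

# Take the first whitespace-separated token, then drop the commas.
kms = (
    toy["kms_driven"]
    .str.split(" ")
    .str.get(0)
    .str.replace(",", "", regex=False)
)

# An empty string is not numeric, so missing values end up as False.
mask = kms.fillna("").str.isnumeric()

clean = toy[mask].copy()
clean["kms_driven"] = kms[mask].astype(float)

print(clean["kms_driven"].tolist())
```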

Handling the 'fuel_type' column
cars = cars[~cars['fuel_type'].isna()]

This operation:
  • Removes rows where 'fuel_type' is missing (NaN values).
  • Ensures that all remaining rows have a valid fuel type.
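pandas also offers dropna(subset=...) for this kind of filter. As a quick check on a made-up toy frame, the sketch below compares it with the boolean mask used above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing fuel type.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "fuel_type": ["Petrol", np.nan, "Diesel"],
})

by_mask = toy[~toy["fuel_type"].isna()]          # approach used in this post
by_dropna = toy.dropna(subset=["fuel_type"])     # built-in alternative

# equals() compares values, dtypes, and index labels.
print(by_mask.equals(by_dropna))
```

Both forms keep the original index labels, so the gaps left by removed rows remain until the index is reset.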
Handling the 'name' column
cars['name'] = cars['name'].str.split(" ").str.slice(0,3).str.join(" ")

This operation:
  • Splits the 'name' column (car model names) by spaces.
  • Slices the first three words from the split name (e.g., from "Toyota Corolla
2020 XLI" to "Toyota Corolla 2020").
  • Joins the first three words back into a single string.
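To see the effect of the truncation, here is a short sketch on a few illustrative names (invented for this example). Names with three words or fewer pass through unchanged.

```python
import pandas as pd

# Illustrative listing names of different lengths.
names = pd.Series([
    "Hyundai Santro Xing XO eRLX Euro III",
    "Maruti Suzuki Alto",
    "Ford Figo",
])

# split -> list of words, slice -> first three words, join -> back to a string
short = names.str.split(" ").str.slice(0, 3).str.join(" ")

print(short.tolist())
```

Shortening the names this way reduces the number of distinct categories, at the cost of merging trims or variants that differ only after the third word.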
General data handling

cars.describe()

              year         Price     kms_driven
count   816.000000  8.160000e+02     816.000000
mean   2012.444853  4.117176e+05   46275.531863
std       4.002992  4.751844e+05   34297.428044
min    1995.000000  3.000000e+04       0.000000
25%    2010.000000  1.750000e+05   27000.000000
50%    2013.000000  2.999990e+05   41000.000000
75%    2015.000000  4.912500e+05   56818.500000
max    2019.000000  8.500003e+06  400000.000000

cars = cars[cars['Price']<6e6]

This operation:

  • Keeps only rows where the 'Price' is less than 6 million (6,000,000).
  • Removes cars with a price of 6 million or more.

Effect on Dataset:

  • Extremely high prices are filtered out. In the summary above, the maximum
(about 8.5 million) sits far above the 75th percentile (about 491,000).
  • This keeps the analysis focused on typically priced cars and prevents a few
extreme values from skewing the model.
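Before dropping rows with a hard threshold, it can help to count how many rows it would remove. A minimal sketch on invented prices, where the 6e6 cutoff is the one used above and everything else is a toy assumption:

```python
import pandas as pd

# Hypothetical prices, with one extreme value.
toy = pd.DataFrame({"Price": [175_000, 300_000, 490_000, 8_500_003]})

threshold = 6e6
too_expensive = toy["Price"] >= threshold

# Summing a boolean Series counts the True values.
n_removed = int(too_expensive.sum())
trimmed = toy[~too_expensive]

print(n_removed, trimmed["Price"].max())
```

If the count turns out to be large, the threshold is probably cutting into the normal price range rather than removing a handful of outliers.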

cars = cars.reset_index(drop=True)

This operation:
  • Resets the index of the DataFrame, starting from 0 and incrementing
sequentially.
  • Drops the old index (doesn't add it as a new column), ensuring the index is
now a clean range from 0 to the number of rows minus 1.
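Putting the steps together, the whole cleaning pass can be sketched as a single function. The function name clean_cars, its max_price parameter, and the toy rows are all hypothetical. The explicit .copy() calls and the fillna("") before str.isnumeric() are additions not present in the code above, included so the sketch runs without warnings and tolerates missing values.

```python
import pandas as pd


def clean_cars(raw: pd.DataFrame, max_price: float = 6e6) -> pd.DataFrame:
    """Apply the cleaning steps from this post to a copy of `raw`."""
    df = raw.copy()

    # year: keep all-digit strings, then cast to int
    df = df[df["year"].fillna("").str.isnumeric()].copy()
    df["year"] = df["year"].astype(int)

    # Price: drop placeholder rows, strip commas, cast to int
    df = df[df["Price"] != "Ask For Price"].copy()
    df["Price"] = df["Price"].str.replace(",", "", regex=False).astype(int)

    # kms_driven: first token, no commas, numeric rows only, cast to float
    kms = df["kms_driven"].str.split(" ").str.get(0).str.replace(",", "", regex=False)
    keep = kms.fillna("").str.isnumeric()
    df = df[keep].copy()
    df["kms_driven"] = kms[keep].astype(float)

    # fuel_type: drop rows with a missing value
    df = df.dropna(subset=["fuel_type"]).copy()

    # name: keep the first three words
    df["name"] = df["name"].str.split(" ").str.slice(0, 3).str.join(" ")

    # Price outliers, then a fresh 0..n-1 index
    df = df[df["Price"] < max_price]
    return df.reset_index(drop=True)


# Toy rows invented to exercise each filter once.
raw = pd.DataFrame({
    "name": ["Hyundai Santro Xing XO eRLX", "Maruti Suzuki Alto 800 Vxi",
             "Ford Figo Diesel", "Honda City ZX", "BMW 7 Series 730Ld",
             "Maruti Suzuki Swift VDi"],
    "company": ["Hyundai", "Maruti", "Ford", "Honda", "BMW", "Maruti"],
    "year": ["2007", "2018", "sale", "2014", "2016", "2015"],
    "Price": ["80,000", "Ask For Price", "1,75,000", "3,25,000",
              "85,00,003", "4,10,000"],
    "kms_driven": ["45,000 kms", "22,000 kms", "30,000 kms", "Petrol",
                   "10,000 kms", "36,000 kms"],
    "fuel_type": ["Petrol", "Petrol", "Diesel", None, "Diesel", "Diesel"],
})

cleaned = clean_cars(raw)
print(cleaned[["name", "year", "Price", "kms_driven"]])
```

Only the first and last toy rows survive every filter, and after reset_index(drop=True) they are labelled 0 and 1 instead of 0 and 5.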

Conclusion

In this post, we cleaned and preprocessed the dataset so that it is ready for Exploratory Data Analysis (EDA). By addressing data type mismatches, handling missing values, and filtering out outliers, we refined the data and made it suitable for further analysis.

By filtering out irrelevant or incorrect entries, such as non-numeric values, missing fuel types, and unrealistic prices, we prepared a refined dataset. We also reset the index so that the dataset is clean and usable.

With the data now preprocessed, we can proceed to EDA in the next post. There, we will explore the relationships between different features, visualize data distributions, and uncover patterns that can help in building an accurate predictive model for used car prices.
