Car Price Predictor (Data Preparation)
Predicting Used Car Prices: A Machine Learning Approach
The used car market is vast and dynamic, with prices influenced by various factors such as brand, model, year, mileage, fuel type, and location. Accurately estimating the price of a used car can be a challenging task for both buyers and sellers who are looking to make informed decisions.
In this blog, I walk you through the process of cleaning and preparing a dataset to predict used car prices using machine learning. By working through various data preprocessing steps, such as handling missing values, converting data types, and dealing with outliers, I ensure that the dataset is ready for Exploratory Data Analysis (EDA). This will allow us to identify patterns and relationships that will help build an effective predictive model.
Loading and Exploring the Data
We start by loading the dataset and understanding its structure:
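A minimal sketch of the loading step, assuming the data comes as a CSV file read with pandas. The inline sample and its values are hypothetical stand-ins so the snippet runs on its own; only the column names follow the dataset described in this post.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file (illustrative values only).
SAMPLE_CSV = """\
name,company,year,Price,kms_driven,fuel_type
Honda City 1.5 S MT,Honda,2015,"4,50,000","45,000 kms",Petrol
Maruti Suzuki Alto 800,Maruti,2013,Ask For Price,"30,000 kms",Petrol
Hyundai Eon,Hyundai,2014,"2,00,000",,
"""

# With a real file this would be pd.read_csv("<path to the csv>").
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)   # (rows, columns)
print(df.head())  # first few rows
```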
Data Overview
The first step is to inspect the dataset:
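One way to run this inspection with pandas is sketched below. The three sample rows are hypothetical (the real dataset has 892 rows); `info()`, `isna().sum()`, and `describe()` are standard pandas calls that report dtypes, missing values, and summary statistics.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the dataset in this post.
df = pd.DataFrame({
    "name": ["Honda City 1.5 S MT", "Maruti Suzuki Alto 800", "Hyundai Eon"],
    "company": ["Honda", "Maruti", "Hyundai"],
    "year": ["2015", "2013", "2014"],
    "Price": ["4,50,000", "Ask For Price", "2,00,000"],
    "kms_driven": ["45,000 kms", "30,000 kms", None],
    "fuel_type": ["Petrol", "Petrol", None],
})

df.info()                             # dtypes and non-null counts per column
missing = df.isna().sum()             # number of missing values per column
summary = df.describe(include="all")  # count, unique, top, freq for text columns

print(missing)
print(summary)
```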
Issues in the dataset:
- Data type mismatches:
  - The year, Price, and kms_driven columns should be numeric but are stored as objects.
- Missing values:
  - kms_driven has 840 non-null values, meaning 52 values are missing.
  - fuel_type has 837 non-null values, meaning 55 values are missing.
| | name | company | year | Price | kms_driven | fuel_type |
|---|---|---|---|---|---|---|
| count | 892 | 892 | 892 | 892 | 840 | 837 |
| unique | 525 | 48 | 61 | 274 | 258 | 3 |
| top | Honda City | Maruti | 2015 | Ask For Price | 45,000 kms | Petrol |
| freq | 13 | 235 | 117 | 35 | 30 | 440 |
- Handling non-numeric 'Price' entries such as "Ask For Price".
- Extracting only the numeric portion from values like "40,000 km", which gives "40,000".
- Removing the unit "km", making it easier to convert to a numeric type.
- Removes commas from the 'kms_driven' column, converting values like "40,000" to "40000".
- Makes it easier to convert the column to a numeric type for further analysis.
- Keeps only rows where 'kms_driven' contains purely numeric values.
- Removes rows with missing values or non-numeric entries (e.g., "Unknown").
- Removes rows where 'fuel_type' is missing (NaN values).
- Ensures that all remaining rows have a valid fuel type.
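The 'kms_driven' and 'fuel_type' steps above can be sketched as one short pipeline. This is a minimal sketch on hypothetical sample rows; the digits-only check via `str.fullmatch` is one possible approach and not necessarily the exact call used on the real data.

```python
import pandas as pd

# Hypothetical sample rows (illustrative values only).
df = pd.DataFrame({
    "name": ["car a", "car b", "car c", "car d", "car e"],
    "kms_driven": ["40,000 kms", "Unknown", None, "1,20,000 kms", "36,000 kms"],
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Diesel", None],
})

# 1. Keep the first whitespace-separated token: "40,000 kms" -> "40,000".
df["kms_driven"] = df["kms_driven"].str.split().str.get(0)

# 2. Remove the commas: "40,000" -> "40000".
df["kms_driven"] = df["kms_driven"].str.replace(",", "", regex=False)

# 3. Keep only rows whose value is all digits.
#    na=False makes missing values count as non-matching.
is_numeric = df["kms_driven"].str.fullmatch(r"\d+", na=False)
df = df[is_numeric].copy()
df["kms_driven"] = df["kms_driven"].astype(int)

# 4. Drop rows where the fuel type is missing.
df = df[df["fuel_type"].notna()]

print(df)
```

In this sample, "car b" and "car c" are dropped by the numeric check and "car e" by the fuel type check.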
- Splits the 'name' column (car model names) by spaces.
- Slices the first three words from the split name (e.g., from "Toyota Corolla 2020 XLI" to "Toyota Corolla 2020").
- Joins the first three words back into a single string.
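The split, slice, and join steps map directly onto pandas string methods. A minimal sketch, using hypothetical car names:

```python
import pandas as pd

# Hypothetical car names (illustrative values only).
names = pd.Series(
    ["Toyota Corolla 2020 XLI", "Honda City", "Maruti Suzuki Swift VDi ABS"],
    name="name",
)

# Split on whitespace, keep at most the first three words, join them back.
short_names = names.str.split().str.slice(0, 3).str.join(" ")

print(short_names.tolist())
```

Names with fewer than three words, such as "Honda City", pass through unchanged.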
| | year | Price | kms_driven |
|---|---|---|---|
| count | 816.000000 | 8.160000e+02 | 816.000000 |
| mean | 2012.444853 | 4.117176e+05 | 46275.531863 |
| std | 4.002992 | 4.751844e+05 | 34297.428044 |
| min | 1995.000000 | 3.000000e+04 | 0.000000 |
| 25% | 2010.000000 | 1.750000e+05 | 27000.000000 |
| 50% | 2013.000000 | 2.999990e+05 | 41000.000000 |
| 75% | 2015.000000 | 4.912500e+05 | 56818.500000 |
| max | 2019.000000 | 8.500003e+06 | 400000.000000 |
This operation:
- Keeps only rows where the 'Price' is less than 6 million (6,000,000).
- Removes cars with a price of 6 million or more.
Effect on Dataset:
- Outliers in the 'Price' column are filtered out (i.e., extremely high prices).
- This can help focus the analysis on more reasonably priced cars.
Resetting the index:
- Resets the index of the DataFrame, starting from 0 and incrementing by 1.
- Drops the old index (doesn't add it as a new column), so the index is consecutive again after the filtering.
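The price filter and the index reset can be sketched together as follows. The rows are hypothetical; only the 6,000,000 threshold comes from the text above.

```python
import pandas as pd

# Hypothetical sample rows (illustrative values only).
df = pd.DataFrame({
    "name": ["car a", "car b", "car c", "car d"],
    "Price": [300000, 8500003, 450000, 6000000],
})

# Keep only rows with a price strictly below 6 million.
df = df[df["Price"] < 6_000_000]

# Renumber the rows from 0; drop=True discards the old index
# instead of adding it as a new column.
df = df.reset_index(drop=True)

print(df)
```

Without `drop=True`, pandas would keep the previous row labels in an extra column named `index`.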
In this blog, we cleaned and preprocessed the dataset so that it is ready for Exploratory Data Analysis (EDA). We addressed data type mismatches, handled missing values, filtered out irrelevant or incorrect entries such as non-numeric values, missing fuel types, and unrealistic prices, and reset the index so that the dataset is clean and consistent.
With the data now preprocessed, we can proceed to EDA in the next blog. There, we will explore the relationships between different features, visualize data distributions, and uncover patterns that can help in building an accurate predictive model for used car prices.