Predicting Bangalore House Prices: A Data Science Journey
Introduction
House prices are influenced by many factors, including location, size, and amenities. This blog walks step by step through building a predictive model for house prices in Bangalore using Python. We'll preprocess the data, perform exploratory data analysis, and build a machine learning model.
Loading and Exploring the Data
We start by loading the dataset and understanding its structure:
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
housing = pd.read_csv("Bengaluru_House_Data.csv")
housing.head()
Output: the first five rows of the dataset, showing columns such as area_type, availability, location, size, society, total_sqft, bath, balcony, and price.
Data Overview
The first step is to inspect the dataset:
housing.info()
housing.isnull().sum()
These Pandas commands give an overview of the dataset's structure and count the missing values in each column.
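Beyond raw counts, it can also help to look at the fraction of missing values per column. A minimal sketch on a toy frame (the column names mirror the dataset, but the values here are made up):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
df = pd.DataFrame({
    "location": ["Whitefield", None, "HSR Layout"],
    "bath": [2.0, np.nan, 3.0],
})

# Fraction of missing values per column, sorted worst-first
missing = df.isnull().mean().sort_values(ascending=False)
print(missing)
```

On the real dataset, the same two lines immediately show which columns (such as society) are too sparse to be worth keeping.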


Data Cleaning
Handling Specific Issues
We address columns with ambiguous or inconsistent data:
1. Handling size column:
housing_clean = housing.copy()
housing_clean["size"].unique()
housing_clean["bhk"] = housing_clean["size"].apply(lambda x: int(x.split(" ")[0]) if pd.notnull(x) else np.nan)
We first make a working copy of the data. The unique() call lists the distinct values in the "size" column (e.g. "2 BHK", "4 Bedroom"). The last line extracts the leading bedroom count into a new "bhk" column; a missing size stays NaN rather than raising an error.
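To see what the extraction does, here is the same lambda applied to a few made-up size strings:

```python
import pandas as pd

# Sample values in the style of the dataset's "size" column
sizes = pd.Series(["2 BHK", "4 Bedroom", "3 BHK"])

# Take the leading number before the first space
bhk = sizes.apply(lambda s: int(s.split(" ")[0]))
print(bhk.tolist())  # [2, 4, 3]
```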
2. Cleaning total_sqft column:
The total_sqft column contains ranges and non-numeric values. We process it to extract usable numbers:
def IsFloat(x):
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True
This helper checks whether a value can be parsed as a float; it does not perform the conversion itself.
def ConvertToSqFt(x, metric):
    # Conversion factors from each unit to square feet
    if metric == "Acres":
        return x * 43560
    elif metric == "Cents":
        return x * 435.6
    elif metric == "Grounds":
        return x * 2400
    elif metric == "Guntha":
        return x * 1088.98
    elif metric == "Perch":
        return x * 272.25
    elif metric == "Sq. Meter":
        return x * 10.7639
    elif metric == "Sq. Yards":
        return x * 9
    else:
        return np.nan
This function converts an area expressed in another unit into square feet; unrecognized units become NaN.
def ExtractTotalSqft(x):
    try:
        # A range like "2100 - 2850" becomes its midpoint
        values = x.split("-")
        return np.mean(list(map(float, values)))
    except (AttributeError, ValueError):
        if pd.isnull(x):
            return np.nan
        # Otherwise peel the unit suffix off entries like "34.46Sq. Meter"
        for intIndex in range(len(x) - 1, -1, -1):
            if IsFloat(x[0:intIndex]):
                return ConvertToSqFt(float(x[0:intIndex]), x[intIndex:])
housing_clean["sqft"] = housing_clean["total_sqft"].apply(ExtractTotalSqft)
This function combines the two helpers above to convert every "total_sqft" entry into a single float in square feet: plain numbers pass through, ranges are averaged, and entries with a unit suffix are converted via ConvertToSqFt.
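The range-averaging branch can be sanity-checked in isolation. This sketch covers only the "low - high" case (a plain number is treated as a one-element range); unit suffixes are handled by ConvertToSqFt above:

```python
import numpy as np

def parse_range(x):
    # Average a "low - high" range; a plain number is a one-element "range"
    values = x.split("-")
    return float(np.mean(list(map(float, values))))

print(parse_range("2100 - 2850"))  # 2475.0
print(parse_range("1200"))         # 1200.0
```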
3. Handling bath column:
The bath column may contain missing or unrealistic values. To address this, we fill missing values based on the most
common number of bathrooms for each bedroom count:
def FillBathrooms(bhk_groupby_bathroom, row):
    if pd.isnull(row["bath"]):
        # value_counts sorts by frequency, so index[0] is the most common bath count for this bhk
        return int(bhk_groupby_bathroom[row["bhk"]].index[0])
    else:
        return int(row["bath"])
bhk_groupby_bathroom = housing_clean.groupby("bhk")["bath"].value_counts()
housing_clean["bath"] = housing_clean.apply(lambda row: FillBathrooms(bhk_groupby_bathroom, row), axis=1)
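The fill logic relies on value_counts sorting by frequency within each group, so index[0] is the mode for that bedroom count. A toy run of the same idea:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bhk":  [2, 2, 2, 3, 3],
    "bath": [2.0, 2.0, 3.0, np.nan, 3.0],
})

# Most common bath count per bhk; NaNs are excluded from the counts
by_bhk = df.groupby("bhk")["bath"].value_counts()
filled = df.apply(
    lambda row: int(by_bhk[int(row["bhk"])].index[0]) if pd.isnull(row["bath"]) else int(row["bath"]),
    axis=1,
)
print(filled.tolist())  # [2, 2, 3, 3, 3]
```

The one missing value (a 3 BHK row) is filled with 3, the most frequent bath count among 3 BHK rows.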
4. Cleaning balcony column:
The balcony column may also contain missing values. We fill these based on the most common number of balconies for
each bedroom count:
def FillBalcony(bhk_groupby_balcony, row):
    if pd.isnull(row["balcony"]):
        return int(bhk_groupby_balcony[row["bhk"]].index[0])
    else:
        return int(row["balcony"])
bhk_groupby_balcony = housing_clean.groupby("bhk")["balcony"].value_counts()
housing_clean["balcony"] = housing_clean.apply(lambda row: FillBalcony(bhk_groupby_balcony, row), axis=1)
5. Dropping Unnecessary Columns:
We drop columns that are not useful for the analysis:
housing_clean.drop(["society", "size", "total_sqft"], inplace = True, axis=1)
6. Relabelling availability column:
We standardize the availability column by relabelling entries that contain a date range:
def RelabelAvailability(x):
    try:
        # Entries containing a dash (e.g. "19-Dec") are move-in dates
        if len(x.split("-")) > 1:
            return "Soon to be Vacated"
        else:
            return x
    except AttributeError:
        return ""
housing_clean["availability"] = housing_clean["availability"].apply(RelabelAvailability)
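On a small sample, the relabelling behaves as follows (the date strings are made up but follow the column's day-month style):

```python
import pandas as pd

avail = pd.Series(["19-Dec", "Ready To Move", "18-May"])

# Anything containing a dash is treated as a move-in date and relabelled
relabeled = avail.apply(lambda x: "Soon to be Vacated" if len(x.split("-")) > 1 else x)
print(relabeled.tolist())
```

"Ready To Move" contains no dash, so it passes through unchanged.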
7. Handling location column:
To reduce noise, we consolidate locations with 10 or fewer occurrences into a single category labelled "Other":
housing_clean["location"] = housing_clean["location"].apply(lambda x: x.strip())
unique_location_count = housing_clean.groupby("location")["location"].agg("count").sort_values(ascending = False)
unique_location_count_10 = unique_location_count[unique_location_count <= 10]
housing_clean["location"] = housing_clean["location"].apply(lambda x : "Other" if x in unique_location_count_10 else x)
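The same consolidation on a toy series shows how `x in rare` tests membership of the counted index (the location names), not the counts:

```python
import pandas as pd

locs = pd.Series(["Whitefield"] * 12 + ["Rare Street"] * 2)
counts = locs.value_counts()
rare = counts[counts <= 10]

# `x in rare` checks the Series *index*, i.e. the rare location names
consolidated = locs.apply(lambda x: "Other" if x in rare else x)
print(sorted(consolidated.unique()))  # ['Other', 'Whitefield']
```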
8. Adding New Features and Removing Outliers:
- We calculate the price per square foot for each property (prices in this dataset are listed in lakhs, hence the factor of 100,000):
housing_clean["price_per_sqft"] = housing_clean["price"] * 100000 / housing_clean["sqft"]
- Outliers in price_per_sqft can skew the model. We remove extreme values:
housing_clean = housing_clean[housing_clean['price_per_sqft'] < housing_clean['price_per_sqft'].quantile(0.99)]
- We also filter properties based on sqft_per_bhk to ensure reasonable ranges:
housing_clean['sqft_per_bhk'] = housing_clean['sqft'] / housing_clean['bhk']
housing_clean = housing_clean[~(housing_clean['sqft_per_bhk'] < 300)]
housing_clean = housing_clean[~(housing_clean['sqft_per_bhk'] > 1200)]
- Finally, we remove properties with extremely large total square footage:
housing_clean = housing_clean[~(housing_clean["sqft"] > 6000)]
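The quantile cutoff used above works like this on synthetic data; with values 1 to 100, the 99th percentile (Pandas' default linear interpolation) is 99.01, so only the top value is dropped:

```python
import numpy as np
import pandas as pd

prices = pd.Series(np.arange(1, 101, dtype=float))  # 1.0 .. 100.0
cutoff = prices.quantile(0.99)                      # 99.01 with linear interpolation
trimmed = prices[prices < cutoff]
print(cutoff, trimmed.max())  # 99.01 99.0
```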
- We remove BHK outliers: within each location, an n-BHK property priced (per square foot) below the mean for (n-1)-BHK properties is suspicious, since a larger flat should not be cheaper per square foot, so we drop it:
def remove_bhk_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk - 1)
            if stats and stats['count'] > 5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft < stats['mean']].index.values)
    return df.drop(exclude_indices, axis='index')
housing_clean = remove_bhk_outliers(housing_clean)
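To see the rule in action, here is a compact sketch on a made-up location: six 2-BHK flats average 5,000 per square foot, so a 3-BHK listed at 4,000 is dropped while one at 6,000 survives (the data and the simplified helper below are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "location": ["A"] * 8,
    "bhk": [2] * 6 + [3] * 2,
    "price_per_sqft": [5000, 5200, 4800, 5100, 4900, 5000, 4000, 6000],
})

def drop_cheap_big_flats(df):
    exclude = []
    for _, loc_df in df.groupby("location"):
        stats = {
            bhk: {"mean": g.price_per_sqft.mean(), "count": len(g)}
            for bhk, g in loc_df.groupby("bhk")
        }
        for bhk, g in loc_df.groupby("bhk"):
            s = stats.get(bhk - 1)  # compare against flats with one fewer bedroom
            if s and s["count"] > 5:
                exclude.extend(g[g.price_per_sqft < s["mean"]].index)
    return df.drop(index=exclude)

cleaned = drop_cheap_big_flats(df)
print(len(df), "->", len(cleaned))  # 8 -> 7
```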
Conclusion
In this blog, we loaded the Bangalore housing dataset, inspected its structure and missing values with Pandas, and cleaned it end to end: standardizing sizes and areas, filling in missing bathroom and balcony counts, consolidating rare locations, and removing price and size outliers.
Next up: In the next blog, we will dive into Exploratory Data Analysis (EDA) to visualize trends,
detect outliers, and gain deeper insights from our data. Stay tuned!