Bangalore House Price Predictions (Data Handling)

Predicting Bangalore House Prices: A Data Science Journey

Introduction

House prices can be influenced by various factors, including location, size, and amenities. This blog takes you through a step-by-step guide to building a predictive model for house prices in Bangalore, using Python. We'll preprocess the data, perform exploratory data analysis, and build a machine learning model.

Loading and Exploring the Data

We start by loading the dataset and understanding its structure:

Code:

import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

housing = pd.read_csv("Bengaluru_House_Data.csv")

housing.head()

Output: the first five rows of the dataset.

Data Overview

The first step is to inspect the dataset:

housing.info()
housing.isnull().sum()

These Pandas commands summarize the dataset: info() reports the column types and non-null counts, while isnull().sum() tallies the missing values in each column.

Data Cleaning

Handling Specific Issues

We address columns with ambiguous or inconsistent data:

1. Handling size column:

housing_clean["size"].unique()

housing_clean["bhk"] = housing_clean["size"].apply(lambda x: int(x.split(" ")[0]))

The first line lists all unique values in the "size" column of the housing_clean DataFrame.

The second line extracts the leading number from each "size" entry (e.g. "2 BHK" becomes 2) and stores it in a new column "bhk".
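To see the extraction in action, here is a sketch with a few representative size strings (the sample values are typical of this dataset):

```python
import pandas as pd

# Each entry starts with the bedroom count, e.g. "2 BHK" or "4 Bedroom"
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK"])
bhk = sizes.apply(lambda x: int(x.split(" ")[0]))
```

Splitting on the first space isolates the number regardless of whether the suffix is "BHK", "Bedroom", or "RK".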

2. Cleaning total_sqft column:

The total_sqft column contains ranges and non-numeric values. We process it to extract usable numbers:

def IsFloat(x):
    # Returns True if x can be parsed as a float
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

This function checks whether a value can be parsed as a float, returning True or False rather than converting it.
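A few quick checks show how it behaves on the kinds of strings found in total_sqft (the function is restated here so the sketch runs on its own):

```python
def IsFloat(x):
    # True when x parses as a float, False otherwise
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

# Plain numbers pass; ranges and unit-suffixed values do not
checks = [IsFloat("1200"), IsFloat("1133 - 1384"), IsFloat("34.46Sq. Meter")]
```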

def ConvertToSqFt(x, metric):
    if metric == "Acres":
        return x * 43560
    elif metric == "Cents":
        return x * 435.6
    elif metric == "Grounds":
        return x * 2400
    elif metric == "Guntha":
        return x * 1088.98
    elif metric == "Perch":
        return x * 272.25
    elif metric == "Sq. Meter":
        return x * 10.7639
    elif metric == "Sq. Yards":
        return x * 9
    else:
        return np.nan

This function converts areas expressed in other units (Acres, Cents, Grounds, and so on) into square feet, returning NaN for unrecognized units.

def ExtractTotalSqft(x):
    try:
        # Ranges like "1133 - 1384" become the midpoint of the two values
        values = x.split("-")
        return np.mean(list(map(float, values)))
    except (AttributeError, ValueError):
        if pd.isnull(x):
            return np.nan
        # Otherwise peel off a trailing unit, e.g. "34.46Sq. Meter"
        for intIndex in range(len(x) - 1, -1, -1):
            if IsFloat(x[0:intIndex]):
                return ConvertToSqFt(float(x[0:intIndex]), x[intIndex:])
        return np.nan

housing_clean["sqft"] = housing_clean["total_sqft"].apply(ExtractTotalSqft)

This function ties the two helpers together to convert "total_sqft" to a float: ranges are averaged, values with a unit suffix are converted to square feet, and anything unparseable becomes NaN.
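For instance, the range branch turns "1133 - 1384" into its midpoint:

```python
import numpy as np

# A range entry becomes the average of its endpoints;
# float() tolerates the surrounding spaces
values = "1133 - 1384".split("-")
mid = np.mean(list(map(float, values)))  # (1133 + 1384) / 2
```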

3. Handling bath column:

The bath column may contain missing or unrealistic values. To address this, we fill missing values based on the most common number of bathrooms for each bedroom count:

def FillBathrooms(bhk_groupby_bathroom, row):
    if pd.isnull(row["bath"]):
        return int(bhk_groupby_bathroom[row["bhk"]].index[0])
    else:
        return int(row["bath"])

bhk_groupby_bathroom = housing_clean.groupby("bhk")["bath"].value_counts()
housing_clean["bath"] = housing_clean.apply(lambda row: FillBathrooms(bhk_groupby_bathroom, row), axis=1)
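On a toy frame, the fill behaves like this (the data is invented for illustration). Since value_counts() sorts each group by frequency, .index[0] picks the most common bath count for that bhk:

```python
import pandas as pd

df = pd.DataFrame({
    "bhk":  [2, 2, 2, 3, 3],
    "bath": [2.0, 2.0, 3.0, None, 3.0],
})

# Frequency of each bath count within each bhk group
bhk_groupby_bathroom = df.groupby("bhk")["bath"].value_counts()

def FillBathrooms(counts, row):
    if pd.isnull(row["bath"]):
        return int(counts[row["bhk"]].index[0])  # mode for this bhk
    return int(row["bath"])

df["bath"] = df.apply(lambda row: FillBathrooms(bhk_groupby_bathroom, row), axis=1)
```

The single missing bath in the 3 BHK group is filled with 3, the most common value among 3 BHK rows.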

4. Cleaning balcony column:

The balcony column may also contain missing values. We fill these based on the most common number of balconies for each bedroom count:

def FillBalcony(bhk_groupby_balcony, row):
    if pd.isnull(row["balcony"]):
        return int(bhk_groupby_balcony[row["bhk"]].index[0])
    else:
        return int(row["balcony"])

bhk_groupby_balcony = housing_clean.groupby("bhk")["balcony"].value_counts()
housing_clean["balcony"] = housing_clean.apply(lambda row: FillBalcony(bhk_groupby_balcony, row), axis=1)

5. Dropping Unnecessary Columns:

We drop columns that are not useful for the analysis:

housing_clean.drop(["society", "size", "total_sqft"], inplace = True, axis=1)

6. Relabelling availability column:

We standardize the availability column by relabelling entries that contain a date range:

def RelabelAvailability(x):
    try:
        # Date-like entries such as "18-Dec" contain a hyphen
        if len(x.split("-")) > 1:
            return "Soon to be Vacated"
        else:
            return x
    except AttributeError:  # non-string, e.g. NaN
        return ""

housing_clean["availability"] = housing_clean["availability"].apply(RelabelAvailability)
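A quick check of the relabelling on a few representative entries (the function is restated so the sketch runs on its own):

```python
def RelabelAvailability(x):
    # Date-like entries such as "18-Dec" contain a hyphen; relabel them
    try:
        if len(x.split("-")) > 1:
            return "Soon to be Vacated"
        return x
    except AttributeError:  # non-string, e.g. NaN
        return ""

labels = [RelabelAvailability(v)
          for v in ["Ready To Move", "18-Dec", "Immediate Possession"]]
```

Only the date-range entry is relabelled; the named categories pass through unchanged.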

7. Handling location column:
To reduce noise, we consolidate locations with fewer than 10 occurrences into a single category labelled "Other":

housing_clean["location"] = housing_clean["location"].apply(lambda x: x.strip())
unique_location_count = housing_clean.groupby("location")["location"].agg("count").sort_values(ascending = False)
unique_location_count_10 = unique_location_count[unique_location_count <= 10]
housing_clean["location"] = housing_clean["location"].apply(lambda x : "Other" if x in unique_location_count_10 else x)
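On a small made-up sample, the consolidation works like this:

```python
import pandas as pd

# 12 listings in a common location, 3 in a rare one (made-up counts)
locations = pd.Series(["Whitefield"] * 12 + ["Obscure Layout"] * 3)
counts = locations.groupby(locations).agg("count")
rare = counts[counts <= 10]
# "in" on a Series checks its index, i.e. the rare location names
relabelled = locations.apply(lambda x: "Other" if x in rare else x)
```

Note that the membership test x in rare checks the Series index, which here holds the names of the low-frequency locations.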

8. Adding New Features and Removing Outliers:
  • We calculate the price per square foot for each property:
            housing_clean["price_per_sqft"] = housing_clean["price"] * 100000 / housing_clean["sqft"]

  • Outliers in price_per_sqft can skew the model. We remove extreme values:
            housing_clean = housing_clean[housing_clean['price_per_sqft'] < housing_clean['price_per_sqft'].quantile(0.99)]

  • We also filter properties based on sqft_per_bhk to ensure reasonable ranges:
            housing_clean['sqft_per_bhk'] = housing_clean['sqft'] / housing_clean['bhk']
            housing_clean = housing_clean[~(housing_clean['sqft_per_bhk'] < 300)]
            housing_clean = housing_clean[~(housing_clean['sqft_per_bhk'] > 1200)]

  • Finally, we remove properties with extremely large total square footage:
            housing_clean = housing_clean[~(housing_clean["sqft"] > 6000)]

  • We address BHK outliers: within each location, we drop flats whose price per square foot falls below the mean for flats with one fewer bedroom:
            def remove_bhk_outliers(df):
                exclude_indices = np.array([])
                for location, location_df in df.groupby('location'):
                    # First pass: per-BHK statistics for this location
                    bhk_stats = {}
                    for bhk, bhk_df in location_df.groupby('bhk'):
                        bhk_stats[bhk] = {
                            'mean': np.mean(bhk_df.price_per_sqft),
                            'std': np.std(bhk_df.price_per_sqft),
                            'count': bhk_df.shape[0]
                        }
                    # Second pass: drop flats priced below the mean of (bhk - 1)
                    for bhk, bhk_df in location_df.groupby('bhk'):
                        stats = bhk_stats.get(bhk - 1)
                        if stats and stats['count'] > 5:
                            exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft < stats['mean']].index.values)
                return df.drop(exclude_indices, axis='index')
            housing_clean = remove_bhk_outliers(housing_clean)
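The first two steps above, computing price per square foot and trimming the top 1%, can be sketched on toy numbers (prices in lakhs, as in the dataset; the values are invented):

```python
import pandas as pd

# Three toy listings; the last is absurdly expensive per square foot
df = pd.DataFrame({"price": [50.0, 80.0, 500.0],
                   "sqft":  [1000.0, 1400.0, 600.0]})
df["price_per_sqft"] = df["price"] * 100000 / df["sqft"]

# Keep only rows below the 99th percentile of price per square foot
df = df[df["price_per_sqft"] < df["price_per_sqft"].quantile(0.99)]
```

The extreme listing sits above the 99th percentile and is dropped, while the two ordinary listings survive.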

Conclusion

In this blog, we explored the dataset with Pandas, checked for missing values, cleaned inconsistent columns such as size and total_sqft, engineered new features like bhk and price_per_sqft, and removed outliers.

Next up: In the next blog, we will dive into Exploratory Data Analysis (EDA) to visualize trends, detect outliers, and gain deeper insights from our data. Stay tuned!
