Bangalore House Price Prediction (EDA and Model building)

Exploring EDA and Model Building: Unveiling Patterns in Data

In our previous blog, we laid the foundation for data handling—understanding how to clean, preprocess, and prepare raw data for analysis. Now, we take the next step in our data science journey by diving into Exploratory Data Analysis (EDA) and model building.

EDA serves as the critical bridge between raw data and meaningful insights. It helps us uncover patterns, detect anomalies, and gain a deeper understanding of the dataset through visualization and statistical analysis. With a well-explored dataset, we can then move forward to model building, where we apply machine learning algorithms to make predictions and extract valuable knowledge.

Exploratory Data Analysis

Distribution of Prices

A visualization of the target variable helps identify patterns and outliers:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(housing_clean['price'], bins=30, kde=True)
plt.title("Distribution of House Prices")
plt.xlabel("Price (in Lakhs)")
plt.ylabel("Frequency")
plt.show()

Correlation Heatmap

Exploring relationships between numerical features:

plt.figure(figsize=(6, 6))
sns.heatmap(housing_clean.drop(columns=["area_type", "availability", "location"]).corr(),
            annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
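The same correlation matrix that the heatmap renders can be inspected numerically. A minimal sketch on a toy frame (synthetic values standing in for the numeric columns of housing_clean, which is not reproduced here): sorting one column of the matrix surfaces the features most correlated with the target.

```python
import pandas as pd

# Toy numeric frame standing in for the numeric columns of housing_clean
df = pd.DataFrame({
    "total_sqft": [1000, 1200, 850, 1500, 2000],
    "bath": [2, 2, 1, 3, 4],
    "price": [55, 65, 40, 95, 150],
})

# Pairwise Pearson correlations: the matrix behind the heatmap.
corr = df.corr()

# Sorting the target's column ranks features by correlation with price.
print(corr["price"].sort_values(ascending=False))
```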

Building the Predictive Model

Splitting the Data

from sklearn.model_selection import train_test_split
X = housing_clean.drop(columns="price")
y = housing_clean['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
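A quick sanity check on the split is worthwhile: with test_size=0.2, roughly 20% of the rows land in the test set. A minimal sketch on a small synthetic frame (the real housing_clean is not reproduced here); passing random_state pins the shuffle so the split, and therefore the scores, are reproducible across runs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic frame standing in for housing_clean (illustrative only)
df = pd.DataFrame({
    "total_sqft": [1000, 1200, 850, 1500, 2000, 1100, 950, 1300, 1750, 1400],
    "bath": [2, 2, 1, 3, 4, 2, 1, 2, 3, 2],
    "price": [55, 65, 40, 95, 150, 60, 45, 72, 120, 80],
})

X = df.drop(columns="price")
y = df["price"]

# random_state fixes the shuffle, making the 80/20 split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```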

Importing the dependencies to build and evaluate the predictive model:

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error
from xgboost import XGBRegressor

Using a ColumnTransformer to one-hot encode the categorical features while passing the numerical features through unchanged:

column_trans = make_column_transformer(
    (OneHotEncoder(sparse_output=False), ["area_type", "availability", "location"]),
    remainder='passthrough')

Instantiating the scaler and the model objects:
scaler = StandardScaler()
lr = LinearRegression()
la = Lasso()
ri = Ridge()
xgb = XGBRegressor()
rfr = RandomForestRegressor()
ada = AdaBoostRegressor()
gbr = GradientBoostingRegressor()
dt = DecisionTreeRegressor()

Collecting the model objects in a dictionary so they can be iterated over:

Regressors = {
    'Linear Regression' : lr,
    'Lasso' : la,
    'Ridge': ri,
    'XGBRegressor': xgb,
    'RandomForestRegressor': rfr,
    'AdaBoostRegressor': ada,
    'Gradient Boosting Regressor': gbr,
    'Decision Tree Regressor': dt
}

Looping over the models above, fitting each one inside a pipeline and scoring it on the test set to find the best model:

for name, model in Regressors.items():
    pipe = make_pipeline(column_trans, scaler, model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print("For", name)
    print("R2 Score", r2_score(y_test, y_pred))
    print("MSE", mean_squared_error(y_test, y_pred))
    print("==================================================")

And these are the results:

Model                         R² Score   MSE
Linear Regression             0.7971     1945.74
Lasso                         0.7876     2036.91
Ridge                         0.7971     1945.91
XGBRegressor                  0.8508     1430.43
RandomForestRegressor         0.8285     1644.17
AdaBoostRegressor             0.7070     2809.11
Gradient Boosting Regressor   0.8289     1640.37
Decision Tree Regressor       0.7453     2441.92

As the XGBoost regressor has the highest R² score, that model has been selected and a final pipeline has been made:

pipe = make_pipeline(column_trans, scaler, xgb)
pipe.fit(X_train, y_train)

The pipeline is then serialized with the pickle module so it can be loaded in a web application to make predictions:

import pickle

with open("Model.pkl", "wb") as f:
    pickle.dump(pipe, f)
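The web application will later reverse this step with pickle.load. A minimal round-trip sketch of the idea, using a small synthetic pipeline rather than the real Model.pkl: serialize a fitted pipeline to bytes, restore it, and confirm it still predicts.

```python
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the fitted house-price pipeline
X = pd.DataFrame({"total_sqft": [1000.0, 1500.0, 2000.0], "bath": [2, 3, 4]})
y = pd.Series([50.0, 90.0, 140.0])

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)

# dumps/loads produce the same bytes that pickle.dump writes to Model.pkl
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

# The restored pipeline predicts identically to the original
print(restored.predict(X.iloc[[0]]))
```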

Conclusion

In this blog, we explored the process of Exploratory Data Analysis (EDA) and model building, uncovering patterns in the dataset through visualization and statistical analysis. We examined the distribution of house prices, identified relationships between numerical features using a correlation heatmap, and implemented various machine learning algorithms to build a predictive model.

After evaluating multiple regression models, the XGBoost Regressor emerged as the best-performing model, with the highest R² score of 0.85 and the lowest Mean Squared Error (MSE). This model was then incorporated into a pipeline, ensuring efficient preprocessing and prediction, and subsequently saved using pickle for deployment in a web application.

This structured approach, from data handling to model deployment, demonstrates the power of EDA and machine learning in deriving valuable insights and making accurate predictions. With the trained model now ready, it can be integrated into real-world applications to assist in data-driven decision-making.

Stay tuned for the next blog, where we will explore deploying this model in a web application for real-time predictions!
