Bangalore House Price Prediction (EDA and Model building)

Exploring EDA and Model Building: Unveiling Patterns in Data

In our previous blog, we laid the foundation for data handling—understanding how to clean, preprocess, and prepare raw data for analysis. Now, we take the next step in our data science journey by diving into Exploratory Data Analysis (EDA) and model building.

EDA serves as the critical bridge between raw data and meaningful insights. It helps us uncover patterns, detect anomalies, and gain a deeper understanding of the dataset through visualization and statistical analysis. With a well-explored dataset, we can then move forward to model building, where we apply machine learning algorithms to make predictions and extract valuable knowledge.

Exploratory Data Analysis

Distribution of Prices

A visualization of the target variable helps identify patterns and outliers:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(housing_clean['price'], bins=30, kde=True)
plt.title("Distribution of House Prices")
plt.xlabel("Price (in Lakhs)")
plt.ylabel("Frequency")
plt.show()

Correlation Heatmap

Exploring relationships between numerical features:

plt.figure(figsize=(6, 6))
sns.heatmap(housing_clean.drop(columns=["area_type", "availability", "location"]).corr(),
            annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
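The same correlation matrix that the heatmap renders can be inspected numerically. A minimal sketch on a toy frame (synthetic values standing in for the numeric columns of housing_clean, which is not reproduced here): sorting one column of the matrix surfaces the features most correlated with the target.

```python
import pandas as pd

# Toy numeric frame standing in for the numeric columns of housing_clean
df = pd.DataFrame({
    "total_sqft": [1000, 1200, 850, 1500, 2000],
    "bath": [2, 2, 1, 3, 4],
    "price": [55, 65, 40, 95, 150],
})

# Pairwise Pearson correlations: the matrix behind the heatmap.
corr = df.corr()

# Sorting the target's column ranks features by correlation with price.
print(corr["price"].sort_values(ascending=False))
```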

Building the Predictive Model

Splitting the Data

from sklearn.model_selection import train_test_split
X = housing_clean.drop(columns="price")
y = housing_clean['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
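A quick sanity check on the split is worthwhile: with test_size=0.2, roughly 20% of the rows land in the test set. A minimal sketch on a small synthetic frame (the real housing_clean is not reproduced here); passing random_state pins the shuffle so the split, and therefore the scores, are reproducible across runs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic frame standing in for housing_clean (illustrative only)
df = pd.DataFrame({
    "total_sqft": [1000, 1200, 850, 1500, 2000, 1100, 950, 1300, 1750, 1400],
    "bath": [2, 2, 1, 3, 4, 2, 1, 2, 3, 2],
    "price": [55, 65, 40, 95, 150, 60, 45, 72, 120, 80],
})

X = df.drop(columns="price")
y = df["price"]

# random_state fixes the shuffle, making the 80/20 split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```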

Importing the dependencies to build and evaluate the predictive model:

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error
from xgboost import XGBRegressor

Using a ColumnTransformer to one-hot encode the categorical features while passing the numerical features through unchanged:

column_trans = make_column_transformer(
    (OneHotEncoder(sparse_output=False), ["area_type", "availability", "location"]),
    remainder='passthrough')

Instantiating the scaler and the model objects:
scaler = StandardScaler()
lr = LinearRegression()
la = Lasso()
ri = Ridge()
xgb = XGBRegressor()
rfr = RandomForestRegressor()
ada = AdaBoostRegressor()
gbr = GradientBoostingRegressor()
dt = DecisionTreeRegressor()

Collecting the model objects in a dictionary so they can be iterated over:

Regressors = {
    'Linear Regression' : lr,
    'Lasso' : la,
    'Ridge': ri,
    'XGBRegressor': xgb,
    'RandomForestRegressor': rfr,
    'AdaBoostRegressor': ada,
    'Gradient Boosting Regressor': gbr,
    'Decision Tree Regressor': dt
}

Looping over the models above, fitting each one inside a pipeline and scoring it on the test set to find the best model:

for name, model in Regressors.items():
    pipe = make_pipeline(column_trans, scaler, model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print("For", name)
    print("R2 Score", r2_score(y_test, y_pred))
    print("MSE", mean_squared_error(y_test, y_pred))
    print("==================================================")

And these are the results:

Model                         R² Score   MSE
Linear Regression             0.7971     1945.74
Lasso                         0.7876     2036.91
Ridge                         0.7971     1945.91
XGBRegressor                  0.8508     1430.43
RandomForestRegressor         0.8285     1644.17
AdaBoostRegressor             0.7070     2809.11
Gradient Boosting Regressor   0.8289     1640.37
Decision Tree Regressor       0.7453     2441.92

As the XGBoost regressor has the highest R² score, that model has been selected and a final pipeline has been made:

pipe = make_pipeline(column_trans, scaler, xgb)
pipe.fit(X_train, y_train)

The pipeline is then serialized with the pickle module so it can be loaded in a web application to make predictions:

import pickle

with open("Model.pkl", "wb") as f:
    pickle.dump(pipe, f)
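The web application will later reverse this step with pickle.load. A minimal round-trip sketch of the idea, using a small synthetic pipeline rather than the real Model.pkl: serialize a fitted pipeline to bytes, restore it, and confirm it still predicts.

```python
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the fitted house-price pipeline
X = pd.DataFrame({"total_sqft": [1000.0, 1500.0, 2000.0], "bath": [2, 3, 4]})
y = pd.Series([50.0, 90.0, 140.0])

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)

# dumps/loads produce the same bytes that pickle.dump writes to Model.pkl
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

# The restored pipeline predicts identically to the original
print(restored.predict(X.iloc[[0]]))
```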

Conclusion

In this blog, we explored the process of Exploratory Data Analysis (EDA) and model building, uncovering patterns in the dataset through visualization and statistical analysis. We examined the distribution of house prices, identified relationships between numerical features using a correlation heatmap, and implemented various machine learning algorithms to build a predictive model.

After evaluating multiple regression models, the XGBoost Regressor emerged as the best-performing model, with the highest R² score of 0.85 and the lowest Mean Squared Error (MSE). This model was then incorporated into a pipeline, ensuring efficient preprocessing and prediction, and subsequently saved using pickle for deployment in a web application.

This structured approach, from data handling to model deployment, demonstrates the power of EDA and machine learning in deriving valuable insights and making accurate predictions. With the trained model now ready, it can be integrated into real-world applications to assist in data-driven decision-making.

Stay tuned for the next blog, where we will explore deploying this model in a web application for real-time predictions!
