Analysis: Boost The Probability of Success with Your AirBnB in Milan

Scope of The Analysis

This analysis doesn’t try to predict the success of an AirBnB in Milan. Too many factors influence it, and predicting it would require much more data, probably even the collaboration of AirBnB itself.

This analysis serves as a solid baseline to increase the probability of having success with your AirBnB in Milan.

In A Hurry?

Here is the GitHub repository with all the code used.

And this is the final result (on Tableau):

If you prefer to view it on the official Tableau website, here is the link.

Data Gathering

Data was sourced from Inside Airbnb, which provided three datasets: listings, comments, and bookings.

These datasets were combined to get the most out of them. Here’s a sneak peek at the listings dataset:

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude
0	6400	Rental unit in Milan · ★4.89 · 3 bedrooms · 1 ...	13822	Francesca	NaN	TIBALDI	45.44119
1	23986	Rental unit in Milan · ★4.64 · 1 bedroom · 1 b...	95941	Jeremy	NaN	NAVIGLI	45.44806
2	24107	Condo in Milan · ★4.50 · 1 bedroom · 6 beds · ...	46951	Valeria	NaN	CITTA' STUDI	45.47179
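
The post does not show how the three datasets were joined, so here is a minimal sketch of one way to do it with pandas. It assumes the comments and bookings tables each carry a listing_id column that matches the listing's id. The column names, the toy values, and the definition of revenue as the sum of booking prices are all assumptions made for illustration, not details taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for the three datasets; values and column names are illustrative.
listings = pd.DataFrame({"id": [6400, 23986, 24107],
                         "neighbourhood": ["TIBALDI", "NAVIGLI", "CITTA' STUDI"]})
comments = pd.DataFrame({"listing_id": [6400, 6400, 23986],
                         "comment": ["Great stay", "Very clean", "Nice area"]})
bookings = pd.DataFrame({"listing_id": [6400, 23986, 23986, 23986],
                         "price": [100.0, 80.0, 80.0, 90.0]})

# Aggregate the one-to-many tables down to one row per listing first,
# so the merge does not duplicate listings.
comment_counts = (comments.groupby("listing_id").size()
                  .rename("n_comments").reset_index())
booking_stats = (bookings.groupby("listing_id")["price"]
                 .agg(n_bookings="count", revenue="sum").reset_index())

# Left-join so listings without comments or bookings are kept.
df = (listings
      .merge(comment_counts, left_on="id", right_on="listing_id", how="left")
      .drop(columns="listing_id")
      .merge(booking_stats, left_on="id", right_on="listing_id", how="left")
      .drop(columns="listing_id")
      .fillna({"n_comments": 0, "n_bookings": 0, "revenue": 0.0}))
print(df)
```

Aggregating before merging is the design choice that matters here: joining the raw comments and bookings directly onto listings would multiply rows and inflate any sums computed afterwards.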

Data Cleaning

The most valuable column was the listing name, which packs the whole description into a single string. For instance, consider this record:

“Rental unit in Milan · ★4.89 · 3 bedrooms · 1 bed · 3.5 baths.”

I used regular expressions to extract the most crucial information. The entire cleaning and merging process is documented in the Jupyter notebook on GitHub.

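Here is the code I used to extract the star rating:

# Capture the number that follows the star symbol, e.g. "★4.89" -> "4.89"
pattern = r'★([\d\.]+)'

# Keep the first match per listing; rows without a rating become NaN
df_listing['stars'] = df_listing['name'].str.findall(pattern).str.get(0)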
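
The same idea extends to the other fields packed into the name. The sketch below is not the notebook's code: parse_listing_name and the field names in FIELD_PATTERNS are hypothetical, and the patterns only assume the "· ★4.89 · 3 bedrooms · 1 bed · 3.5 baths" layout shown in the record above.

```python
import re

# Hypothetical patterns; the field names are illustrative, not from the notebook.
FIELD_PATTERNS = {
    "stars": r"★\s*(\d+(?:\.\d+)?)",
    "bedrooms": r"(\d+)\s+bedrooms?",
    "beds": r"(\d+)\s+beds?\b",  # \b stops "bed" from matching inside "bedrooms"
    "baths": r"(\d+(?:\.\d+)?)\s+baths?",
}


def parse_listing_name(name):
    """Return the numeric fields found in a listing name; missing fields are None."""
    parsed = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, name)
        parsed[field] = float(match.group(1)) if match else None
    return parsed


example = "Rental unit in Milan · ★4.89 · 3 bedrooms · 1 bed · 3.5 baths"
print(parse_listing_name(example))
# {'stars': 4.89, 'bedrooms': 3.0, 'beds': 1.0, 'baths': 3.5}
```

To apply it to the whole column, something like df_listing['name'].apply(parse_listing_name).apply(pd.Series) should produce one column per field.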

Machine Learning

Next, I fitted a regression model to better understand what influences revenue. I used the OLS model from statsmodels, which reports more diagnostics (such as p-values and confidence intervals for each coefficient) than the linear regression in scikit-learn.

You can find the whole machine learning process on GitHub.

Here’s the code I used to build the formula and fit the model:

				
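from statsmodels.formula.api import ols

# After the ANOVA test, create df2: a dataframe that keeps only
# the columns that really influence revenue.
# (df_f and to_predict are assumed to be defined earlier in the notebook.)
num_col_to_keep = ['bed', 'price', 'minimum_nights', 'days_online']

cat_col_to_keep = ['neighbourhood', 'star_group']

col_to_keep2 = num_col_to_keep + cat_col_to_keep + [to_predict]
df2 = df_f[col_to_keep2]

# Create the model.
# The formula API adds the intercept on its own ("1"), so an extra constant
# column is not needed and would only duplicate the intercept.
formula = f'{to_predict} ~ 1'

# Numeric predictors enter the formula as they are
for col in num_col_to_keep:
    formula += f' + {col}'

# Categorical predictors are wrapped in C() so they are dummy-encoded
for col in cat_col_to_keep:
    formula += f' + C({col})'

model_f = ols(formula, df2).fit()
model_f.summary()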

Model Performance Visualization

The coefficients of a regression model reveal a lot: each one estimates how much revenue changes when that feature changes while the others stay fixed. That’s what I relied on for this project.

I used them with a 95% confidence interval to discover the key influencers on a listing’s revenue.
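
As a rough illustration of reading coefficients together with their 95% confidence intervals, here is a sketch on synthetic data. The data, the column names, and the "interval excludes zero" rule for flagging a term are assumptions made for this example; the notebook may select key influencers differently.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data, only for illustration; not the Milan listings.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "price": rng.uniform(50, 200, n),
    "minimum_nights": rng.integers(1, 7, n),
    "neighbourhood": rng.choice(["NAVIGLI", "TIBALDI"], n),
})
# Revenue depends on price only, plus noise.
df["revenue"] = 30 * df["price"] + rng.normal(0, 500, n)

model = ols("revenue ~ price + minimum_nights + C(neighbourhood)", df).fit()

# One row per term: the 95% confidence interval plus the point estimate.
coef_table = model.conf_int(alpha=0.05)
coef_table.columns = ["ci_low", "ci_high"]
coef_table["coef"] = model.params

# A term whose interval excludes zero is significant at the 5% level.
coef_table["significant"] = (coef_table["ci_low"] > 0) | (coef_table["ci_high"] < 0)
print(coef_table)
```

A table like this is also a convenient shape to export for a dashboard, since each row already holds the estimate and both interval bounds.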

Check Tableau for the model performance visualization.

Finding Insights

Well… I can’t spill all the beans here. This part is exclusively for people who work with me.

Interested? Let’s work together!