Exploratory Data Analysis - Airbnb London Dataset
- Published on
- Duration
- 3 Months
- Role
- Python Data Analyst
About London
London is the capital and largest city of England and the United Kingdom. The city stands on the River Thames in the south-east of England, at the head of its 50-mile (80 km) estuary leading to the North Sea. London has been a major settlement for two millennia; it was founded by the Romans, who called it Londinium. The City of London, London's ancient core and financial centre, an area of just 1.12 square miles (2.9 km2) colloquially known as the Square Mile, retains boundaries that closely follow its medieval limits. The adjacent City of Westminster has for centuries been the location of much of the national government. Thirty-one additional boroughs north and south of the river also comprise modern London. The London region is governed by the mayor of London and the London Assembly.
London is one of the world's most important global cities. It exerts a considerable impact upon the arts, commerce, education, entertainment, fashion, finance, healthcare, media, professional services, research and development, tourism and transportation. It is one of the largest financial centres in the world: in 2019, London had the second-highest number of ultra high-net-worth individuals in Europe, after Paris, and in 2020 it had the second-highest number of billionaires of any city in Europe, after Moscow. London's universities form the largest concentration of higher education institutes in Europe, and London is home to highly ranked institutions such as Imperial College London in natural and applied sciences, the London School of Economics in social sciences, as well as the comprehensive University College London. In 2012, London became the first city to have hosted three modern Summer Olympic Games.
London has a diverse range of people and cultures, and more than 300 languages are spoken in the region. Its estimated mid-2018 municipal population (corresponding to Greater London) was roughly 9 million, which made it the third-most populous city in Europe. London accounts for 13.4% of the U.K. population. Greater London Built-up Area is the fourth-most populous in Europe, after Istanbul, Moscow, and Paris, with 9,787,426 inhabitants at the 2011 census. The London metropolitan area is the third-most populous in Europe, after Istanbul and the Moscow Metropolitan Area, with 14,040,163 inhabitants in 2016.
London contains four World Heritage Sites: the Tower of London; Kew Gardens; the site comprising the Palace of Westminster, Westminster Abbey, and St Margaret's Church; and the historic settlement in Greenwich where the Royal Observatory, Greenwich defines the Prime Meridian (0° longitude) and Greenwich Mean Time. Other landmarks include Buckingham Palace, the London Eye, Piccadilly Circus, St Paul's Cathedral, Tower Bridge, Trafalgar Square and The Shard. London has numerous museums, galleries, libraries and sporting events. These include the British Museum, National Gallery, Natural History Museum, Tate Modern, British Library and West End theatres. The London Underground is the oldest underground railway network in the world.
London neighborhoods
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('bmh')
import warnings
warnings.filterwarnings('ignore')
Data import
from google.colab import drive
drive.mount('/content/drive')
# Walk the dataset directory on Google Drive and load every CSV into a dict of dataframes
import os
dfdict = {}
for dirname, _, filenames in os.walk('/content/drive/MyDrive/data-set/inside-airbnb-london'):
    for filename in filenames:
        if filename.endswith('.csv'):
            filepath = os.path.join(dirname, filename)
            print(filepath)
            dfdict[Path(filepath).stem] = pd.read_csv(filepath)
Let's look at the shape of the data
with pd.option_context('display.max_rows', 6, 'display.max_columns', None):  # show all df columns
    for name, df_ in dfdict.items():
        display(name)
        display(df_.shape)
        display(df_.head(6))
The main file is listings. A quick look shows 76,534 property listings in the London Airbnb dataset, with 74 columns of information: listing and picture URLs, the date the listing was scraped from Airbnb's website, the name and description of the property, a textual overview of its neighbourhood, and data about the host (id, URL, name, location, short bio, whether the host was verified by Airbnb, and statistics about their response time). Other columns hold geographical data about the property, such as latitude and longitude, its neighbourhood, the property and room type (room, flat, etc.), how many people it accommodates, how many beds, bedrooms and bathrooms it has, its amenities, and, of course, its price. The final group of columns covers statistics about availability, guest reviews, and the property rating.
2. Analyzing columns
Let's perform an initial analysis of the columns: check data types, remove irrelevant columns and columns with too many missing values, and finally set up an index for each dataframe.
2.1. Check data types
Let's see how Pandas determined the type of each column when loading the data.
for name, df_ in dfdict.items():
    display(name)
    display(df_.info())
2.2. Dropping irrelevant columns or columns with too many missing values
From the information above we can already see that some features won't be relevant to our exploratory analysis, as they have too many missing values (such as license and bathrooms). Besides, there are so many features that it is better to concentrate on the ones that can give us real insights.
# Remove columns with almost no values: listings: bathrooms, neighbourhood_group_cleansed, calendar_updated, license
dfdict['listings'].drop(columns=['bathrooms', 'neighbourhood_group_cleansed', 'calendar_updated', 'license'], inplace=True)
# Remove irrelevant columns
dfdict['listings'].drop(columns=['listing_url', 'picture_url', 'host_url', 'host_name', 'host_thumbnail_url', 'host_picture_url'], inplace=True)
dfdict['listings'].drop(columns=['minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm'], inplace=True)
2.3. Analyzing the date column on calendar and reviews dataframes
print(dfdict['calendar']['date'].head())
print(dfdict['reviews']['date'].head())
We need to convert the following columns from object to datetime64: calendar.date, reviews.date and, on listings: last_scraped, host_since, calendar_last_scraped, first_review and last_review.
dfdict['calendar']['date'] = pd.to_datetime(dfdict['calendar']['date'])
dfdict['reviews']['date'] = pd.to_datetime(dfdict['reviews']['date'])
dfdict['listings']['last_scraped'] = pd.to_datetime(dfdict['listings']['last_scraped'])
dfdict['listings']['host_since'] = pd.to_datetime(dfdict['listings']['host_since'])
dfdict['listings']['calendar_last_scraped'] = pd.to_datetime(dfdict['listings']['calendar_last_scraped'])
dfdict['listings']['first_review'] = pd.to_datetime(dfdict['listings']['first_review'])
dfdict['listings']['last_review'] = pd.to_datetime(dfdict['listings']['last_review'])
We can see that the dataset date range is [2010-08-18, 2011-10-09]
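A quick sketch to verify date bounds like these, using the columns converted above:
# Print the min/max of each converted date column on listings.
for col in ['host_since', 'first_review', 'last_review']:
    s = dfdict['listings'][col]
    print(f'{col}: [{s.min().date()}, {s.max().date()}]')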
2.4. Converting other columns
def convert_price(df_column):
    # Strip the currency symbol and thousands separator, then cast to float.
    # regex=False does a literal replace; with regex=True, '$' would be treated as an end-of-string anchor and never removed.
    return df_column.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
dfdict['listings']['price'] = convert_price(dfdict['listings']['price'])
dfdict['calendar']['price'] = convert_price(dfdict['calendar']['price'])
dfdict['calendar']['adjusted_price'] = convert_price(dfdict['calendar']['adjusted_price'])
Other fields that need to be converted:
- % => host_response_rate, host_acceptance_rate (done below);
- int => accommodates, bedrooms, beds, minimum_nights, maximum_nights, availability_30, availability_60, availability_90, availability_365, number_of_reviews, number_of_reviews_ltm, number_of_reviews_l30d, calculated_host_listings_count and its entire_homes/private_rooms/shared_rooms variants (see the sketch after this list);
- license => already dropped above, since it contains almost no data.
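The int conversions are never actually applied in this notebook; a minimal sketch using pandas' nullable Int64 dtype (which, unlike plain int64, tolerates the NaNs in bedrooms and beds) could look like this:
# Sketch (not applied below): nullable Int64 keeps missing values as <NA>.
int_cols = ['accommodates', 'bedrooms', 'beds', 'minimum_nights', 'maximum_nights']
converted = dfdict['listings'][int_cols].astype('Int64')
print(converted.dtypes)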
def convert_boolean(df_column):
    return df_column.replace({'f': 0, 't': 1}).astype('boolean')
# Convert t/f fields to boolean
# calendar dataframe : available
# listings dataframe : (host_is_superhost, host_has_profile_pic, host_identity_verified, calendar_updated, has_availability, instant_bookable)
dfdict['calendar']['available'] = convert_boolean(dfdict['calendar']['available'])
dfdict['listings']['host_is_superhost'] = convert_boolean(dfdict['listings']['host_is_superhost'])
dfdict['listings']['host_has_profile_pic'] = convert_boolean(dfdict['listings']['host_has_profile_pic'])
dfdict['listings']['host_identity_verified'] = convert_boolean(dfdict['listings']['host_identity_verified'])
dfdict['listings']['has_availability'] = convert_boolean(dfdict['listings']['has_availability'])
dfdict['listings']['instant_bookable'] = convert_boolean(dfdict['listings']['instant_bookable'])
# Convert 'host_acceptance_rate' and 'host_response_rate' to float, removing the %
dfdict['listings']['host_acceptance_rate'] = dfdict['listings']['host_acceptance_rate'].str.replace('%', '', regex=False).astype(float)
dfdict['listings']['host_response_rate'] = dfdict['listings']['host_response_rate'].str.replace('%', '', regex=False).astype(float)
# Re-check dtypes after the conversions
display(dfdict['calendar'].info())
display(dfdict['listings'].info())
We have now converted all important columns from object to their appropriate datatype, except for columns containing large texts or arrays of values. Let's take a look at the index of each dataframe.
2.5. Check current indices
for name, df_ in dfdict.items():
    display(name, df_.index, '------------------------')
2.5.1. Calendar dataframe
print(len(dfdict['calendar'].index))
print(dfdict['calendar'].groupby(['listing_id', 'date'])['available'].transform('nunique')) #count(distinct)
print(dfdict['calendar'].groupby(['listing_id', 'date'])['available'].count())
# Calendar appears to have listing_id and date as index
dfdict['calendar'].set_index(['listing_id', 'date'], inplace=True)
2.5.2. Listings and Reviews dataframes
These two dataframes have a predefined id field.
print('[Listings] number of records: ', len(dfdict['listings'].index))
print('[Listings] Unique id values: ', len(dfdict['listings']['id'].unique()))
print('[Reviews] Listings: number of records: ', len(dfdict['reviews'].index))
print('[Reviews] Unique id values: ', len(dfdict['reviews']['id'].unique()))
Observe that the dataset contains 76,534 listed properties and over 1 million reviews. There are also 27 million rows in the calendar dataframe.
Both the listings and reviews dataframes can use the 'id' column as their index, since its values are unique. Let's set id as the index for these two dataframes.
dfdict['listings'].set_index('id', inplace=True)
dfdict['reviews'].set_index('id', inplace=True)
3. Data Cleaning
The next step in the EDA process is data cleaning. It is very important to get rid of irregularities in the data after sourcing it into our system. Irregularities come in different forms:
- Missing Values
- Incorrect Format
- Incorrect Headers
- Anomalies/Outliers
3.1. Check for null values
for name, df_ in dfdict.items():
    display(name)
    display(df_.isnull().sum())  # isna() does the same thing
As we can see, on the calendar dataframe, 4 columns contain missing values. We'll simply ignore the NaN values when plotting the graphs and analyzing the data.
On the listings dataframe, we'll ignore purely textual columns (with large sentences/text), such as: name, description, neighborhood_overview, host_about, host_neighbourhood and neighbourhood.
For now, we'll ignore the reviews dataframe, since it only contains textual (reviewer's comments) data.
Let's see how to handle the other missing values:
host_response_time 43221
host_response_rate 43221
host_acceptance_rate 34703
host_is_superhost 46
bathrooms_text 181
bedrooms 4838
beds 1219
first_review 22194
last_review 22194
review_scores_rating 23937
review_scores_accuracy 23999
review_scores_cleanliness 23990
review_scores_checkin 24046
review_scores_communication 23997
review_scores_location 24045
review_scores_value 24046
reviews_per_month 22194
We can handle missing values by dropping the missing records or by imputing the values. Or we can simply choose to ignore them at this moment. That's exactly what we are going to do.
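Although we ignore them here, a minimal imputation sketch might look like this (median for the numeric review scores, most frequent value for host_response_time; not applied below):
# Sketch (not applied): median-impute review scores, mode-impute response time.
imputed = dfdict['listings'].copy()
score_cols = [c for c in imputed.columns if c.startswith('review_scores_')]
imputed[score_cols] = imputed[score_cols].fillna(imputed[score_cols].median())
imputed['host_response_time'] = imputed['host_response_time'].fillna(imputed['host_response_time'].mode()[0])
print(imputed[score_cols + ['host_response_time']].isnull().sum())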
3.2. Handling Outliers
Let's take a look at some outliers, regarding property price.
Outliers can be handled by dropping the records, imputing values, or leaving them as-is if that makes more sense.
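As a concrete example, a common rule of thumb flags values outside 1.5 × IQR of the quartiles; a sketch for price:
# Flag price outliers with the 1.5 * IQR rule.
price = dfdict['listings']['price']
q1, q3 = price.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = price[(price < q1 - 1.5 * iqr) | (price > q3 + 1.5 * iqr)]
print(f'{len(outliers)} of {price.notna().sum()} prices flagged as outliers')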
4. Feature analysis
4.1. Univariate Analysis
If we analyze data over a single variable/column from a dataset, it is known as Univariate Analysis.
4.1.1. Categorical Unordered Univariate Analysis
An unordered variable is a categorical variable that has no defined order. In our data, the neighbourhood_cleansed column is divided into many sub-categories, with no weight or measure given to any of its values. Now, let's analyze the neighbourhood_cleansed column using plots. Since it is categorical, we will use a bar plot.
# Calculate the percentage of listings in each neighbourhood.
display(dfdict['listings'].neighbourhood_cleansed.value_counts(normalize=True))
# Plot the bar graph of neighbourhood percentages.
plt.figure(figsize = (12, 6))
dfdict['listings'].neighbourhood_cleansed.value_counts(normalize=True).plot.barh()
plt.show()
From the bar plot above, we can infer that the dataset contains more Westminster listings than any other neighbourhood. Other frequent neighbourhoods are Tower Hamlets, Hackney, Kensington and Chelsea, Camden and Islington.
4.1.2. Categorical Ordered Univariate Analysis
Ordered variables are those variables that have a natural rank of order. Some examples of categorical ordered variables from our dataset are:
property_type: Entire cottage, Campsite, Shared Room, ...
Bedrooms, beds: 1, 2, 3, ...
Now, let's analyze property_type from the dataset. Since we've already seen a bar plot, let's see what a pie chart looks like.
print('bedrooms: ', dfdict['listings']['bedrooms'].unique())
print('beds: ', dfdict['listings']['beds'].unique())
print('review_scores_rating', dfdict['listings']['review_scores_rating'].unique())
What types of property do we have?
dfdict['listings']['property_type'].unique()
# calculate the percentage of each property type.
display(dfdict['listings']['property_type'].value_counts(normalize=True))
# plot the pie chart of property categories
plt.figure(figsize = (20, 12))
dfdict['listings']['property_type'].value_counts(normalize=True).plot.pie(autopct='%1.0f%%', pctdistance=1.1, labeldistance=1.2, rotatelabels=True)
plt.show()
By the above analysis, we can infer that the data set has a large number of Entire apartments, followed by Private rooms in apartment and then by Private rooms in houses and Entire houses. Also, among the properties with very small percentage, we have 'Shared room in bus', 'Shared room in hotel', 'Room in minsu', 'Shared room in tent' and 'Earth house'.
4.1.3. Numerical features
If a column or variable is numerical, then we'll analyze it by calculating its mean, median, standard deviation, etc. We can get these values using the describe function.
for name, df_ in dfdict.items():
    display(name)
    display(df_.describe())
4.1.3. a) Visualize Numerical Data Distributions - Histogram Plot of all features
Let's get all the types of our data from our dataset and take only the numerical ones.
df_num = dfdict['listings'].select_dtypes(include = ['float64', 'int64'])
df_num.head()
params = {'axes.titlesize':'8', 'xtick.labelsize':'12', 'ytick.labelsize':'12'}
plt.rcParams.update(params)
df_num.drop(columns=['scrape_id', 'host_id']).hist(figsize=(20, 20), bins=50, xlabelsize=8, ylabelsize=8);  # the trailing ; suppresses matplotlib's verbose output
Features such as beds and bedrooms seem to share a distribution similar to that of price. Regarding review_scores_rating, note that the majority of scores are concentrated above 75%.
4.1.3. b) Visualize Price Distributions - Seaborn Histogram
Let's take a look at how the property price is distributed.
print(dfdict['listings']['price'].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95, 0.98, 0.99, 0.997]))
plt.figure(figsize=(9, 8))
sns.distplot(dfdict['listings']['price'], color='g', bins=100, hist_kws={'alpha': 0.4});
With this information we can see that prices are right-skewed, with some outliers above ~1250; indeed, the 99.7th percentile corresponds to a price of 1250 GBP.
Let's regenerate the graph after removing these outliers (price > 1250 GBP).
plt.figure(figsize=(9, 8))
sns.distplot(dfdict['listings'][dfdict['listings']['price'] <= 1250]['price'], color='g', bins=100, hist_kws={'alpha': 0.4});
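Another common way to inspect a right-skewed variable (not part of the original analysis) is a log scale:
# Histogram of log-transformed prices; log1p maps 0 to 0 safely.
plt.figure(figsize=(9, 8))
sns.distplot(np.log1p(dfdict['listings']['price'].dropna()), color='g', bins=100, hist_kws={'alpha': 0.4});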
4.1.3. c) Box plot of 'price', 'review_scores_cleanliness' and 'review_scores_rating'
Property prices
First, let's take a look at the property price distribution.
Since there are some price outliers above the 1200 pounds range, we will filter them out when doing the box plots.
plt.figure(figsize=(10,8))
plt.title('Price distribution for properties that cost < GBP 1200')
sns.boxplot(y='price', x='room_type', data = dfdict['listings'][dfdict['listings']['price'] < 1200])
plt.figure(figsize=(10,8)) # 'review_scores_cleanliness', 'review_scores_rating'
plt.title('Review scores distribution for London Airbnb properties.')
sns.boxplot(y='review_scores_rating', x='room_type', data = dfdict['listings'][['room_type', 'review_scores_rating']])
We can see that the majority of the review scores is above 80, regardless of the room type.
4.1.3. d) Calculate and Visualize Correlations - Seaborn Heat Map
Let's see if some variables are linked between each other and then try to explain their relation with common sense.
corr = df_num.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr[(corr >= 0.1) | (corr <= -0.1)],
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);
Only a few features seem to be correlated with each other. For example:
- availability_30/60/90/365
- bedrooms/beds/accommodates
- review scores (rating/accuracy/cleanliness/checkin/communication/value): the correlation between these columns indicates that a host with a good overall score (guest evaluation) also tends to score well on cleanliness, communication and value, which makes sense.
The rest of the variables have very low correlation with each other. Counter-intuitively, the correlation between property price and review scores is very, very low...
4.1.4. a) Visualize Categorical Data Distributions - Histogram Plot of all features
We'll now visualize the non-numerical features.
df_not_num = dfdict['listings'].select_dtypes(include = ['O'])
print('There are {} non numerical features including:\n{}'.format(len(df_not_num.columns), df_not_num.columns.tolist()))
df_not_num.head(4)
But we need to remove columns with large text or sentences! E.g., name, description, neighborhood_overview, host_about.
# 'host_location' is a pretty large column, with the host's full address! We'll not consider it.
# 'host_neighbourhood' has way too many values, let's ignore it.
# 'host_verifications' and 'amenities' are multi-value columns.
# TODO Include 'host_acceptance_rate', 'host_response_rate'
df_not_num = df_not_num[['neighbourhood_cleansed', 'property_type', 'room_type', 'host_response_time', 'bathrooms_text']]
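As an aside, the multi-value columns we just set aside (host_verifications, amenities) are stored as stringified lists. A minimal parsing sketch, assuming every non-null amenities row is a well-formed list literal:
import ast
# Sketch: parse the stringified lists in 'amenities' into real Python lists.
amenities = dfdict['listings']['amenities'].dropna().apply(ast.literal_eval)
print('Most common amenities:')
print(amenities.explode().value_counts().head(10))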
Let's look at the distribution of the selected categorical columns.
ncols = 1
fig, axes = plt.subplots(round(len(df_not_num.columns) / ncols), ncols, figsize=(20, 40))
for i, ax in enumerate(fig.axes):
    if i < len(df_not_num.columns):
        sns.countplot(x=df_not_num.columns[i], alpha=0.7, data=df_not_num, ax=ax)
        # Rotate labels after plotting, so the tick labels actually exist.
        ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right")
fig.tight_layout()
We have already analyzed neighborhood frequency in a previous graph.
We can see that the majority of properties are of the following types: Private room in apartment, Entire apartment, Entire townhouse, Entire house, Private room in townhouse, Entire condominium and Entire serviced apartment.
When it comes to room_type, the vast majority are private rooms and entire home/apartments.
Hosts normally respond to their guests within an hour (more than 16000 hosts fall into this category). Fewer than 4000 hosts take a day or more to answer.
4.2. Bivariate Analysis
If we analyze data by taking two variables/columns into consideration from a dataset, it is known as Bivariate Analysis.
4.2.1. Numeric-Numeric Analysis
Analyzing two numeric variables from a dataset is known as numeric-numeric analysis. We can do it in three different ways.
- Scatter Plot
- Pair Plot
- Correlation Matrix
4.2.1. a) Scatter Plot
Let's take the columns 'price', 'review_scores_cleanliness' and 'review_scores_rating' from our dataset and see what we can infer from scatter plots.
# Scatter plot of review_scores_cleanliness vs property price
plt.figure(figsize=(6, 5))
plt.scatter(dfdict['listings'].review_scores_cleanliness, dfdict['listings'].price)
plt.xticks(rotation=90)
plt.title('review_scores_cleanliness x property price', fontsize=20)
plt.show()
# Scatter plot of review_scores_rating vs property price
dfdict['listings'].plot.scatter(x="review_scores_rating", y="price")
plt.title('review_scores_rating x property price', fontsize=20)
plt.show()
There is no clear correlation between property rating and price.
But in the second graph we can see that high-priced properties (> 5000 GBP) only have high review-score ratings (above 80).
4.2.1. b) Pair Plot
Now, let's plot Pair Plots for some other numerical columns. We'll use the seaborn library for plotting Pair Plots.
# Pair plot of beds, price, host acceptance rate and property review scores
# (no plt.figure needed: pairplot creates its own figure)
sns.pairplot(data=dfdict['listings'], vars=['beds', 'price', 'review_scores_cleanliness', 'review_scores_rating',
                                            'review_scores_location', 'review_scores_value', 'host_acceptance_rate'])
plt.show()
It is difficult to find a trend between 2 different variables in the above graph.
4.2.1. c) Correlation Matrix
Since we cannot use more than two variables as the x-axis and y-axis in Scatter and Pair Plots, it is difficult to see the relation between three numerical variables in a single graph. In those cases, we'll use the correlation matrix.
# Correlation matrix of beds, accommodates, price and some review scores
corr_matrix = dfdict['listings'][['beds', 'accommodates', 'price', 'review_scores_cleanliness', 'review_scores_rating',
                                  'review_scores_location', 'review_scores_value', 'host_acceptance_rate']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='Reds')
plt.show()
Now let's try to find which features are strongly correlated with price. We'll reuse our df_num dataset (created in 4.1.3.a) to do so.
df_num.corr()['price']
df_num_corr = df_num.corr()['price'][:-1]  # [:-1] drops the last entry, intended to exclude price's self-correlation (assumes price is the last numeric column)
golden_features_list = df_num_corr[abs(df_num_corr) > 0.1].sort_values(ascending=False)
print("There are {} correlated columns with respect to 'price':\n{}".format(len(golden_features_list), golden_features_list))
Correlation with price is very weak for all of these variables (|c| < 0.2).
4.2.2. Numeric - Categorical Analysis
Analyzing one numeric variable and one categorical variable from a dataset is known as numeric-categorical analysis. We analyze them mainly using the mean, the median, and box plots.
Let's take the price and neighbourhood columns from our dataset. First, check the mean value using groupby.
4.2.2. a) Comparing mean and median values of price
# groupby the listings df to find the mean of the property price according to the city neighborhood.
dfdict['listings'].groupby('neighbourhood_cleansed')['price'].mean()
There is a lot of price difference between neighbourhoods: City of London has an average price of 258 pounds, while Bexley averages 59 pounds, a difference of 199 pounds!
Let's calculate the median:
# groupby the listings df to find the median of the property price according to the city neighborhood.
dfdict['listings'].groupby('neighbourhood_cleansed')['price'].median()
The median is less sensitive to outliers. Now we have: City of London with a median price of 119 pounds and Bexley with a median of 40 pounds. The median indicates a price difference of 79 pounds, against an average price difference of 199 pounds.
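Both statistics can also be computed side by side with a single groupby, which makes the comparison easier:
# Mean and median price per neighbourhood, sorted by median.
stats = (dfdict['listings']
         .groupby('neighbourhood_cleansed')['price']
         .agg(['mean', 'median'])
         .sort_values('median', ascending=False))
print(stats.head(10))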
4.2.2. b) Plot the bar graph of neighborhood x average value of price
plt.figure(figsize = (12, 6))
dfdict['listings'].groupby('neighbourhood_cleansed')['price'].mean().plot.bar()
plt.show()
![bar plot](/projects/project-1/eda-16.png "bar-plot")
By the above graph, we can infer that the property price is, on average, higher for more central locations, like City of London and Westminster.
4.2.2. c) Boxplot of price as a function of neighbourhood
Besides the median and average values, it is important to observe the interquartile range (25%-75%) and the minimum and maximum values. The box plot gives us this information visually.
# plot the box plot of price according to neighborhood, after removing outliers where price > 300 GBP
plt.figure(figsize = (20, 12))
df_filtered = dfdict['listings'][dfdict['listings'].price <= 300]
sns.boxplot(x=df_filtered.price, y=df_filtered.neighbourhood_cleansed, orient="h", palette="Set2")
plt.show()
As we can see, the box plot paints a very different picture compared to the mean and median. City of London, for instance, has the highest minimum property price and, together with Kensington and Chelsea and Westminster, the highest prices across the IQR [25%-75%].
4.2.2. d) Time series plot of price and availability
Let's use the calendar dataframe to analyze the evolution of price and availability through time.
There are 27 million lines on the calendar dataframe. Let's see how many different dates we have.
dfdict['calendar']
The property listings cover the period from Feb 2021 to Feb 2022.
print('There are', dfdict['calendar'].reset_index()['date'].nunique(), 'days and',
dfdict['calendar'].reset_index()['listing_id'].nunique(), 'unique listings on calendar df.')
Let's derive a numeric 0/100 column from the boolean available flag, so that a daily mean reads directly as a percentage.
dfdict['calendar']['occupied'] = dfdict['calendar']['available'].astype(float) * 100.0
Average price by day
avg_daily_price = dfdict['calendar'].reset_index().groupby('date').mean()
# print(avg_daily_price)
# Plot the graph
plt.figure(figsize=(10, 5))
price_plot_by_day = avg_daily_price['price'].plot(title='Average property prices')
price_plot_by_day.set_xlabel('Date')
price_plot_by_day.set_ylabel('Property price')
Property prices are at their lowest at the start of the dataset (Mar, Apr, May) and rise until July (the summer season). From July on they remain stable; finally, prices hit a peak near Christmas and New Year's Eve.
There is also a strange valley in Feb 2022...
Average availability by day
avg_daily_occupancy = dfdict['calendar'].reset_index().groupby('date').mean()
print(avg_daily_occupancy)
# Plotting the Graph
plt.figure(figsize=(10, 5))
occupancy_plot_by_day = avg_daily_occupancy['occupied'].plot(title='Average property occupancy (%)')
occupancy_plot_by_day.set_xlabel('Date')
occupancy_plot_by_day.set_ylabel('Property Occupancy')
Average % property occupancy for 2021 (i.e., the % of properties already booked, by day) oscillates between 27% and almost 50%, depending on the month. Note: I gave up joining the listings and calendar dataframes because of OutOfMemory errors in the Kaggle Jupyter kernel :-(
# Abandoned attempt (OutOfMemory):
# df = dfdict['calendar'].join(dfdict['listings'], on='listing_id', rsuffix='_listing')
# df.head(6)
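A memory-friendlier alternative (a sketch, untested at this data scale) is to merge only the columns actually needed before aggregating, e.g. occupancy per neighbourhood:
# Sketch: merge just listing_id/date/occupied with the neighbourhood column.
cal = dfdict['calendar'].reset_index()[['listing_id', 'date', 'occupied']]
nbh = dfdict['listings'][['neighbourhood_cleansed']]
merged = cal.merge(nbh, left_on='listing_id', right_index=True, how='left')
print(merged.groupby('neighbourhood_cleansed')['occupied'].mean().sort_values(ascending=False).head(10))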
4.2.3. Categorical - Categorical Analysis
Let's see how the different categories like neighborhood, property_type, etc., are associated with each other.
4.2.3. a) Category plot of neighborhood, room type and price in dataframe listings
first_10_neighbourhoods = dfdict['listings']['neighbourhood_cleansed'].unique()[0:10]
print('first_10_neighbourhoods: ', first_10_neighbourhoods)
df_filtered = dfdict['listings'][dfdict['listings']['price'] < 400]
df_filtered = df_filtered[df_filtered['neighbourhood_cleansed'].isin(first_10_neighbourhoods)]
g = sns.catplot(y="neighbourhood_cleansed", x="price", hue="room_type", kind="bar", data=df_filtered, height=8.27, aspect=11.7/8.27)
g.set_xticklabels(rotation=90)
Impressive. Sometimes a hotel room can be more expensive than an entire home/apt! For example, this happens in Hounslow and Richmond upon Thames.
4.3. Multivariate Analysis
If we analyze data by taking more than two variables/columns into consideration, it is known as Multivariate Analysis. Let's see how 'neighbourhood_cleansed', 'bedrooms' and 'price' vary with each other. We'll create a pivot table with the three columns and then build a heatmap from it.
result = pd.pivot_table(data=dfdict['listings'], index='neighbourhood_cleansed', columns='bedrooms',values='price')
print('Pivot table:\n', result)
# Heat map of neighbourhood vs bedrooms vs average price
plt.figure(figsize = (12, 6))
ax = plt.axes()
sns.heatmap(result, annot=False, cmap='RdYlGn', center=0.117, ax=ax)
ax.set_title('Property price x neighbourhood x number of bedrooms')
plt.show()
Based on the heatmap above, we can infer that more expensive properties usually have more bedrooms. Additionally, in more expensive neighbourhoods, such as City of London, Westminster and Islington, smaller properties command high prices too.