Recently I’ve been working with the Land Registry’s price paid data set looking at shifts in prices in different areas of the market. One of the ways I’ve been segmenting the massive amounts of data into something more manageable has been to look at specific deciles, say the top 10% of the market. Deciles, and the like have been put to great use recently in the literature on income and wealth, ‘the 1%’ as a phrase we all now instantly ‘get’ being the perfect example. Unfortunately I haven’t found much information on working with datasets in this way, so I thought I would do a quick write up on how to do this in python with pandas in the hope that it would be of use to other users.
First we need to get a dataset to work with. While the price paid dataset I’ve been using is fun and relatively easy to work with, it is very, very large and unless you have enterprise levels of ram on your computer the examples given bellow will cause all-sorts of horrible memory errors. For the purpose of this example lets just start by importing the modules we will need and generating some random noise data.
import pandas as pd import numpy as np import matplotlib.pyplot as plt import datetime import matplotlib.dates as mdates
range = pd.date_range('2014-01-01', '2017-12-31', freq='15min') df = pd.DataFrame(index = range)
df['price'] = np.random.randint(low=0, high=1000000, size=len(df.index))
Now we have a nice dataset with a 140,161 price transactions all at neatly spaced 15 minute intervals with the time logged in the dataframes index.
Using resample we can now easily get working on the data as a timeseries.
mean = df['price'].resample('M', how='mean')
Here we create a series called mean, with each entry in the series representing the mean from one months worth of transactions. We can see this quite nicely plotted into a graph:
Of course your graph will look a bit different since we are using random data.
Another way to write the above is as:
mean = df['price'].resample('M', how=lambda x: x.mean())
Here, instead of using resample’s built in mean we are using a lambda, essentially a temporary function, to pass on each months worth of data to pandas’s mean function. The result, is of course, exactly the same, but this example will hopefully make some of the later steps easier to follow.
Next we want to start splitting our data into deciles. First here is an easy way to split the data by value:
df[df['price'] > 500000]
Here we are returned a full dataframe for all values over 500,000. To select between two values we can do:
df[df['price'] > 500000][df['price'] < 600000]
Now we simply need to apply decile calculations to this by using the quantile function.
df[df['price'] >= df['price'].quantile(.90)]
Here the quantile function assesses the average for the whole data frame and returns every entry that falls within the top 10% band.
Now we can combine these two techniques to allow us to calculate the decile bands each month and then do an additional calculation for each speific band in each specific month.
top_mean = df['price'].resample('M', how=lambda x: x[x >= x.quantile(.90)].mean())
Which gives us:
Just to illustrate the difference this makes, here is a graph with one of the decile band calculated incorrectly because the banding was taken for the whole data set rather than each month!
p.s. just incase anyone is interested here is the code I used to generate this graph.
top_mean_WRONG = df['price'].resample('M', how=lambda x: x.quantile(.90)) #THIS IS THE WRONG WAY TO DO THIS fig, ax = plt.subplots(figsize=(20,10)) years = mdates.YearLocator() # every year months = mdates.MonthLocator() # every month yearsFmt = mdates.DateFormatter('%Y') # format the ticks ax.xaxis.set_major_locator(years) ax.xaxis.set_major_formatter(yearsFmt) ax.xaxis.set_minor_locator(months) datemin = datetime.date(df.index.date.min().year, 1, 1) #The date axis datemax = datetime.date(df.index.date.max().year + 1, 1, 1) ax.set_xlim(datemin, datemax) plt.title('Random Numbers Plotted') plt.plot(top_mean, label='Mean with decile calculated each month') plt.plot(top_mean_WRONG, label='Mean where decile is INCORRECTLY calculated for the whole dataset') legend = ax.legend(loc=3)