Anatomy of an Analysis - Part 1

When trying to get analytical roles now, it’s important to have more than just a good CV. Anyone can say things like ‘I’ve built a segmentation’ but it is now easier than ever to actually show you understand data and what can be done with it.

There’s an ever-increasing pool of Open Data especially related to Government activity and with a bit of coding knowledge also plenty of information that can be collected/scraped from online resources.

The steps below hopefully provide a useful framework of an example of an analytical project, the aim of the analysis here was ‘find something interesting’ rather than a specific ‘how many of x do y’ but the principle still stands as to how to structure your work.

The below analysis was done the ‘old fashioned’ way using Access and Excel, Part 2 will look at using Alteryx and Tableau to enable work with a larger dataset, add a degree of location analysis and hopefully create output that is more user-friendly and dynamic.

Step 1: Get some data

Data for this analysis relates to Price Paid data for UK properties as collated by Land Registry and as such any analysis within this piece is based on data produced by Land Registry © Crown copyright 2016.

Step 2: Load the data

The data is available either in 1 big 3.5GB file covering approx. 20 years or in smaller c150MB yearly files, to keep things simple I’ve loaded 2011-2016 (to date) and collated into a single file. If data was coming from a client, get them to specify number of records sent and match that to what you receive. This doesn’t mean the data’s right but it’s still a worthwhile check.

Step 3: Explore and sense-check the data

Before doing any kind of detailed analysis, it’s always worthwhile doing some exploratory checks and to try and ‘break’ the data. For this part of the analysis at least, make the assumption that there are errors in the data and it’s your job to try and find them.

It’ll save a lot of time (and potential wrong conclusions) if you can discover any errors/omissions or anomalies now rather than when you are a long way down the analysis.

The Land Registry data benefits from a brief ‘Guide to the data’ an awful lot of times of data projects you’ll end up with a bunch of data and end up having to figure things out yourself or embark on a quest to find someone who knows someone who knows someone who might know something about the data. 

Documentation isn’t the most thrilling part of the process but it’ll help others who come to the project afterwards and quite possibly your future self when you have to revisit a project after a few months away from it.

First thing to check when dealing with data that has a time dimension is do you appear to have all of it? This will also start to highlight other areas such as seasonality which could play a part in any analysis.

Registered sales by day gives a relatively even (if growing) pattern per year along with a major spike at the end of March 2016. A bit of research suggests this may be due to the addition of a 3% extra stamp duty on 2nd homes from April 2016 which we’ll look to come back to later.

The other area of note is that although this is data to end of August 2016, the last few days of that month appear to have lower than usual figures suggesting that there’s a level of delay sometimes between registering and appearing in the data. 

An easier way to see this and other possible trends is to chart the data by week rather than day:

Here the pre-Christmas spike and Christmas lull can be seen each year along with the big impact in the last week of March 2016 before the extra 3% for 2nd homes charge kicks in.

Step 4: Test a Hypothesis

As shown by the spike in late March 2016, property activity can be heavily influenced by the tax setup at the time.

Stamp Duty has now gone from the extreme ‘slab’ system which related to the whole purchase price where selling for £1 more could put you in a new banding costing you thousands extra, to a system more like income tax where you have a portion tax free, and then pay a certain % on each subsequent band the chart below shows the difference in Stamp Duty costs by Sales Price pre and post the last major change in the Stamp Duty rules on 3rd Dec 2014:

From the chart above, the difference between the gradual gradient under the new system compared to the old system with sudden steps in Stamp Duty means the removal of an artificial ceiling around certain price points.

The hypothesis to test therefore is has the change in Stamp Duty banding influenced sale prices?

To try and quantify how that’s changed I’ve looked at volumes of sales either side of each of the old thresholds under both systems. Figures below band price paid up to the nearest thousand, so for example everything from £124,001 to £125,000 would all be classified as £125,000.

There’s a couple of interesting things to note in the chart above: One is that the change in distribution doesn’t kick in immediately after the changes in Stamp Duty Rates on 3rd Dec 2014. This is because house purchasing takes a number of months from a property being placed on market to completion, so a lot of properties sold since the new rate came in were priced according to the conventions of the old system.

The other thing that stands out is that rather than being evenly distributed, sales prices tend to peak at £5,000 bands (e.g., at £120k, £125k and £130k) in the example above. 

This is another key thing to bear in mind with any analysis: Although you are looking at numbers, these figures represent the behaviour of real people with their own potentially irrational behaviour. For a property on sale in the £120-£130k bracket, a couple of thousand pounds is quite a big deal but properties generally tend to sell in £5k increments, if a property under the new Stamp Duty banding was ‘worth’ £128k, you’d probably see the seller try and stretch this and try to get £130k and the seller think that £125k was a ‘fair’ price.

There’s probably plenty of scope for both sides if they understand the psychology of pricing where buyers look to find properties that have not quite done enough to justify jumping to the next £5k and similarly for sellers where maybe £2k spend gives enough added value to jump up that £5k step.

We see even more extreme activity at the old £250k and £500k thresholds too:

As under the old system all sales above £250k would have Stamp Duty costs at 3 times that of a sale at £250k (effectively £2.5k vs £7.5k), it’s no surprise that this created an artificial ceiling in property prices around this level.

What appears to have happened under the new levels is that there’s been a sudden jump towards £255k, presumably in part due to the psychological £5k bandings and also a release of those properties where even if £255k were the real value of the property buyers would under the old system be put off paying an extra £5k in stamp duty.

e.g., Under the old system you could buy a property worth £250k and then spend £5k on improving it to be worth £255k (I appreciate that’s a very simplistic view of how property values change), if they’d bought a property at £255k they would have a property worth that but also an extra £5k Stamp Duty bill so would naturally be put off (not least because Stamp Duty would normally be have to be paid in full up front where to buy a property for an extra £5k might cost £500 up front if buying with a 90% Mortgage/ 10% Deposit.

For £500k we see a similar picture:

Moving from £500k to £500,001 under the old system would result in an increase in Stamp Duty of £10k for no real actual extra house for your money so no surprise that sales under the old system just above £500k were negligible but it is a bit surprising that even a year after the introduction of the new duty bands, sales from £500,001 to £504,000 are negligible.

The other main change in Stamp Duty that was mentioned earlier in the analysis was the 3% Stamp Duty surcharge on Second Homes. Although there is no data with the Price Paid data set as to whether a particular sale was subject to this charge, we can look to see if there were any notable changes in behaviour shorty pre/post this cut off similar to those for the changes in December 2014.

The most obvious proxy for second homes that we have available within the dataset will be the sale of flats as a greater proportion of these will be used for 2nd properties/buy-to-let than more expensive detached houses.

The chart below shows the proportion of house sales by property type by month and there’s a noticeable spike in Flat sales in March 2016 with a subsequent dip from April 2016 onwards:

Proportion of sales from Flats goes from roughly 20% to over 26% in the month before the new Stamp Duty threshold came in.

Obviously not all Flat sales will be to 2nd owners so the true impact of this rate on the buy-to-let market will be even bigger than that seen here.

As mentioned at the start of the post, this was done using relatively clunky tools (Access and Excel), they do a job, are cheap and most people will be familiar with them. To get to the next level though, there are better tools available (some of which are actually free) which we will come to in Part 2.