Q&A with Andy Kriebel from The Information Lab

Who are you and what do you do?

Andy Kriebel - Head Coach at The Information Lab. I run The Data School, a program that creates data analytics gurus.

What does a typical day involve at the Data School?

At the Data School, a typical day is 1/2 training and 1/2 project work. Half of the projects in their 16-week training come from me and the other half are projects we do for our clients for free. The client projects give them exposure to and experience of working with clients, and teach them to work fast while having to learn an entirely new industry each week.

The Data School's approach to onboarding new analysts is relatively unusual (heavily training-focused in the early months, a rare approach discussed in more detail in this post by The Information Lab's founder Tom Brown) - are you finding that some of the clients you work with are starting to take the same approach?

I don't know of any of our clients that are now hiring this way, but they should! As some of the Data Schoolers start rolling out of the program, it's been interesting to see that where they went to school has become less important, because employers see their experience in The Data School as way more valuable.

I've often described the Tableau community as 'Evangelical', what is it about the product/community or the Data Viz discipline that's created that?

Wow! What a great question! When I started using Tableau and learning about data viz, I needed help along the way and guidance in my learning. Everyone I interacted with was so friendly. That led me to want to give back to the community in the same way. I imagine many people get excited about Tableau for these same reasons that I did.

Following on from the 'Evangelical' theme, your own involvement in all things Data Viz is 'committed' to say the least (Andy’s volume of data viz production is huge from his own site, Makeover Monday, Dear Data Two and elsewhere), what keeps that enthusiasm going?

Now this is an EASY question! The Community keeps me motivated. People intentionally seek me out to thank me. People appreciate it when I give them feedback. Really, it's people being nice to one another that drives me to continue to learn myself and to continue to want to help others.

How much do you think Tableau's change in pricing (moving from a perpetual to a subscription-style model) will impact things?

I'm not involved in sales, so it's hard for me to say. From my understanding, it will help companies roll out faster and at their own pace. That can only be good for adoption.

Although a lot of what you do is transferable to other tools, do you ever worry about over-committing to a single tool like Tableau, is there ever a worry you've 'backed the wrong horse'?

Never! Tableau is THE BEST data analysis tool available. I use Tableau every day as a hobbyist. You never hear of people using other tools as hobbyists the way they talk about Tableau.

Although SAS/SQL are still popular, R (and Python) experience is becoming an increasingly required skill. Do you see the future belonging more to these languages, or will 'front end' tools like Alteryx remove the need for the deeper level of coding understanding?

Before I started using Alteryx, I was learning R and Python while working at Facebook. Once I used Alteryx, I was disappointed that I had wasted so much time learning when I could have been more productive. On the other hand, I appreciate that I have a better understanding of what R and Python do from having used them, but I'm all for using tools that make the job easier and more intuitive and make me and my trainees more productive.

Self-Service is a term often bandied about within analytics. Personally, I think it can be dangerous to give people who may not fully understand the data too much control over what they can produce - where do you see the balance between execs being able to create their own views and the risk of them going off on an invalid tangent?

I'm not afraid of this. I believe every person is responsible for their own content. If they don't understand the numbers and can't explain them, then they shouldn't use them. Pretty plain and simple. Better yet, if you're unsure what something means, just ask. An exec shouldn't be too proud or egotistical to ask questions.

The audience for your work may have a range of analytical knowledge/degrees of comfort with numbers - how do you tailor your output to ensure you get the message across?

I've taught thousands of people through the years and one thing that has always worked is breaking things down to their simplest form, then building up based on the knowledge of the group you're training. Each group I train is different; you have to be confident as a trainer that you can adapt to each unique set of people you are training, yet challenge them just beyond what they are learning.

Any recommended sites/books/podcasts to help someone be a better analyst? 

As a matter of fact, I wrote a blog post a year ago listing out 12 books I think any data analyst should read. http://www.vizwiz.com/2016/05/12-books-every-great-data-analyst.html

Soccermatics: Q&A with Professor David Sumpter

Ahead of our 'Football and Data' event on Thursday 9th March, we spoke with David Sumpter, Professor of Applied Mathematics at Uppsala University, Sweden, about the launch of the paperback version of Soccermatics and his background in data and analysis.

In the day job what does a typical day involve?

Until recently it involved running a research group of about 10 PhD students and other researchers interested in modelling the social behaviour of animals and humans. I think I have the most brilliant job in the world. We look at so many interesting problems, such as how fish escape predators, how ants build trail networks, how heavy metal fans build mosh pits and how audiences applaud, and many more (see www.collective-behavior.com). I work with a great bunch of people. They come into my office, discuss problems with me, then go out and work hard on finding answers.

I say ‘until recently’ because more and more of the interesting problems have become football problems. Now I have the same type of interactions with football app developers, analysts and journalists. It’s very much the same type of highly interactive work.

I also have teaching, which I enjoy. I’ll be teaching probability from next week. There will be lots of football examples in the course!

For any data analysis, what kind of tools are you using?

I use Matlab mostly. It’s what I learnt when I was a student and it has stuck. It’s easy because I know it. But I also use R and Python if I use data from an API, and it’s these two which I would recommend for people starting out in this area. Matlab is proprietary software and goes against the spirit of things.

But to do data analysis properly, you do need to use a programming language like these, with good data analysis tools and good graphical displays. 

For online visualisations d3js is a great thing. But I haven’t mastered it yet. 

Did you always know you'd end up in academia (and are there any differences working in Sweden)?

No. I knew I would go to University but I had no idea why, other than my parents told me I should go and I was good at school. Then I got there and I thought, “this is great”. And wanted to stay. When it is at its best, a career in academia is hard to beat. You get time and space to develop your ideas. 

I moved to Sweden mainly because I fell in love with my wife. It is a good place to raise a family. I could go on parental leave for long periods with both my children. So the big difference in Sweden is the work/life balance. It’s a bit more healthy here.

How did you first get in to analysing data in football?

It was through my son’s interest in the game that I first started thinking about it in mathematical terms. I started coaching his team and thinking about how to talk to them about movement and positioning. Maths is a big part of this.

The key idea behind Soccermatics is that maths can be used in all sorts of different ways: not just in football but in modelling different systems. I had this idea very early and told a literary agent, Chris Wellbelove, about it. He was like, “Yeah, that’s interesting, but there’s too many animals and stuff like that in what you’ve written… but that bit you wrote about football, can’t we have more of that!”. And that’s what happened. He was right. There was loads of maths in football. The original version of the book was 3/5ths football and 2/5ths other systems (ant trails, fish schools, human crowds, sexual networks etc.).

The Pro Edition includes even more football. The more I’ve worked with this, the more interesting things I’ve found to analyse.

What was the reaction to the Hardback version of the book?

What has been most amazing for me personally is when someone writes to me and says they have read the book and been inspired in some way. 

I have had feedback from all different types of people: 18-year-olds who say they’ve read the book and are going to university to study maths; science nerds who tell me that they now realise that football is actually an interesting sport; senior academics who ring me on my work phone to comment on various sections and give constructive criticism; parents who have bought it for their football-mad sons or daughters; scouts and analysts at clubs who want to find ways they can use Soccermatics in their work; amateur analysts with technical questions on Twitter…

The list just goes on and on. 

The book has sold well, and my publishers are happy. But the feedback is really what I enjoy most. Interacting with people. It can be difficult to keep up, but I try to reply to every comment or query I get.

There's quite a bit of talk about the role of analysts at clubs (e.g., The Secret Analyst). Do you think someone wanting to work within Performance Analysis/Analytics would be better off doing a non-sports-related course and then moving into it after graduating?

Yes. I think if you have the analytical skills, take courses in maths, economics, stats, computing etc., while taking coaching badges in your spare time. This gives you better options for two reasons. Firstly, the quantitative skills are harder (although not impossible) to pick up later. Secondly, when you have finished studying maths or a similar subject, you’ll have skills that lead into a wide range of jobs, not just in football. Maths can be used everywhere.

The audience for your work may have a range of analytical knowledge/degrees of comfort with numbers - how do you tailor your output to ensure you get the message across?

I make a lot of effort to tailor what I write to the audience.

I have worked with quite a few newspapers and football magazines. Journalists are under pressure to have an angle: Ranieri’s sacking, Zlatan’s shooting, best player in the world and so on. When I am asked to help with these, what I try to do is add a bit of data and understanding to the story, but it is difficult to go very deep. It’s something I accept. We have to remember that football is a form of entertainment; it’s not an excuse to give a maths lesson or a lecture in tactics. So I always tailor for what I am asked for.

But I also try to go into more depth in other forums. That was part of the reason for the ‘Pro Edition’: I had a chance to go into more depth on what real analysts do. I have also tried to do the same in my online series on Medium/Nordic Bet, where I look at the ideas in more depth - Nordic Bet asked me to write something “as geek as you like”. It’s the same with Bloomsbury, the book publisher. I can write what I want to, and that is lots of fun and why I write books.

Any recommended sites/books/podcasts to help someone be a better analyst?

Statsbomb is obviously the best site if you want proper, detailed analysis based on data. They are in a league of their own. Sites run by Statsbomb contributors also tend to be good. If you find someone on Statsbomb then go in and look at their personal site. You’ll find lots of good stuff. 

There are lots of good tactics blogs too. I was lucky enough to have an extensive conversation during the writing with Reni Maric about the geometry of the game. Until I talked to him, I thought I knew a lot about it, but I soon realised how much planning is required over the whole field. Start with his articles. They are brilliant.

Anatomy of an Analysis - Part 2

In part 1 we looked at some basic ways of manipulating the Land Registry sales data to create some pieces of analysis.

One of the issues with using a more basic tool like Access is that it soon starts to get a bit flaky once the database size gets above 1GB. With the full Land Registry dataset coming in at 3.5GB, you obviously need something else to process it in a single go (you could arguably take the separate yearly files, process each one in its own database and then join the results by year into a single results database if you were that way inclined, but that's obviously a fair amount of work).

In this post I'm going to look at a combination of Alteryx and Tableau to create a process and a set of outputs.

Alteryx is an excellent (if not cheap) tool that can create process flows covering your data import, processing, analysis and outputs. For the Land Registry data it was a case of importing the file and then building up the processing steps as a workflow.

One feature that works well in Alteryx is the ability to run a process on a sample of the data. In this case, rather than having to go through all the 21m+ records in the file, you can run your process on a sample and test that the flow works as you'd expect before running it on the whole file.
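If you were prototyping the same flow in code rather than Alteryx, the equivalent sampling trick in Python/pandas might look something like the sketch below. The file name and column names are assumptions for illustration (the Price Paid CSV ships without a header row):

```python
import pandas as pd

# Column names are assumptions for illustration; the Price Paid CSV has no header row.
COLS = ["transaction_id", "price", "date_of_transfer", "postcode", "property_type",
        "old_new", "duration", "paon", "saon", "street", "locality", "town",
        "district", "county", "ppd_category", "record_status"]

# Prototype the workflow on a small sample first...
sample = pd.read_csv("pp-complete.csv", names=COLS, nrows=100_000)

# ...then, once the logic is proven, rerun the same steps against the full ~21m rows.
# full = pd.read_csv("pp-complete.csv", names=COLS)
```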

Instead of code like you might see when using something like SQL or Base SAS, you have a graphical interface more similar to that in SAS Enterprise Guide, although the layout is much more user-friendly in Alteryx, with modules/nodes relating to specific tasks being easier to find.

 

Alteryx has a modular structure, i.e. load file, then add new fields, then aggregate, etc.


One of the advantages of Alteryx over standard code is it can be easier for someone else to take over (although I know plenty of people who are still loyal to Base SAS and if they can't go through something line by line then they don't want to know).

The workflow above takes in the 21m sales records, creates some new fields (e.g., Postcode area, sales price to nearest thousand pounds etc) and then creates a number of aggregate tables (the main one being median sales price by year by postcode area).

This table is then matched to itself to get the previous year's value (if there's an easier function for this, let me know) to enable us to create a % change figure per year per postcode area, which is then output for use in Tableau.
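For anyone working in code rather than Alteryx, a rough pandas sketch of that workflow might look like the following. Column names are assumed, and a grouped shift() stands in for the self-join to pick up the previous year's value:

```python
import pandas as pd

# 'sales' is assumed to be the full Price Paid table, with price, date_of_transfer
# and postcode columns (names are assumptions for illustration).
sales["year"] = pd.to_datetime(sales["date_of_transfer"]).dt.year
sales["postcode_area"] = sales["postcode"].str.extract(r"^([A-Z]{1,2})", expand=False)
sales["price_k"] = (sales["price"] / 1000).round() * 1000  # price to nearest £1,000

# Median sale price by year by postcode area
median_price = (sales.groupby(["postcode_area", "year"], as_index=False)["price"]
                     .median()
                     .sort_values(["postcode_area", "year"]))

# Previous year's value via a grouped shift, then the % change per year per area
median_price["prev_price"] = median_price.groupby("postcode_area")["price"].shift(1)
median_price["pct_change"] = (median_price["price"] / median_price["prev_price"] - 1) * 100

median_price.to_csv("median_price_by_area_year.csv", index=False)  # ready for Tableau
```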

There's plenty you can do within Tableau in terms of calculated fields. Sometimes it'll be best to do the work in processing (in this case Alteryx), although sometimes you'll want more dynamic calculations, which would need to be done in Tableau or whatever presentation-layer tool is being used.

The Tableau side of things is relatively straightforward: after adding in the Geographic pack provided by The Information Lab, it's a fairly easy task to map % change by Postcode Area with each year being a separate page, which gives the video below:

This isn't meant to be some in-depth study of property prices, just a quick way of showing that it's fairly straightforward to get from data to insight. In later parts we'll be looking at other tools and other public sources of data.

Anatomy of an Analysis - Part 1

When trying to get analytical roles now, it’s important to have more than just a good CV. Anyone can say things like ‘I’ve built a segmentation’ but it is now easier than ever to actually show you understand data and what can be done with it.

There’s an ever-increasing pool of Open Data especially related to Government activity and with a bit of coding knowledge also plenty of information that can be collected/scraped from online resources.

The steps below hopefully provide a useful framework for an example analytical project. The aim of the analysis here was 'find something interesting' rather than a specific 'how many of x do y', but the principle of how to structure your work still stands.

The analysis below was done the 'old-fashioned' way using Access and Excel. Part 2 will look at using Alteryx and Tableau to enable work with a larger dataset, add a degree of location analysis and hopefully create output that is more user-friendly and dynamic.

Step 1: Get some data

Data for this analysis relates to Price Paid data for UK properties as collated by Land Registry and as such any analysis within this piece is based on data produced by Land Registry © Crown copyright 2016.

Step 2: Load the data

The data is available either in one big 3.5GB file covering approx. 20 years or in smaller c.150MB yearly files. To keep things simple I've loaded 2011-2016 (to date) and collated them into a single file. If the data were coming from a client, get them to specify the number of records sent and match that to what you receive. This doesn't mean the data's right, but it's still a worthwhile check.
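As a sketch of what that load-and-reconcile step could look like in Python/pandas (the file names are hypothetical, and the Price Paid files have no header row):

```python
import glob
import pandas as pd

# Hypothetical yearly file names, e.g. pp-2011.csv ... pp-2016.csv
files = sorted(glob.glob("pp-201[1-6].csv"))
yearly = {f: pd.read_csv(f, header=None) for f in files}

# Reconciliation: records per file plus the collated total, to compare against
# whatever counts the supplier has quoted.
for name, df in yearly.items():
    print(name, len(df))

combined = pd.concat(yearly.values(), ignore_index=True)
print("Total records loaded:", len(combined))
combined.to_csv("pp-2011-2016.csv", index=False, header=False)
```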

Step 3: Explore and sense-check the data

Before doing any kind of detailed analysis, it's always worthwhile doing some exploratory checks and trying to 'break' the data. For this part of the analysis at least, make the assumption that there are errors in the data and it's your job to try and find them.

It’ll save a lot of time (and potential wrong conclusions) if you can discover any errors/omissions or anomalies now rather than when you are a long way down the analysis.

The Land Registry data benefits from a brief 'Guide to the data'. An awful lot of the time on data projects you'll end up with a bunch of data and have to figure things out yourself, or embark on a quest to find someone who knows someone who knows someone who might know something about the data.

Documentation isn’t the most thrilling part of the process but it’ll help others who come to the project afterwards and quite possibly your future self when you have to revisit a project after a few months away from it.

The first thing to check when dealing with data that has a time dimension is: do you appear to have all of it? This will also start to highlight other areas, such as seasonality, which could play a part in any analysis.

Registered sales by day give a relatively even (if growing) pattern per year, along with a major spike at the end of March 2016. A bit of research suggests this may be due to the addition of an extra 3% stamp duty on second homes from April 2016, which we'll come back to later.

The other area of note is that although this is data to the end of August 2016, the last few days of that month appear to have lower than usual figures, suggesting that there's sometimes a delay between a sale being registered and it appearing in the data.

An easier way to see this and other possible trends is to chart the data by week rather than day:

Here the pre-Christmas spike and Christmas lull can be seen each year, along with the big impact in the last week of March 2016 before the extra 3% charge for second homes kicks in.
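A minimal sketch of that daily/weekly roll-up in pandas, assuming the collated data is sitting in a DataFrame with a date_of_transfer column (the name is an assumption):

```python
import pandas as pd

sales["date_of_transfer"] = pd.to_datetime(sales["date_of_transfer"])

# Registered sales per day and per week
daily = sales.set_index("date_of_transfer").resample("D").size()
weekly = sales.set_index("date_of_transfer").resample("W").size()

# Plotting the weekly series makes the Christmas lull and the March 2016 spike
# much easier to spot than the noisier daily counts (needs matplotlib installed).
weekly.plot(title="Registered sales per week")
```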

Step 4: Test a Hypothesis

As shown by the spike in late March 2016, property activity can be heavily influenced by the tax setup at the time.

Stamp Duty has now gone from the extreme 'slab' system, which related to the whole purchase price and where selling for £1 more could put you in a new banding costing you thousands extra, to a system more like income tax where you have a portion tax free and then pay a certain % on each subsequent band. The chart below shows the difference in Stamp Duty costs by sales price pre and post the last major change in the Stamp Duty rules on 3rd December 2014:

As the chart above shows, the gradual gradient under the new system, compared with the sudden steps in Stamp Duty under the old system, means the removal of an artificial ceiling around certain price points.
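To make the two calculations concrete, here is a small sketch of the 'slab' versus marginal approach in Python. The thresholds and rates are quoted from memory for illustration, so check them against the official HMRC tables before relying on them:

```python
def sdlt_old(price):
    """Pre-Dec-2014 'slab' system: a single rate applied to the whole purchase price."""
    if price <= 125_000:
        rate = 0.00
    elif price <= 250_000:
        rate = 0.01
    elif price <= 500_000:
        rate = 0.03
    elif price <= 1_000_000:
        rate = 0.04
    elif price <= 2_000_000:
        rate = 0.05
    else:
        rate = 0.07
    return price * rate


def sdlt_new(price):
    """Post-Dec-2014 system: each slice of the price is taxed at its own rate."""
    bands = [(125_000, 0.00), (250_000, 0.02), (925_000, 0.05),
             (1_500_000, 0.10), (float("inf"), 0.12)]
    duty, lower = 0.0, 0.0
    for upper, rate in bands:
        duty += max(0.0, min(price, upper) - lower) * rate
        lower = upper
    return duty


# The step at £250k under the old system versus the smooth transition under the new:
print(sdlt_old(250_000), sdlt_old(250_001))  # 2500.0 vs ~7500.0
print(sdlt_new(250_000), sdlt_new(250_001))  # 2500.0 vs ~2500.05
```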

The hypothesis to test, therefore, is: has the change in Stamp Duty banding influenced sale prices?

To try and quantify how that’s changed I’ve looked at volumes of sales either side of each of the old thresholds under both systems. Figures below band price paid up to the nearest thousand, so for example everything from £124,001 to £125,000 would all be classified as £125,000.

There’s a couple of interesting things to note in the chart above: One is that the change in distribution doesn’t kick in immediately after the changes in Stamp Duty Rates on 3rd Dec 2014. This is because house purchasing takes a number of months from a property being placed on market to completion, so a lot of properties sold since the new rate came in were priced according to the conventions of the old system.

The other thing that stands out is that rather than being evenly distributed, sales prices tend to peak at £5,000 bands (e.g., at £120k, £125k and £130k in the example above).

This is another key thing to bear in mind with any analysis: although you are looking at numbers, these figures represent real people with their own, potentially irrational, behaviour. For a property on sale in the £120-£130k bracket, a couple of thousand pounds is quite a big deal, but properties generally tend to sell in £5k increments. If a property under the new Stamp Duty banding was 'worth' £128k, you'd probably see the seller try to stretch this and ask for £130k, and the buyer think that £125k was a 'fair' price.

There’s probably plenty of scope for both sides if they understand the psychology of pricing where buyers look to find properties that have not quite done enough to justify jumping to the next £5k and similarly for sellers where maybe £2k spend gives enough added value to jump up that £5k step.

We see even more extreme activity at the old £250k and £500k thresholds too:

As under the old system all sales above £250k would have Stamp Duty costs at 3 times that of a sale at £250k (effectively £2.5k vs £7.5k), it’s no surprise that this created an artificial ceiling in property prices around this level.

What appears to have happened under the new levels is that there's been a sudden jump towards £255k, presumably in part due to the psychological £5k bandings, and also a release of those properties where, even if £255k were the real value, buyers under the old system would have been put off by paying an extra £5k in stamp duty.

For example, under the old system you could buy a property worth £250k and then spend £5k improving it to be worth £255k (I appreciate that's a very simplistic view of how property values change). If you'd instead bought a property at £255k, you would have a property worth that amount but also an extra £5k Stamp Duty bill, so would naturally be put off - not least because Stamp Duty would normally have to be paid in full up front, whereas buying a property for an extra £5k might only cost £500 up front if buying with a 90% mortgage / 10% deposit.

For £500k we see a similar picture:

Moving from £500k to £500,001 under the old system would result in an increase in Stamp Duty of around £5k for no real extra house for your money, so it's no surprise that sales under the old system just above £500k were negligible. It is a bit surprising, though, that even a year after the introduction of the new duty bands, sales from £500,001 to £504,000 are still negligible.

The other main change in Stamp Duty mentioned earlier in the analysis was the 3% Stamp Duty surcharge on second homes. Although there is nothing in the Price Paid data set to say whether a particular sale was subject to this charge, we can look to see if there were any notable changes in behaviour shortly pre/post this cut-off, similar to those seen around the changes in December 2014.

The most obvious proxy for second homes available within the dataset is the sale of flats, as a greater proportion of these will be used as second properties/buy-to-lets than, say, more expensive detached houses.

The chart below shows the proportion of house sales by property type by month, and there's a noticeable spike in flat sales in March 2016 with a subsequent dip from April 2016 onwards:

The proportion of sales from flats goes from roughly 20% to over 26% in the month before the new Stamp Duty threshold came in.

Obviously not all flat sales will be to second-home owners, so the true impact of this rate on the buy-to-let market will be even bigger than that seen here.
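As a quick sketch, the proportion-by-property-type view above can be produced with a single crosstab in pandas (column names assumed; in the Price Paid data the property type is, I believe, a one-letter code with F for flats):

```python
import pandas as pd

sales["month"] = pd.to_datetime(sales["date_of_transfer"]).dt.to_period("M")

# Share of each month's sales accounted for by each property type
by_type = pd.crosstab(sales["month"], sales["property_type"], normalize="index")

print(by_type["F"])  # proportion of monthly sales that were flats
```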

As mentioned at the start of the post, this was done using relatively clunky tools (Access and Excel); they do a job, are cheap, and most people will be familiar with them. To get to the next level, though, there are better tools available (some of which are actually free), which we will come to in Part 2.