Making data science accessible – Machine Learning – Tree Methods

What are Tree Methods?

Tree methods are commonly used in data science to understand patterns within data and to build predictive models. The term Tree Methods covers a variety of techniques with different levels of complexity, but my aim is to highlight three I find useful. To set the problem up, let's assume we have a census dataset containing age, education, employment status and so on. Given all this information, we want to see if we can predict whether a person earns more than $50k per year. How can tree methods help us?

 

Decision Trees

A simple decision tree is the easiest approach to understand. The model tries to find the variable that best separates high earners from low earners, and the optimal point at which to make this split. In the example below the model finds that age is the most important splitter when predicting income > $50k, and so age forms the first branch of the tree (people in the data who were less than 35 had a lower likelihood of earning >$50k per year).

You can continue to split your data, making the tree deeper and deeper, but there is a key trade-off here. The deeper the tree, the better it explains the training data, but also the more over-fit it becomes. This means that whilst it does a great job of explaining the exact data it is trained upon, it may do a worse job at predicting new data. To manage this trade-off we often build the model on a portion of a dataset and test its predictive power on the remaining sample.
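As a minimal sketch of this trade-off, the snippet below fits decision trees of increasing depth using scikit-learn. The census-style data here is entirely synthetic (the columns and coefficients are invented for illustration), but the pattern it shows is the real one: deeper trees score ever better on the data they were trained on while the held-back test sample tells a different story.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical census-style data: age, years of education, hours worked per week.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(18, 70, n),   # age
    rng.integers(8, 21, n),    # years of education
    rng.integers(10, 60, n),   # hours worked per week
])
# Invented rule: income > $50k is more likely for older, more educated,
# longer-hours individuals, plus random noise.
p = 1 / (1 + np.exp(-(0.06 * (X[:, 0] - 35)
                      + 0.3 * (X[:, 1] - 13)
                      + 0.02 * (X[:, 2] - 40))))
y = (rng.random(n) < p).astype(int)

# Hold back a test sample to measure the depth / over-fit trade-off.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 5, 20):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, round(tree.score(X_train, y_train), 2), round(tree.score(X_test, y_test), 2))
```

On data like this the deep tree's training accuracy approaches 1.0 while its test accuracy lags well behind, which is the over-fit the text warns about.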

Simple decision trees are useful for basic models or for understanding data at a high level. The major downside is that the larger the tree becomes, the more your sample is sliced and diced, leaving very small sample sizes at the lowest level. This restricts the technique's ability to build more complex models. To go further, a new approach is required that combines multiple trees (called an ensemble method). At Capital One we use both Random Forests and Gradient Boosted Machines to do this.

 

Random Forests

The Random Forests technique builds multiple trees, each on a random sample of rows drawn from the data with replacement (a process known as Bagging). Where Random Forests deviates from plain Bagging is that the input variables are also randomly sampled, not just the rows. This sampling and tree building process happens many times, and then the predictions from all the trees built are combined (often through a simple average) to give a final prediction.

This approach overcomes the issue of running out of sample, as the trees built are generally smaller than the optimal single tree. The re-sampling of rows with replacement helps the model capture variation in the data and guards against over-fit. One potential downside of Bagging is that if there are a few very strongly predictive inputs they may dominate every tree, leading to highly correlated trees with little variation between their predictions. The sampling of input variables helps overcome this risk, as it leads to more varied trees.
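The two sampling ideas above map directly onto scikit-learn's `RandomForestClassifier` parameters: `bootstrap=True` re-samples rows with replacement, and `max_features` controls the random sampling of input variables at each split. The data below is synthetic and deliberately includes one dominant predictor, the situation feature sub-sampling is meant to guard against; this is a sketch, not a recipe for real census data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 6))
# One strongly predictive column plus weaker signals, mimicking the
# "dominant variable" case that feature sub-sampling helps with.
y = ((2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.5 * X[:, 2]
      + rng.normal(scale=0.5, size=n)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees in the ensemble
    bootstrap=True,        # re-sample rows with replacement (Bagging)
    max_features="sqrt",   # randomly sample input variables at each split
    random_state=1,
).fit(X_train, y_train)

print(round(forest.score(X_test, y_test), 2))
```

The final prediction averages the votes of all 200 trees, which is the pooling step described in the text.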

 

Gradient Boosted Machines (GBMs)

GBMs differ from Random Forests in that, rather than building a series of trees independently and pooling the predictions, each tree builds on the previous tree's predictions. A basic first tree is built and scored out on the sample to create both the predictions and the residuals from the tree (actual outcome minus the prediction). The second phase is then to build another tree to predict these residuals. This process continues, building trees on the residuals from the prior tree, until an optimal ensemble of trees is created.
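The residual-fitting loop above can be sketched in a few lines. This is a toy illustration of the boosting idea on invented one-dimensional data, not a production GBM: each small tree is fitted to the residuals (actual minus prediction) of the running ensemble, and its shrunken predictions are added to the total.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noisy sine wave stands in for the outcome we want to predict.
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

prediction = np.zeros(500)
learning_rate = 0.3
for _ in range(50):
    residuals = y - prediction                       # actual outcome minus prediction
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)   # each tree builds on the last

print(round(np.mean((y - prediction) ** 2), 4))      # training error shrinks round by round
```

After 50 rounds the remaining error is close to the noise floor of the data, showing how successive small trees correct the mistakes of their predecessors.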

As with the other tree methods, it is important not to over-fit to the data. There are lots of options for running the process, including: restricting the size of each tree, penalizing more complex trees, controlling the influence of a single tree and sampling approaches for each tree build. When building a GBM it is important to know what settings have been used in order to understand the output. If these settings are well understood, GBM models can often be some of the most predictive (for example, they are often seen in winning Kaggle entries).
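Those options correspond to named parameters on scikit-learn's `GradientBoostingClassifier`, as a minimal sketch on synthetic data (the dataset and parameter values here are illustrative assumptions, not recommendations):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 5))
# Invented outcome with an interaction term, the kind of structure GBMs pick up well.
y = ((X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.3, size=n)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

gbm = GradientBoostingClassifier(
    max_depth=3,         # restrict the size of each tree
    learning_rate=0.1,   # control the influence of a single tree
    subsample=0.8,       # sample rows for each tree build
    n_estimators=200,    # number of boosting rounds
    random_state=3,
).fit(X_train, y_train)

print(round(gbm.score(X_test, y_test), 2))
```

Recording these settings alongside the model is what makes the output interpretable later, as the text advises.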

 

Examples where we’ve used Tree Methods at Capital One:

 

Model building

Basic decision trees are really useful for building quick, simple models that can be easily understood and implemented. In some cases the in-market business impact of building a quick tree and deploying it in a short period of time can be far greater than that of the more time- and resource-intensive build and deployment of a complex model. Knowing the trade-offs between techniques, and having a really clear understanding of the business need, helps the UK Data Science team to build the right type of model.

 

Data exploration

Trees are very useful at the outset of a modelling project to try and understand the relationships within your dataset. We use GBMs to understand the influential variables in a model upfront. They are also useful for quickly reducing the field of potential splitters, focusing in on the data that matters. In addition, a quick unconstrained GBM model can act as a useful benchmark, allowing the Data Scientist to gauge how predictive their final model may be.
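One way to surface those influential variables is the fitted model's `feature_importances_` attribute. The sketch below uses invented data in which only the first two columns carry any signal, so the importance ranking should pick them out from the noise columns:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
n = 1500
X = rng.normal(size=(n, 8))
# Only the first two columns carry signal; the other six are pure noise.
y = ((1.5 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)) > 0).astype(int)

gbm = GradientBoostingClassifier(n_estimators=100, random_state=4).fit(X, y)

# Rank variables by importance, most influential first.
ranked = np.argsort(gbm.feature_importances_)[::-1]
print(ranked[:2])  # the influential variables surface at the top
```

Running an unconstrained model like this early on gives both a shortlist of variables worth keeping and a rough ceiling on achievable predictive power.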

 

When would I use Tree Methods?

As with any technique related to data science, Tree Methods are one of many approaches you could take to solve a business problem using large amounts of data. The key is being able to pick and choose when to take Tree Methods off the shelf. At a high level: Tree Methods may help you with a prediction problem, subject to the warning around potential over-fit.

 

Credits: Sarah Pollicott, Carola Deppe, Sarah Johnston, Kevin Chisholm
