Fantasy Football with Scrapy and scikit-learn (Part 1)

The code for this project is available here: https://github.com/athompson1991/football. With this being my debut blog post, I’ve decided to break the analysis into two parts, as there are many aspects to explain and the article was getting to be a bit lengthy.

Intro

Football season has kicked off (pun intended), and likewise so has fantasy football. I distinctly recall the last time I had an FF draft and how I essentially went in blind, with predictably mediocre results. This season, I wanted to Moneyball it using some machine learning techniques I acquired through classes at UW. Nothing particularly fancy, but enough to provide a semblance of judgement in picking my team.

As often happens with coding projects like this, I began with some scripts to get the data, followed by some scripts to do basic analysis, but the whole thing quickly metastasized into a larger endeavor. Ultimately the scraping work was consolidated into a module containing Scrapy spiders and pipelines/items/settings; the analysis section morphed into a more sophisticated object oriented approach to using scikit-learn; and the ultimate desire of a player ranking for the 2018 season was wrapped into an easily executable script.

Problem Scope

Before jumping into the actual project, I want to explain exactly how I approached this problem. There are important aspects of the Fantasy rules, the scraping, and the analysis that need to be considered.

My league is on Yahoo Fantasy Sports, and I had the following image as my guide on what the team would look like and how points would be gained:

A quick glance at this and a general understanding of football suggest how to launch the analysis: things like offensive fumble return TDs or two-point conversions can be ignored (very rare), and I can focus on yardage and touchdowns for passing/receiving/rushing (I also decided to predict total receptions for wide receivers).

Also note the makeup of positions on the roster. While I could attempt to predict kickers, tight ends, and defense, I decided to simplify my analysis and focus exclusively on predicting quarterback, wide receiver, and running back performance.

So, in summary, there will be seven response variables to predict (passing/receiving/rushing for TD and yardage, plus total receptions). That leaves the question of what the features will be. To answer this, we take a look at the Pro Football Reference website. Just as an example, take a look at the passing stats:

Many of these stats can be features, such as completion percentage, touchdown counts, and interception counts. Without looking at the stats for other positions (running back and wide receiver), we can kind of guess what will be useful: rushing yards, reception percentages, touchdown counts, etc. Anything numeric, really.

The catch is that we want to use last year’s stats to predict this year’s stats. This involves a minor trick in Python, but it is important to keep in mind. Obviously there is greater correlation between touchdowns and yardage within a given year than between last year’s touchdowns and this year’s yardage.

Scraping

To get the data, I use the Python package Scrapy. Though this is not a tutorial specifically on how to use Scrapy, I will demonstrate the basic approach I took to go from scraping only the passing data, to using a more generalized means of scraping all player data.

Passing

Whenever I do a scraping project, I have to learn the inner workings of the HTML on the target website. For this, I simply looked up the passing stats for a given season, used the Scrapy shell to ping the website, then figured out exactly which cells/td elements to extract and how to do so.

As it turns out, the main table for the passing page is conveniently accessed using the CSS selector table#passing. You can see this by using the inspector in Chrome/Firefox:
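For instance, a quick session in the Scrapy shell (using one season’s passing page as an example, output omitted) might look something like this:

$ scrapy shell "https://www.pro-football-reference.com/years/2018/passing.htm"
>>> table = response.css("table#passing")             # the main stats table
>>> rows = table.css("tr")                            # header row plus one row per player
>>> rows[1].css("td::attr(data-stat)").extract()[:5]  # stat names carried on each cell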

Furthermore, all the data in the table is in td elements (table data) nested in tr elements (table row). For my purposes, this means that my Scrapy spider will have to zero in on a row, then parse each data element cell by cell in that row. Instead of explaining all the minutiae of how to do this, here is my first iteration of the spider to crawl the passing pages:

import scrapy
import bs4
from ..items import PassingItem

FOOTBALL_REFERENCE_URL = 'https://www.pro-football-reference.com'

class PassingSpider(scrapy.Spider):

    name = 'passing'
    allowed_domains = ['pro-football-reference.com']

    def __init__(self):
        super().__init__()
        self.years = list(range(1990, 2019))
        self.urls = [FOOTBALL_REFERENCE_URL + "/years/" + str(year) + "/passing.htm" for year in self.years]

    def parse_row(self, row):
        # Each row is a Scrapy selector; hand its raw HTML to BeautifulSoup
        soup = bs4.BeautifulSoup(row.extract(), 'html.parser')
        tds = soup.find_all('td')
        if len(tds) > 0:
            # The first cell links to the player's page; the filename is the
            # player code plus '.htm'
            link = tds[0].find('a', href=True)['href']
            player_code = link.split('/')[-1]
            player_code = player_code[:-4]  # drop the '.htm' extension
            # Every cell carries its stat name in a data-stat attribute, so a
            # dictionary comprehension pulls the whole row at once
            stats = {td["data-stat"]: td.text for td in tds}
            stats['player_code'] = player_code
            stats['player'] = stats['player'].replace('*', '')  # strip the '*'/'+' markers
            stats['player'] = stats['player'].replace('+', '')  # from the player name
            stats['pos'] = stats['pos'].upper()
            return stats
        else:
            return {}

    def parse(self, response):
        page = response.url
        self.log(page)
        passing = response.css("table#passing")
        passing_rows = passing.css('tr')
        for row in passing_rows[1:]:  # skip the header row
            parsed_row = self.parse_row(row)
            if len(parsed_row) != 0:
                parsed_row['season'] = page.split('/')[-2]  # season comes from the URL
                yield PassingItem(parsed_row)

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

The important lesson here is that there is a primary parse method as well as a parse_row helper method. A couple of quick things to note:

  • The years pulled are 1990 through 2018 (range(1990, 2019) excludes the endpoint)
  • BeautifulSoup is used to parse the HTML
  • Messy characters are removed
  • The neat trick to get all the data is the dictionary comprehension in parse_row, which uses each cell’s data-stat attribute to pull every statistic in one fell swoop

I will not get into the Scrapy item or pipeline situation, but that is all available on my GitHub, and the Scrapy documentation is available for reference.
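For a rough idea of the shape, though, here is a minimal sketch of what an item like PassingItem could look like (the real item in the repo declares a Field for every data-stat column on the passing table, plus the extras added in parse_row):

import scrapy

class PassingItem(scrapy.Item):
    # Sketch only: a handful of the fields parse_row populates
    player = scrapy.Field()
    player_code = scrapy.Field()
    season = scrapy.Field()
    pos = scrapy.Field()
    pass_yds = scrapy.Field()
    pass_td = scrapy.Field()
    pass_int = scrapy.Field()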

Generalizing for All Player Stats

Once I had written the passing spider, I moved on to get rushing statistics. However, I found myself writing an essentially identical spider. There were only two differences: the last part of the URL was “rushing” instead of “passing”, and the CSS selector was table#rushing instead of table#passing. This seemed like something that could easily be addressed and would save me a headache when I moved on to receiving as well.

My solution was inheritance. I wrapped the bulk of the code into a parent class PlayerSpider, then had the detailed particulars of each target page cooked into the inherited classes: PassingSpider, ReceivingSpider, RushingSpider, etc.

I wrote an inherited spider for (almost) all the pages listed under “Player Stats”

Without cluttering everything up with a giant Python class, the idea was to take the superclass and base everything around the subclass name, like so:

class PlayerSpider(scrapy.Spider):

    allowed_domains = ['pro-football-reference.com']

    def __init__(self, name):
        super().__init__()
        self.years = list(YEARS)
        self.urls = [FOOTBALL_REFERENCE_URL + "/years/" + str(year) + "/" + name + ".htm" for year in self.years]

...

    def parse(self, response, target, item_class):
        self.log("parsing row...")
        page = response.url
        table = response.css("table#" + target)
        table_rows = table.css('tr')
        for row in table_rows[1:]:
            parsed_row = self.parse_row(row)
            if len(parsed_row) != 0:
                parsed_row['season'] = page.split('/')[-2]
                yield item_class(parsed_row)

class PassingSpider(PlayerSpider):
    name = 'passing'

    def __init__(self):
        super().__init__(PassingSpider.name)

    def parse(self, response):
        return super().parse(response, target=PassingSpider.name, item_class=PassingItem)

The trick is in using the __init__ methods as a way to establish which page we are looking at, as well as what the table will be named (in other words, exactly the problem described above regarding passing versus rushing). The parsing methods on the parent class need to be modified slightly to account for more string manipulation issues, and the Scrapy item needs to be modified as well to adjust to different column headers, but otherwise the process is very similar for every statistic type (receiving, rushing, passing).
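As a sketch of how little a sibling spider then needs (assuming a RushingItem is defined alongside PassingItem), the rushing version is essentially just a name:

class RushingSpider(PlayerSpider):
    name = 'rushing'

    def __init__(self):
        super().__init__(RushingSpider.name)

    def parse(self, response):
        # Same flow as passing: the table id and the item class are the only differences
        return super().parse(response, target=RushingSpider.name, item_class=RushingItem)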

With some quick, additional Scrapy items and a CSV pipeline that spells out the columns to expect and where to save the data, I can easily pull all data of interest from 1990 to 2019: passing, rushing, receiving, defense, kicking, and fantasy.
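With the pipeline configured, each spider runs with Scrapy’s standard crawl command (spider names here follow the pattern above; the exact set in the repo may differ slightly):

scrapy crawl passing
scrapy crawl rushing
scrapy crawl receiving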

With the “database” successfully established, we can now move on to the actual data analysis.


Exploratory Data Analysis

In the exploratory data analysis, I sought to do two things: manipulate the data into something that could be fed into a regression model, and get a cursory understanding of exactly what kinds of relationships could be valuable.

Manipulating the Data

The idea is to use the current season’s data to predict the next season’s performance. This is done at the individual player level, and each type of performance (rushing, passing, receiving) is analyzed separately.

To accomplish this, the underlying data has to be joined with itself – think of it as a SQL join where the column you are joining on is itself plus/minus one. The Python code is defined in the function make_main_df:

import pandas as pd

def make_main_df(filename):
    raw_data = pd.read_csv(filename)
    # Self-join: shift the season forward one year so that each row pairs a
    # player's current season with his previous season
    prev_season = raw_data.copy()
    prev_season['lookup'] = prev_season['season'] + 1
    main = pd.merge(
        raw_data,
        prev_season,
        left_on=['player_code', 'season'],
        right_on=['player_code', 'lookup'],
        suffixes=('_now', '_prev')
    )
    return main
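As a toy illustration of what this self-join produces (the player code and yardage numbers below are made up purely to show the shape of the result):

import pandas as pd

raw_data = pd.DataFrame({
    'player_code': ['ExamPl00', 'ExamPl00'],
    'season': [2017, 2018],
    'pass_yds': [4000, 4200],
})
prev_season = raw_data.copy()
prev_season['lookup'] = prev_season['season'] + 1

# The 2018 row pairs with the 2017 row: '_now' columns come from raw_data,
# '_prev' columns from the shifted copy
main = pd.merge(raw_data, prev_season,
                left_on=['player_code', 'season'],
                right_on=['player_code', 'lookup'],
                suffixes=('_now', '_prev'))
print(main[['player_code', 'season_now', 'pass_yds_now', 'pass_yds_prev']])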

Notice that the season field is joined on itself plus one and that the player_code is also part of the merge. The columns are renamed with suffixes that are self-explanatory. If we use the Analyzer class I wrote for this project (more on that in the sequel to this blog post), we can see what kinds of columns this gives us for our analysis of passing data.

from football.analysis.analyzer import Analyzer

analyzer = Analyzer("script/analysis_config.json")
analyzer.set_analysis("passing")
analyzer.create_main_df()

analyzer.main.columns
 Index(['season_now', 'player_now', 'player_code', 'team_now', 'age_now',
        'pos_now', 'g_now', 'gs_now', 'qb_rec_now', 'pass_cmp_now',
        'pass_att_now', 'pass_cmp_perc_now', 'pass_yds_now', 'pass_td_now',
        'pass_td_perc_now', 'pass_int_now', 'pass_int_perc_now',
        'pass_long_now', 'pass_yds_per_att_now', 'pass_adj_yds_per_att_now',
        'pass_yds_per_cmp_now', 'pass_yds_per_g_now', 'pass_rating_now',
        'qbr_now', 'pass_sacked_now', 'pass_sacked_yds_now',
        'pass_net_yds_per_att_now', 'pass_adj_net_yds_per_att_now',
        'pass_sacked_perc_now', 'comebacks_now', 'gwd_now', 'season_prev',
        'player_prev', 'team_prev', 'age_prev', 'pos_prev', 'g_prev', 'gs_prev',
        'qb_rec_prev', 'pass_cmp_prev', 'pass_att_prev', 'pass_cmp_perc_prev',
        'pass_yds_prev', 'pass_td_prev', 'pass_td_perc_prev', 'pass_int_prev',
        'pass_int_perc_prev', 'pass_long_prev', 'pass_yds_per_att_prev',
        'pass_adj_yds_per_att_prev', 'pass_yds_per_cmp_prev',
        'pass_yds_per_g_prev', 'pass_rating_prev', 'qbr_prev',
        'pass_sacked_prev', 'pass_sacked_yds_prev', 'pass_net_yds_per_att_prev',
        'pass_adj_net_yds_per_att_prev', 'pass_sacked_perc_prev',
        'comebacks_prev', 'gwd_prev', 'lookup'],
       dtype='object')
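The naming convention makes it easy to split features from targets later on. A quick sketch of what that split could look like:

# "_prev" columns are candidate features; "_now" columns hold the responses
feature_cols = [c for c in analyzer.main.columns if c.endswith('_prev')]
target_cols = ['pass_yds_now', 'pass_td_now']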

Analyzing the Data

One thing that might be interesting to look at is the relationship between touchdowns and yardage. Is there any predictive power there?

Here is a plot of touchdown count versus yardage, per quarterback (each dot is a quarterback, but any given quarterback could have many seasons plotted):

There is clearly a strong relationship between these two variables (because obviously there would be). But do touchdowns this season provide any help in predicting next season‘s yardage? Here is that picture:

The relationship becomes much noisier. However, there does seem to be a nice, upward trend once the low touchdown/yardage observations are removed. If we filter results to “cleaner” quarterback observations, we get this plot:

This doesn’t look terrible! How does the current touchdown count versus next season’s touchdown count look?

Again, this is a filtered dataset and there does seem to be a degree of correlation present. If we plot the correlation matrix as a heatmap, we can get a better idea of exactly which of the variables available have predictive power and which do not.
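A minimal sketch of how these views could be generated from the merged passing frame (assuming matplotlib and seaborn are installed; the repo’s analysis code may do this differently, and filtering is left out):

import matplotlib.pyplot as plt
import seaborn as sns

# Last season's touchdowns against this season's yardage, one point per
# quarterback-season (assumes the stat columns have been cast to numeric)
ax = analyzer.main.plot.scatter(x='pass_td_prev', y='pass_yds_now')
ax.set_xlabel('Passing TD (previous season)')
ax.set_ylabel('Passing yards (current season)')

# Correlation heatmap across all numeric "now"/"prev" columns
corr = analyzer.main.select_dtypes('number').corr()
sns.heatmap(corr)
plt.show()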

There is a clear break between the “now” data and the “previous” data, as expected. However, the large block in the middle of the triangle is of interest – this is what we will use to develop our models, and at first glance there does seem to be correlation.

I am going to leave the EDA at that; I could repeat this exercise for all the other positions, but at the end of the day I am simply trying to use whatever is available to make an informed decision. The important conclusion from this analysis is that each position has quite a few numerical fields to work with and that, at first glance, there is reason to suspect something of value can be developed.

Conclusion

That’s all for now. I will explain the inner workings of my preprocessing, regressions, hyperparameter tuning, and predictions in the next post. I hope you found this overview of the web scraping and exploratory data analysis work useful and interesting!
