April 4, 2018

Scraping the NFL Salary Cap Data with Python and R

The NFL Data

[SporTrac](http://www.sportrac.com] has a wonderful array of financial data on sports. A student going to work for the Seattle Seahawks wanted the NFL salary cap data and I also found data on the English Premier League there. Now I have a source to scrape the data from.

With a source in hand, the key tool is the SelectorGadget. SelectorGadget is a browser add-in for Chrome that allows us to select text and identify the css or xpath selector to scrape the data. With that, it becomes easy to identify what we need. I navgiated to the website in base_url and found the team names had links to the cap data. I will use those links first. Let’s build that.

library(rvest)
base_url <- "http://www.spotrac.com/nfl/"
read.base <- read_html(base_url)
team.URL <- read.base %>% html_nodes(".team-name") %>% html_attr('href')
team.URL
##  [1] "https://www.spotrac.com/nfl/arizona-cardinals/cap/"       
##  [2] "https://www.spotrac.com/nfl/atlanta-falcons/cap/"         
##  [3] "https://www.spotrac.com/nfl/baltimore-ravens/cap/"        
##  [4] "https://www.spotrac.com/nfl/buffalo-bills/cap/"           
##  [5] "https://www.spotrac.com/nfl/carolina-panthers/cap/"       
##  [6] "https://www.spotrac.com/nfl/chicago-bears/cap/"           
##  [7] "https://www.spotrac.com/nfl/cincinnati-bengals/cap/"      
##  [8] "https://www.spotrac.com/nfl/cleveland-browns/cap/"        
##  [9] "https://www.spotrac.com/nfl/dallas-cowboys/cap/"          
## [10] "https://www.spotrac.com/nfl/denver-broncos/cap/"          
## [11] "https://www.spotrac.com/nfl/detroit-lions/cap/"           
## [12] "https://www.spotrac.com/nfl/green-bay-packers/cap/"       
## [13] "https://www.spotrac.com/nfl/houston-texans/cap/"          
## [14] "https://www.spotrac.com/nfl/indianapolis-colts/cap/"      
## [15] "https://www.spotrac.com/nfl/jacksonville-jaguars/cap/"    
## [16] "https://www.spotrac.com/nfl/kansas-city-chiefs/cap/"      
## [17] "https://www.spotrac.com/nfl/las-vegas-raiders/cap/"       
## [18] "https://www.spotrac.com/nfl/los-angeles-chargers/cap/"    
## [19] "https://www.spotrac.com/nfl/los-angeles-rams/cap/"        
## [20] "https://www.spotrac.com/nfl/miami-dolphins/cap/"          
## [21] "https://www.spotrac.com/nfl/minnesota-vikings/cap/"       
## [22] "https://www.spotrac.com/nfl/new-england-patriots/cap/"    
## [23] "https://www.spotrac.com/nfl/new-orleans-saints/cap/"      
## [24] "https://www.spotrac.com/nfl/new-york-giants/cap/"         
## [25] "https://www.spotrac.com/nfl/new-york-jets/cap/"           
## [26] "https://www.spotrac.com/nfl/philadelphia-eagles/cap/"     
## [27] "https://www.spotrac.com/nfl/pittsburgh-steelers/cap/"     
## [28] "https://www.spotrac.com/nfl/san-francisco-49ers/cap/"     
## [29] "https://www.spotrac.com/nfl/seattle-seahawks/cap/"        
## [30] "https://www.spotrac.com/nfl/tampa-bay-buccaneers/cap/"    
## [31] "https://www.spotrac.com/nfl/tennessee-titans/cap/"        
## [32] "https://www.spotrac.com/nfl/washington-football-team/cap/"
# Clean up the URLs to get the team names by themselves.
team.names <- gsub("/cap/","", gsub(base_url, "", team.URL))

With 32 team links to scrape, I can now scrape them.

Grabbing Team Tables

Now I need to explore those links. When I started this, I wanted to learn some Python along the way. I found the following Python code that worked for the task. For completeness, I leave it here.

import pandas as pd
import bs4
import re
import urllib2
#import requests

from urllib import urlopen
from bs4 import BeautifulSoup
base_url = "http://www.spotrac.com/nfl/"

def get_page(url):
    page = urlopen(base_url)
    soup = BeautifulSoup(page, 'lxml')
    file = open("spotrac_urls.txt", 'w')
    file.write(str(soup))
    file.close()

def get_team_table(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')

get_page(base_url)

with open("spotrac_urls.txt", 'r') as file:
    for line in file:
        line = line.strip()

from bs4 import BeautifulSoup
page = open("spotrac_urls.txt", 'r')
soup = BeautifulSoup(page, "lxml")
div = soup.find("div","subnav-posts")

links = div.find_all('a')
for link in links:
    print(link.get('href'))

len(links)

def get_team_table(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    data_rows = [row for row in soup.find("table", "datatable").find_all("tr")]
    return data_rows

# create an empty list
team_data = []

for link in links:
    team_data.append(get_team_table(link.get('href')))

len(team_data)

#data_rows = [row for row in soup.find("td", "center").find_all("tr")]
table_data = []

#soup = BeautifulSoup(team_data[0], 'lxml')

#This needs to be a nested for loop because inner items of the list are BeautifulSoup Elements
for row in team_data:
    for element in row:
        #print(type(element))
        if soup.find_all("td", attrs={"class":" right xs-hide "}) is not None:
            table_data.append(element.get_text())

player_data = []
for row in table_data:
    player_data.append(row.split("\n"))
    #print(player_data)

len(player_data)

import pandas as pd
df = pd.DataFrame(player_data)
df = df.drop(14, 1)
df = df.drop(0, 1)
df = df.drop(1, 1)


df = df.drop(df.index[[0]])
df.set_index(1, inplace=True)
print(df.shape)
df.head()

players = []
for row in team_data[0]:
    if row.get_text("tr") is not None:
        players.append(row) 

column_headers = [col.get_text() for col in players[0].find_all("th") if col.get_text()]
len(column_headers)

df.columns = column_headers
df.head()

#The header repeated itself in the data.  This didn't reveal itself until the data type conversion step below
#but this fixes all occurrences of it.
rows_to_be_dropped = df.loc[df['Cap Hit'] == 'Cap %'].index
rows_to_be_kept = df.loc[df['Cap Hit'] != 'Cap %'].index
totlen = len(df)
df2a = df.drop(rows_to_be_kept)
df = df.drop(rows_to_be_dropped)
df2 = pd.Series(data=rows_to_be_dropped)

#Apply a regex to convert the 'Cap Hit' column from a string to a float.  
# df['Cap Hit'] =(df['Cap Hit'].replace('[\$,)]', "", regex=True).replace( '[(]','-',   regex=True ).astype(float))

# My fix
df['Cap.Hit'] = (df['Cap Hit'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).replace('\s\s.*', "", regex=True).astype(float))
df['Base.Salary'] = (df['Base Salary'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Signing.Bonus'] = (df['Signing Bonus'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Roster.Bonus'] = (df['Roster Bonus'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Option.Bonus'] = (df['Option Bonus'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Workout.Bonus'] = (df['Workout Bonus'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Restruc.Bonus'] = (df['Restruc. Bonus'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Misc'] = (df['Misc.'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Dead.Cap'] = (df['Dead Cap'].replace('\(',"",regex=True).replace('\)',"",regex=True).replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))

#Sanity check to make sure it worked.
df['Cap Hit'].sum()

There are a few wonderful things about Beautiful Soup. But it is still faster and easier for me to do things in R. Here is some R code for the same task. The R table is a messy data structure but is easier to get.

library(stringr)
data.creator <- function(link) {
read_html(link) %>% html_nodes("table") %>% html_table(header=TRUE, fill=TRUE) -> res
names(res) <- c("Active","Draft","Inactive","Team1","Team2")
return(res)
  }
team.names <- gsub("-", " ", team.names)
simpleCap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(toupper(substring(s, 1,1)), substring(s, 2),
      sep="", collapse=" ")
}
team.names <- sapply(team.names, simpleCap)
NFL.scrape <- sapply(team.URL, function(x) {data.creator(x)})
names(NFL.scrape) <- team.names 
# This is a hack but it works
Actives <- lapply(NFL.scrape, function(x){x$Active})
rep.res <- sapply(seq(1,32), function(x) {dim(Actives[[x]])[[1]]})
clean.me.2 <- function(data) {
data %>% str_remove_all("[\\t]") %>% str_split("\\n\\n\\n") %>% unlist() %>% matrix(data=., ncol=2, byrow=TRUE) -> dat
return(dat)
}
clean.me <- function(data) {
str_remove_all(data, "[\\t]")
}
clean.me.num <- function(data) {
str_remove_all(data, "[\\$,()\\-]")
  }
Players <- lapply(Actives, function(x){ clean.me.2(x[,1])})
Last.Name <- unlist(lapply(Players, function(x) {x[,1]}))
Player.Name <- unlist(lapply(Players, function(x) {x[,2]}))
Team <- rep(names(Actives),rep.res)
Position <- unlist(lapply(Actives, function(x){ clean.me(x[,2])}))
Base.Salary <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Base Salary'])}))
Base.Salary <- as.numeric(Base.Salary)
Signing.Bonus <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Signing Bonus'])}))
Signing.Bonus <- as.numeric(Signing.Bonus)
Roster.Bonus <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Roster Bonus'])}))
Roster.Bonus <- as.numeric(Roster.Bonus)
Option.Bonus <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Option Bonus'])}))
Option.Bonus <- as.numeric(Option.Bonus)
Workout.Bonus <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Workout Bonus'])}))
Workout.Bonus <- as.numeric(Workout.Bonus)
Cap.Pct <- unlist(lapply(Actives, function(x){ x[,'Cap %']}))
Restruc.Bonus <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Restruc. Bonus'])}))
Restruc.Bonus <- as.numeric(Restruc.Bonus)
Dead.Cap <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Dead Cap'])}))
Dead.Cap <- as.numeric(Dead.Cap)
Misc.Cap <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Misc.'])}))
Misc.Cap <- as.numeric(Misc.Cap)
Cap.Hit <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Cap Hit'])}))
Cap.Hit <- str_replace_all(Cap.Hit, pattern="\\s\\s*", "")
NFL.Salary.Data <- data.frame(Player=Player.Name, 
                       Last.Name = Last.Name,
                       Team=Team, 
                       Position=Position, 
                       Base.Salary=Base.Salary,
                       Signing.Bonus=Signing.Bonus,
                       Roster.Bonus=Roster.Bonus,
                       Option.Bonus=Option.Bonus,
                       Workout.Bonus=Workout.Bonus,
                       Cap.Hit=Cap.Hit,
                       Restruc.Bonus=Restruc.Bonus,
                       Dead.Cap=Dead.Cap,
                       Misc.Cap=Misc.Cap,
                       Cap.Pct=Cap.Pct)
save.image("~/NFL-Data.RData")

Load up a local copy of the data for now.

load(url("https://github.com/robertwwalker/academic-mymod/raw/master/data/NFL-Data.RData"))

Summaries

library(tidyverse)
library(skimr)
skim(NFL.Salary.Data)
Table 1: Data summary
Name NFL.Salary.Data
Number of rows 2239
Number of columns 14
_______________________
Column type frequency:
factor 5
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Player 0 1 FALSE 2226 Aus: 2, Bra: 2, Bra: 2, Chr: 2
Last.Name 0 1 FALSE 1489 Jon: 41, Wil: 39, Smi: 31, Joh: 30
Team 0 1 FALSE 32 New: 86, Ind: 78, New: 78, Cle: 77
Position 0 1 FALSE 22 WR: 289, CB: 242, DE: 179, OLB: 171
Cap.Hit 0 1 FALSE 1043 480: 279, 555: 230, 630: 185, 705: 66

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Base.Salary 0 1 1702378.82 2559999.20 0 555000.00 700000.00 1500000.00 22500000.00 ▇▁▁▁▁
Signing.Bonus 0 1 449440.82 961034.55 0 0.00 3333.00 397413.50 10500000.00 ▇▁▁▁▁
Roster.Bonus 0 1 206966.64 1021892.52 0 0.00 0.00 0.00 28800000.00 ▇▁▁▁▁
Option.Bonus 0 1 13272.29 168050.78 0 0.00 0.00 0.00 4750000.00 ▇▁▁▁▁
Workout.Bonus 0 1 20166.59 76589.41 0 0.00 0.00 0.00 1335000.00 ▇▁▁▁▁
Restruc.Bonus 0 1 44915.31 341687.19 0 0.00 0.00 0.00 5561666.00 ▇▁▁▁▁
Dead.Cap 0 1 2237274.35 5646281.55 0 0.00 50178.00 1330354.50 82500000.00 ▇▁▁▁▁
Misc.Cap 0 1 9149.17 81182.62 0 0.00 0.00 0.00 2085000.00 ▇▁▁▁▁
Cap.Pct 0 1 1.36 2.05 0 0.31 0.41 1.42 18.02 ▇▁▁▁▁
NFL.Salary.Data %>% group_by(Team) %>% summarise(Total.Base.Salary=sum(Base.Salary))
## # A tibble: 32 x 2
##    Team               Total.Base.Salary
##  * <fct>                          <dbl>
##  1 Arizona Cardinals          105629569
##  2 Atlanta Falcons            132367353
##  3 Baltimore Ravens           111906351
##  4 Buffalo Bills               94577460
##  5 Carolina Panthers          110158355
##  6 Chicago Bears               85128180
##  7 Cincinnati Bengals         134855807
##  8 Cleveland Browns           122139767
##  9 Dallas Cowboys             128347113
## 10 Denver Broncos             140672444
## # … with 22 more rows
NFL.Salary.Data %>% group_by(Position) %>% skim(Base.Salary)
Table 1: Data summary
Name Piped data
Number of rows 2239
Number of columns 14
_______________________
Column type frequency:
numeric 1
________________________
Group variables Position

Variable type: numeric

skim_variable Position n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Base.Salary C 0 1 1964379.0 2173761.0 0 630000.0 862290.5 2693750 9000000 ▇▂▁▁▁
Base.Salary CB 0 1 1594781.4 2433196.2 0 555000.0 630000.0 1329725 13500000 ▇▁▁▁▁
Base.Salary DE 0 1 2007139.1 3065728.5 0 555000.0 790000.0 1812276 17143000 ▇▁▁▁▁
Base.Salary DT 0 1 2014004.0 3128524.2 0 555000.0 630000.0 1760393 16985000 ▇▁▁▁▁
Base.Salary FB 0 1 815000.0 560564.0 0 555000.0 667500.0 837500 2750000 ▂▇▁▁▁
Base.Salary FS 0 1 2641985.1 2783070.4 0 762911.0 1426150.0 3250000 11287000 ▇▁▁▁▁
Base.Salary G 0 1 1704599.2 2115607.1 0 555000.0 790000.0 1880750 10000000 ▇▁▁▁▁
Base.Salary ILB 0 1 1378939.2 1901654.3 0 555000.0 705000.0 1049127 10000000 ▇▁▁▁▁
Base.Salary K 0 1 1383498.1 946083.5 480000 555000.0 899464.0 2037500 3400000 ▇▂▂▂▂
Base.Salary LB 0 1 616500.0 NA 616500 616500.0 616500.0 616500 616500 ▁▁▇▁▁
Base.Salary LS 0 1 761617.7 201707.6 480000 573750.0 790000.0 915000 1100000 ▇▆▅▅▆
Base.Salary LT 0 1 4587732.8 4142986.8 0 961056.2 2310342.0 8825000 12496000 ▇▁▂▂▂
Base.Salary OLB 0 1 1938283.6 2869077.8 0 555000.0 705000.0 1639242 14750000 ▇▁▁▁▁
Base.Salary P 0 1 1242027.0 803853.9 480000 555000.0 1000000.0 1500000 3000000 ▇▃▁▁▂
Base.Salary QB 0 1 3408382.2 5078200.9 0 592500.0 831832.0 3242341 22500000 ▇▁▁▁▁
Base.Salary RB 0 1 1088425.3 1459116.9 0 555000.0 630000.0 1000000 14544000 ▇▁▁▁▁
Base.Salary RT 0 1 2314986.2 1962841.7 0 768750.0 1897621.0 3918750 9341000 ▇▅▃▁▁
Base.Salary S 0 1 576320.5 123126.2 0 480000.0 555000.0 630000 1037723 ▁▁▇▆▁
Base.Salary SS 0 1 1566535.6 1521405.5 0 630000.0 875000.0 1750000 6800000 ▇▂▂▁▁
Base.Salary T 0 1 587082.0 222252.3 480000 480000.0 555000.0 630000 2312212 ▇▁▁▁▁
Base.Salary TE 0 1 1301806.2 1570488.6 480000 555000.0 630000.0 1048771 8250000 ▇▁▁▁▁
Base.Salary WR 0 1 1624579.2 2486576.1 0 555000.0 630000.0 1200000 15982000 ▇▁▁▁▁

Now I have my salary data.

gplots::plotmeans(Base.Salary~Position, data=NFL.Salary.Data, n.label=FALSE, cex=0.6)

boxplot(Base.Salary~Position, data=NFL.Salary.Data)

Combining and Categorizing Data

NFL.Salary.Data$Pos.Char <- NFL.Salary.Data$SideOfBall <- NFL.SubTeam <- as.character(NFL.Salary.Data$Position)
O.Line <- c("LT","RT","C","G","T")
D.Line <- c("DT","DE")
LineBackers <- c("LB","ILB","OLB")
Safeties <- c("S","SS","FS")
CBs <- c("CB")
Special.Teams <- c("K","P","LS")
RBs <- c("RB","FB")
TEs <- c("TE")
WRs <- c("WR")
QBs <- c("QB")
Offense <- c(QBs,RBs,WRs,TEs,O.Line)
Defense <- c(D.Line,LineBackers,Safeties,CBs)
NFL.Salary.Data$SideOfBall[NFL.Salary.Data$Pos.Char %in% Offense] <- "Offense"
NFL.Salary.Data$SideOfBall[NFL.Salary.Data$Pos.Char %in% Special.Teams] <- "Special Teams"
NFL.Salary.Data$SideOfBall[NFL.Salary.Data$Pos.Char %in% Defense] <- "Defense"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% Special.Teams] <- "Special Teams"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% QBs] <- "Quarterbacks"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% RBs] <- "Running Backs"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% TEs] <- "Tight Ends"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% WRs] <- "Wide Receivers"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% O.Line] <- "Offensive Line"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% LineBackers] <- "Linebackers"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% CBs] <- "Cornerbacks"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% D.Line] <- "Defensive Line"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% Safeties] <- "Safeties"

That gives some positional details to the data also.