April 4, 2018

Scraping the NFL Salary Cap Data with Python and R

The NFL Data

[SporTrac](http://www.sportrac.com] has a wonderful array of financial data on sports. A student going to work for the Seattle Seahawks wanted the NFL salary cap data and I also found data on the English Premier League there. Now I have a source to scrape the data from.

With a source in hand, the key tool is the SelectorGadget. SelectorGadget is a browser add-in for Chrome that allows us to select text and identify the css or xpath selector to scrape the data. With that, it becomes easy to identify what we need. I navgiated to the website in base_url and found the team names had links to the cap data. I will use those links first. Let’s build that.

library(rvest)
base_url <- "http://www.spotrac.com/nfl/"
read.base <- read_html(base_url)
team.URL <- read.base %>% html_nodes(".team-name") %>% html_attr('href')
team.URL

##  [1] "https://www.spotrac.com/nfl/arizona-cardinals/cap/"       
##  [2] "https://www.spotrac.com/nfl/atlanta-falcons/cap/"         
##  [3] "https://www.spotrac.com/nfl/baltimore-ravens/cap/"        
##  [4] "https://www.spotrac.com/nfl/buffalo-bills/cap/"           
##  [5] "https://www.spotrac.com/nfl/carolina-panthers/cap/"       
##  [6] "https://www.spotrac.com/nfl/chicago-bears/cap/"           
##  [7] "https://www.spotrac.com/nfl/cincinnati-bengals/cap/"      
##  [8] "https://www.spotrac.com/nfl/cleveland-browns/cap/"        
##  [9] "https://www.spotrac.com/nfl/dallas-cowboys/cap/"          
## [10] "https://www.spotrac.com/nfl/denver-broncos/cap/"          
## [11] "https://www.spotrac.com/nfl/detroit-lions/cap/"           
## [12] "https://www.spotrac.com/nfl/green-bay-packers/cap/"       
## [13] "https://www.spotrac.com/nfl/houston-texans/cap/"          
## [14] "https://www.spotrac.com/nfl/indianapolis-colts/cap/"      
## [15] "https://www.spotrac.com/nfl/jacksonville-jaguars/cap/"    
## [16] "https://www.spotrac.com/nfl/kansas-city-chiefs/cap/"      
## [17] "https://www.spotrac.com/nfl/las-vegas-raiders/cap/"       
## [18] "https://www.spotrac.com/nfl/los-angeles-chargers/cap/"    
## [19] "https://www.spotrac.com/nfl/los-angeles-rams/cap/"        
## [20] "https://www.spotrac.com/nfl/miami-dolphins/cap/"          
## [21] "https://www.spotrac.com/nfl/minnesota-vikings/cap/"       
## [22] "https://www.spotrac.com/nfl/new-england-patriots/cap/"    
## [23] "https://www.spotrac.com/nfl/new-orleans-saints/cap/"      
## [24] "https://www.spotrac.com/nfl/new-york-giants/cap/"         
## [25] "https://www.spotrac.com/nfl/new-york-jets/cap/"           
## [26] "https://www.spotrac.com/nfl/philadelphia-eagles/cap/"     
## [27] "https://www.spotrac.com/nfl/pittsburgh-steelers/cap/"     
## [28] "https://www.spotrac.com/nfl/san-francisco-49ers/cap/"     
## [29] "https://www.spotrac.com/nfl/seattle-seahawks/cap/"        
## [30] "https://www.spotrac.com/nfl/tampa-bay-buccaneers/cap/"    
## [31] "https://www.spotrac.com/nfl/tennessee-titans/cap/"        
## [32] "https://www.spotrac.com/nfl/washington-football-team/cap/"

# Clean up the URLs to get the team names by themselves.
team.names <- gsub("/cap/","", gsub(base_url, "", team.URL))

With 32 team links to scrape, I can now scrape them.

Grabbing Team Tables

Now I need to explore those links. When I started this, I wanted to learn some Python along the way. I found the following Python code that worked for the task. For completeness, I leave it here.

import pandas as pd
import bs4
import re
import urllib2
#import requests

from urllib import urlopen
from bs4 import BeautifulSoup
base_url = "http://www.spotrac.com/nfl/"

def get_page(url):
    page = urlopen(base_url)
    soup = BeautifulSoup(page, 'lxml')
    file = open("spotrac_urls.txt", 'w')
    file.write(str(soup))
    file.close()

def get_team_table(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')

get_page(base_url)

with open("spotrac_urls.txt", 'r') as file:
    for line in file:
        line = line.strip()

from bs4 import BeautifulSoup
page = open("spotrac_urls.txt", 'r')
soup = BeautifulSoup(page, "lxml")
div = soup.find("div","subnav-posts")

links = div.find_all('a')
for link in links:
    print(link.get('href'))

len(links)

def get_team_table(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    data_rows = [row for row in soup.find("table", "datatable").find_all("tr")]
    return data_rows

# create an empty list
team_data = []

for link in links:
    team_data.append(get_team_table(link.get('href')))

len(team_data)

#data_rows = [row for row in soup.find("td", "center").find_all("tr")]
table_data = []

#soup = BeautifulSoup(team_data[0], 'lxml')

#This needs to be a nested for loop because inner items of the list are BeautifulSoup Elements
for row in team_data:
    for element in row:
        #print(type(element))
        if soup.find_all("td", attrs={"class":" right xs-hide "}) is not None:
            table_data.append(element.get_text())

player_data = []
for row in table_data:
    player_data.append(row.split("\n"))
    #print(player_data)

len(player_data)

import pandas as pd
df = pd.DataFrame(player_data)
df = df.drop(14, 1)
df = df.drop(0, 1)
df = df.drop(1, 1)


df = df.drop(df.index[[0]])
df.set_index(1, inplace=True)
print(df.shape)
df.head()

players = []
for row in team_data[0]:
    if row.get_text("tr") is not None:
        players.append(row) 

column_headers = [col.get_text() for col in players[0].find_all("th") if col.get_text()]
len(column_headers)

df.columns = column_headers
df.head()

#The header repeated itself in the data.  This didn't reveal itself until the data type conversion step below
#but this fixes all occurrences of it.
rows_to_be_dropped = df.loc[df['Cap Hit'] == 'Cap %'].index
rows_to_be_kept = df.loc[df['Cap Hit'] != 'Cap %'].index
totlen = len(df)
df2a = df.drop(rows_to_be_kept)
df = df.drop(rows_to_be_dropped)
df2 = pd.Series(data=rows_to_be_dropped)

#Apply a regex to convert the 'Cap Hit' column from a string to a float.  
# df['Cap Hit'] =(df['Cap Hit'].replace('[\$,)]', "", regex=True).replace( '[(]','-',   regex=True ).astype(float))

# My fix
df['Cap.Hit'] = (df['Cap Hit'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).replace('\s\s.*', "", regex=True).astype(float))
df['Base.Salary'] = (df['Base Salary'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Signing.Bonus'] = (df['Signing Bonus'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Roster.Bonus'] = (df['Roster Bonus'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Option.Bonus'] = (df['Option Bonus'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Workout.Bonus'] = (df['Workout Bonus'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Restruc.Bonus'] = (df['Restruc. Bonus'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Misc'] = (df['Misc.'].replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))
df['Dead.Cap'] = (df['Dead Cap'].replace('\(',"",regex=True).replace('\)',"",regex=True).replace('[\$,)]', "", regex=True).replace('[-,)]', "", regex=True).astype(float))

#Sanity check to make sure it worked.
df['Cap Hit'].sum()

There are a few wonderful things about Beautiful Soup. But it is still faster and easier for me to do things in R. Here is some R code for the same task. The R table is a messy data structure but is easier to get.

library(stringr)
data.creator <- function(link) {
read_html(link) %>% html_nodes("table") %>% html_table(header=TRUE, fill=TRUE) -> res
names(res) <- c("Active","Draft","Inactive","Team1","Team2")
return(res)
  }
team.names <- gsub("-", " ", team.names)
simpleCap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(toupper(substring(s, 1,1)), substring(s, 2),
      sep="", collapse=" ")
}
team.names <- sapply(team.names, simpleCap)
NFL.scrape <- sapply(team.URL, function(x) {data.creator(x)})
names(NFL.scrape) <- team.names 
# This is a hack but it works
Actives <- lapply(NFL.scrape, function(x){x$Active})
rep.res <- sapply(seq(1,32), function(x) {dim(Actives[[x]])[[1]]})
clean.me.2 <- function(data) {
data %>% str_remove_all("[\\t]") %>% str_split("\\n\\n\\n") %>% unlist() %>% matrix(data=., ncol=2, byrow=TRUE) -> dat
return(dat)
}
clean.me <- function(data) {
str_remove_all(data, "[\\t]")
}
clean.me.num <- function(data) {
str_remove_all(data, "[\\$,()\\-]")
  }
Players <- lapply(Actives, function(x){ clean.me.2(x[,1])})
Last.Name <- unlist(lapply(Players, function(x) {x[,1]}))
Player.Name <- unlist(lapply(Players, function(x) {x[,2]}))
Team <- rep(names(Actives),rep.res)
Position <- unlist(lapply(Actives, function(x){ clean.me(x[,2])}))
Base.Salary <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Base Salary'])}))
Base.Salary <- as.numeric(Base.Salary)
Signing.Bonus <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Signing Bonus'])}))
Signing.Bonus <- as.numeric(Signing.Bonus)
Roster.Bonus <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Roster Bonus'])}))
Roster.Bonus <- as.numeric(Roster.Bonus)
Option.Bonus <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Option Bonus'])}))
Option.Bonus <- as.numeric(Option.Bonus)
Workout.Bonus <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Workout Bonus'])}))
Workout.Bonus <- as.numeric(Workout.Bonus)
Cap.Pct <- unlist(lapply(Actives, function(x){ x[,'Cap %']}))
Restruc.Bonus <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Restruc. Bonus'])}))
Restruc.Bonus <- as.numeric(Restruc.Bonus)
Dead.Cap <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Dead Cap'])}))
Dead.Cap <- as.numeric(Dead.Cap)
Misc.Cap <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Misc.'])}))
Misc.Cap <- as.numeric(Misc.Cap)
Cap.Hit <- unlist(lapply(Actives, function(x){ clean.me.num(x[,'Cap Hit'])}))
Cap.Hit <- str_replace_all(Cap.Hit, pattern="\\s\\s*", "")
NFL.Salary.Data <- data.frame(Player=Player.Name, 
                       Last.Name = Last.Name,
                       Team=Team, 
                       Position=Position, 
                       Base.Salary=Base.Salary,
                       Signing.Bonus=Signing.Bonus,
                       Roster.Bonus=Roster.Bonus,
                       Option.Bonus=Option.Bonus,
                       Workout.Bonus=Workout.Bonus,
                       Cap.Hit=Cap.Hit,
                       Restruc.Bonus=Restruc.Bonus,
                       Dead.Cap=Dead.Cap,
                       Misc.Cap=Misc.Cap,
                       Cap.Pct=Cap.Pct)
save.image("~/NFL-Data.RData")

Load up a local copy of the data for now.

load(url("https://github.com/robertwwalker/academic-mymod/raw/master/data/NFL-Data.RData"))

Summaries

library(tidyverse)
library(skimr)
skim(NFL.Salary.Data)

Table 1: Data summary
Name	NFL.Salary.Data
Number of rows	2239
Number of columns	14
_______________________
Column type frequency:
factor	5
numeric	9
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
Player	1	FALSE	2226	Aus: 2, Bra: 2, Bra: 2, Chr: 2
Last.Name	1	FALSE	1489	Jon: 41, Wil: 39, Smi: 31, Joh: 30
Team	1	FALSE	32	New: 86, Ind: 78, New: 78, Cle: 77
Position	1	FALSE	22	WR: 289, CB: 242, DE: 179, OLB: 171
Cap.Hit	1	FALSE	1043	480: 279, 555: 230, 630: 185, 705: 66

Variable type: numeric

skim_variable	complete_rate	mean	sd	p25	p50	p75	p100	hist
Base.Salary	1	1702378.82	2559999.20	555000.00	700000.00	1500000.00	22500000.00	▇▁▁▁▁
Signing.Bonus	1	449440.82	961034.55	0.00	3333.00	397413.50	10500000.00	▇▁▁▁▁
Roster.Bonus	1	206966.64	1021892.52	0.00	0.00	0.00	28800000.00	▇▁▁▁▁
Option.Bonus	1	13272.29	168050.78	0.00	0.00	0.00	4750000.00	▇▁▁▁▁
Workout.Bonus	1	20166.59	76589.41	0.00	0.00	0.00	1335000.00	▇▁▁▁▁
Restruc.Bonus	1	44915.31	341687.19	0.00	0.00	0.00	5561666.00	▇▁▁▁▁
Dead.Cap	1	2237274.35	5646281.55	0.00	50178.00	1330354.50	82500000.00	▇▁▁▁▁
Misc.Cap	1	9149.17	81182.62	0.00	0.00	0.00	2085000.00	▇▁▁▁▁
Cap.Pct	1	1.36	2.05	0.31	0.41	1.42	18.02	▇▁▁▁▁

NFL.Salary.Data %>% group_by(Team) %>% summarise(Total.Base.Salary=sum(Base.Salary))

## # A tibble: 32 x 2
##    Team               Total.Base.Salary
##  * <fct>                          <dbl>
##  1 Arizona Cardinals          105629569
##  2 Atlanta Falcons            132367353
##  3 Baltimore Ravens           111906351
##  4 Buffalo Bills               94577460
##  5 Carolina Panthers          110158355
##  6 Chicago Bears               85128180
##  7 Cincinnati Bengals         134855807
##  8 Cleveland Browns           122139767
##  9 Dallas Cowboys             128347113
## 10 Denver Broncos             140672444
## # … with 22 more rows

NFL.Salary.Data %>% group_by(Position) %>% skim(Base.Salary)

Table 1: Data summary
Name	Piped data
Number of rows	2239
Number of columns	14
_______________________
Column type frequency:
numeric	1
________________________
Group variables	Position

Variable type: numeric

skim_variable	Position	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Base.Salary	C	1	1964379.0	2173761.0	0	630000.0	862290.5	2693750	9000000	▇▂▁▁▁
Base.Salary	CB	1	1594781.4	2433196.2	0	555000.0	630000.0	1329725	13500000	▇▁▁▁▁
Base.Salary	DE	1	2007139.1	3065728.5	0	555000.0	790000.0	1812276	17143000	▇▁▁▁▁
Base.Salary	DT	1	2014004.0	3128524.2	0	555000.0	630000.0	1760393	16985000	▇▁▁▁▁
Base.Salary	FB	1	815000.0	560564.0	0	555000.0	667500.0	837500	2750000	▂▇▁▁▁
Base.Salary	FS	1	2641985.1	2783070.4	0	762911.0	1426150.0	3250000	11287000	▇▁▁▁▁
Base.Salary	G	1	1704599.2	2115607.1	0	555000.0	790000.0	1880750	10000000	▇▁▁▁▁
Base.Salary	ILB	1	1378939.2	1901654.3	0	555000.0	705000.0	1049127	10000000	▇▁▁▁▁
Base.Salary	K	1	1383498.1	946083.5	480000	555000.0	899464.0	2037500	3400000	▇▂▂▂▂
Base.Salary	LB	1	616500.0	NA	616500	616500.0	616500.0	616500	616500	▁▁▇▁▁
Base.Salary	LS	1	761617.7	201707.6	480000	573750.0	790000.0	915000	1100000	▇▆▅▅▆
Base.Salary	LT	1	4587732.8	4142986.8	0	961056.2	2310342.0	8825000	12496000	▇▁▂▂▂
Base.Salary	OLB	1	1938283.6	2869077.8	0	555000.0	705000.0	1639242	14750000	▇▁▁▁▁
Base.Salary	P	1	1242027.0	803853.9	480000	555000.0	1000000.0	1500000	3000000	▇▃▁▁▂
Base.Salary	QB	1	3408382.2	5078200.9	0	592500.0	831832.0	3242341	22500000	▇▁▁▁▁
Base.Salary	RB	1	1088425.3	1459116.9	0	555000.0	630000.0	1000000	14544000	▇▁▁▁▁
Base.Salary	RT	1	2314986.2	1962841.7	0	768750.0	1897621.0	3918750	9341000	▇▅▃▁▁
Base.Salary	S	1	576320.5	123126.2	0	480000.0	555000.0	630000	1037723	▁▁▇▆▁
Base.Salary	SS	1	1566535.6	1521405.5	0	630000.0	875000.0	1750000	6800000	▇▂▂▁▁
Base.Salary	T	1	587082.0	222252.3	480000	480000.0	555000.0	630000	2312212	▇▁▁▁▁
Base.Salary	TE	1	1301806.2	1570488.6	480000	555000.0	630000.0	1048771	8250000	▇▁▁▁▁
Base.Salary	WR	1	1624579.2	2486576.1	0	555000.0	630000.0	1200000	15982000	▇▁▁▁▁

Now I have my salary data.

gplots::plotmeans(Base.Salary~Position, data=NFL.Salary.Data, n.label=FALSE, cex=0.6)

boxplot(Base.Salary~Position, data=NFL.Salary.Data)

Combining and Categorizing Data

NFL.Salary.Data$Pos.Char <- NFL.Salary.Data$SideOfBall <- NFL.SubTeam <- as.character(NFL.Salary.Data$Position)
O.Line <- c("LT","RT","C","G","T")
D.Line <- c("DT","DE")
LineBackers <- c("LB","ILB","OLB")
Safeties <- c("S","SS","FS")
CBs <- c("CB")
Special.Teams <- c("K","P","LS")
RBs <- c("RB","FB")
TEs <- c("TE")
WRs <- c("WR")
QBs <- c("QB")
Offense <- c(QBs,RBs,WRs,TEs,O.Line)
Defense <- c(D.Line,LineBackers,Safeties,CBs)
NFL.Salary.Data$SideOfBall[NFL.Salary.Data$Pos.Char %in% Offense] <- "Offense"
NFL.Salary.Data$SideOfBall[NFL.Salary.Data$Pos.Char %in% Special.Teams] <- "Special Teams"
NFL.Salary.Data$SideOfBall[NFL.Salary.Data$Pos.Char %in% Defense] <- "Defense"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% Special.Teams] <- "Special Teams"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% QBs] <- "Quarterbacks"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% RBs] <- "Running Backs"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% TEs] <- "Tight Ends"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% WRs] <- "Wide Receivers"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% O.Line] <- "Offensive Line"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% LineBackers] <- "Linebackers"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% CBs] <- "Cornerbacks"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% D.Line] <- "Defensive Line"
NFL.Salary.Data$SubTeam[NFL.Salary.Data$Pos.Char %in% Safeties] <- "Safeties"

That gives some positional details to the data also.