Some Basic Text Analysis on the Mueller Report
So this Robert Mueller guy wrote a report
I may as well analyse it a bit. There are tons of things that we might wish to discover about the report; a full accounting is not my goal here.
First, let me see if I can get a hold of the data. I grabbed the report directly from the Department of Justice website. You can follow this link. The report is really long and there are an absolute ton of ways to make sense of it. I want to keep it pretty simple, with an eye toward notes on visualizing long collections of PDF information.
library(tidyverse)
library(pdftools)
# Download report from link above
mueller_report_txt <- pdf_text("../data/report.pdf")
# Create a tibble of the text with line numbers and pages
mueller_report <- tibble(
  page = seq_along(mueller_report_txt),
  text = mueller_report_txt) %>%
  separate_rows(text, sep = "\n") %>%
  group_by(page) %>%
  mutate(line = row_number()) %>%
  ungroup() %>%
  select(page, line, text)
write_csv(mueller_report, "data/mueller_report.csv")
Now I can work from a saved copy of the data (the .csv written above, or the .RData file loaded below) instead of the .pdf; reading the .pdf and hacking it up takes time.
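To see what the page-and-line reshaping does without waiting on the full PDF, here is a minimal sketch on two made-up pages. The `toy_pages` vector is an assumption standing in for the output of `pdf_text()`, which returns one string per page with lines separated by `\n`.

```r
library(dplyr)
library(tidyr)

# Two invented "pages" standing in for the character vector from pdf_text()
toy_pages <- c("Title line\nSecond line\nThird line",
               "Header\nBody text")

toy_report <- tibble(page = seq_along(toy_pages), text = toy_pages) %>%
  separate_rows(text, sep = "\n") %>%   # one row per line of text
  group_by(page) %>%
  mutate(line = row_number()) %>%       # restart line numbering on each page
  ungroup() %>%
  select(page, line, text)

toy_report
```

Each page contributes as many rows as it has lines, and `line` restarts at 1 on every page, which is what makes it possible to address "the second line of every page" later on.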
library(pdftools)
library(here)
library(tidyverse)
load("MuellerReport.RData")
head(mueller_report)
## page line
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## text
## 1 U.S. Department of Justice
## 2 AttarAe:,c \\\\'erlc Predtiet // Mtt; CeA1:ttiA Ma1:ertal Prn1:eeted UAder Fed. R. Crhtt. P. 6(e)
## 3 Report On The Investigation Into
## 4 Russian Interference In The
## 5 2016 Presidential Election
## 6 Volume I of II
The text is generally pretty good, though there is some garbage. The second line contains redactions and those are the underlying cause. In fact, every page contains this same line, though it converts to text differently from page to page, so I drop it by position rather than by matching its text.
mueller_ml2 <- mueller_report %>% dplyr::filter(line != 2)
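A minimal sketch of why the filter goes by line number rather than by text. The toy tibble and its garbled strings are invented for illustration, and it assumes the offending header really does sit on line 2 of every page.

```r
library(dplyr)

# Invented example: the same header converts differently on each page
toy <- tibble(
  page = c(1, 1, 1, 2, 2, 2),
  line = c(1, 2, 3, 1, 2, 3),
  text = c("U.S. Department of Justice", "Attarney W0rk Pr0duct", "Volume I of II",
           "U.S. Department of Justice", "Att0rney Work Product", "Body text")
)

# More than one variant of the header, so exact text matching would miss some
n_distinct(toy$text[toy$line == 2])

# Filtering on position removes every variant at once
toy_clean <- toy %>% filter(line != 2)
toy_clean
```

The trade-off is that a positional filter would also drop real content on any page where the header is missing or lands on a different line, so a quick look at a sample of the removed rows is worthwhile.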
I want to make use of cleanNLP to turn this into something that I can analyze. The first step is to get rid of the tidiness, of sorts, by collapsing the text column back into a plain character vector.
Once upon a time, this worked with the Linux tools and others. The spaCy and CoreNLP functionality is not native R, and the Python interface is currently broken, at least on this server.
library(tidyverse)
library(RCurl)
library(tokenizers)
library(cleanNLP)
# cnlp_download_spacy("en")
MRep <- paste(as.character(mueller_ml2$text), " ")
# cnlp_init_stringi()
# starttime <- Sys.time()
# stringi_annotate <- MRep %>% as.character() %>% cnlp_annotate(., verbose=FALSE)
# endtime <- Sys.time()
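With the annotation step commented out, a pure-R fallback is plain word tokenization. This is a sketch using the tokenizers package loaded above, not the cleanNLP annotation itself, so it yields tokens only (no lemmas or parts of speech). The `toy_text` sentences are made up.

```r
library(tokenizers)

# Invented sentences standing in for elements of MRep
toy_text <- c("The Special Counsel investigated Russian interference.",
              "The report has two volumes.")

# Returns a list with one character vector of lowercased words per input string
toy_tokens <- tokenize_words(toy_text, lowercase = TRUE, strip_punct = TRUE)

lengths(toy_tokens)
toy_tokens[[2]]
```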
I wanted to find the bigrams while removing stop words. Apparently, the easiest way to do this is quanteda. I got this from Stack Overflow.
library(widgetframe)
## Loading required package: htmlwidgets
library(quanteda)
## Package version: 2.1.2
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(wordcloud)
## Loading required package: RColorBrewer
myDfm <- tokens(MRep) %>%
  tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
  tokens_remove(stopwords("english"), padding = TRUE) %>%
  tokens_ngrams(n = 2) %>%
  dfm()
wc2 <- topfeatures(myDfm, n=150, scheme="count")
wc2.df <- data.frame(word = names(wc2), freq=as.numeric(wc2))
wc2.df$word <- as.character(wc2.df$word)
wc2.df <- wc2.df %>% filter(freq < 300)
# wordcloud(wc2.df, size=0.4)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
frameWidget(hchart(wc2.df, "wordcloud", hcaes(name=word, weight=freq/30)))
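The `padding = TRUE` argument in the bigram code matters more than it looks. A small sketch on an invented sentence: with padding, a removed stop word leaves an empty placeholder, so words that were never adjacent are not glued into a bigram. The helper `bigram_features` is hypothetical and exists only for this comparison.

```r
library(quanteda)

toy <- "The report of the special counsel was released"

# Hypothetical helper: bigram feature names with or without padding
bigram_features <- function(x, padding) {
  toks <- tokens(x, remove_punct = TRUE)
  toks <- tokens_remove(toks, stopwords("english"), padding = padding)
  toks <- tokens_ngrams(toks, n = 2)
  featnames(dfm(toks))
}

with_pad <- bigram_features(toy, padding = TRUE)
without_pad <- bigram_features(toy, padding = FALSE)

with_pad
without_pad
```

Without padding, "report" and "special" become neighbours once "of the" is dropped, producing a `report_special` bigram that never occurs in the text. With padding, only genuinely adjacent pairs such as `special_counsel` survive.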
pdfpages: A little plot
I found some instructions on constructing the entire document on a grid and pulled the report apart to visualise it in that way.
library(pdftools)
library(png)
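# Writes one PNG per page (report_1.png, report_2.png, and so on) to the
# working directory; at the default 72 dpi a letter page is 612 x 792 pixels.
pdf_convert("data/report.pdf")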
# Dimensions of 1 page.
imgwidth <- 612
imgheight <- 792
# Grid dimensions.
gridwidth <- 30
gridheight <- 15
# Total plot width and height.
spacing <- 1
totalwidth <- (imgwidth+spacing) * (gridwidth)
totalheight <- (imgheight+spacing) * gridheight
# Plot all the pages and save as PNG.
png("RSMReport.png", round(totalwidth / 7), round(totalheight / 7))
par(mar = c(0, 0, 0, 0))
plot(0, 0, type = "n", xlim = c(0, totalwidth), ylim = c(0, totalheight), asp = 1, bty = "n", axes = FALSE)
for (i in 1:448) {
  fname <- paste0("report_", i, ".png")
  img <- readPNG(fname)
  # Zero-based index so that page 1 lands in the top-left cell and each
  # row holds gridwidth consecutive pages.
  idx <- i - 1
  x <- (idx %% gridwidth) * (imgwidth + spacing)
  y <- totalheight - (idx %/% gridwidth) * (imgheight + spacing)
  rasterImage(img, xleft = x, ybottom = y - imgheight, xright = x + imgwidth, ytop = y)
}
dev.off()
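The grid arithmetic is easy to get off by one, so here is a minimal sketch of it in isolation. It assumes pages are numbered from 1 and should fill the grid left to right, top to bottom; `grid_position` is a hypothetical helper, not part of the code above.

```r
# Hypothetical helper: zero-based column and row for a one-based page number
grid_position <- function(i, gridwidth) {
  idx <- i - 1                      # shift to zero-based before dividing
  list(col = idx %% gridwidth,      # position within the row
       row = idx %/% gridwidth)     # how many full rows come before it
}

grid_position(1, 30)    # first page: top-left cell
grid_position(30, 30)   # last cell of the first row
grid_position(31, 30)   # wraps to the start of the second row
grid_position(448, 30)  # last page of the report
```

Subtracting 1 before taking the remainder is what keeps page 1 in the first column; using the raw page number would leave the top-left cell empty and push every 30th page onto the start of the next row. With a 30 by 15 grid there are 450 cells, so 448 pages fit with two cells to spare.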