April 18, 2018

Global mortality tidyTuesday

tidyTuesday on Global Mortality

The three generic challenge graphics involve two global summaries, a raw count by type and a percentage by type. The individual county breakdowns are recorded for a predetermined year below. This can all be seen in the original. For whatever reason, I cannot open this data remotely.

Here is this week’s tidyTuesday.

library(skimr)
library(tidyverse)
library(rlang)
# global_mortality <- readRDS("../../data/global_mortality.rds")
global_mortality <- readRDS(url("https://github.com/robertwwalker/academic-mymod/raw/master/data/global_mortality.rds"))
skim(global_mortality)
Table 1: Data summary
Name global_mortality
Number of rows 6156
Number of columns 35
_______________________
Column type frequency:
character 2
numeric 33
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1.00 4 32 0 228 0
country_code 864 0.86 3 8 0 196 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2003.00 7.79 1990.00 1996.00 2003.00 2010.00 2016.00 ▇▇▇▇▇
Cardiovascular diseases (%) 0 1.00 29.93 14.02 1.43 18.74 30.65 38.45 67.39 ▃▅▇▂▁
Cancers (%) 0 1.00 14.39 8.15 0.58 6.93 13.31 21.36 33.62 ▇▇▆▆▃
Respiratory diseases (%) 0 1.00 4.10 2.35 0.30 2.26 3.63 5.38 16.29 ▇▇▂▁▁
Diabetes (%) 0 1.00 6.29 4.44 0.33 3.20 4.99 7.93 35.82 ▇▂▁▁▁
Dementia (%) 0 1.00 3.22 2.75 0.04 1.01 2.53 4.33 16.67 ▇▃▁▁▁
Lower respiratory infections (%) 0 1.00 5.84 3.42 0.68 3.21 5.14 8.16 20.04 ▇▆▃▁▁
Neonatal deaths (%) 0 1.00 4.57 3.85 0.04 0.69 3.89 7.74 17.81 ▇▃▃▁▁
Diarrheal diseases (%) 0 1.00 3.20 4.36 0.01 0.18 0.77 5.29 25.18 ▇▂▁▁▁
Road accidents (%) 0 1.00 2.53 2.19 0.28 1.36 1.93 2.90 20.90 ▇▁▁▁▁
Liver disease (%) 0 1.00 2.12 1.24 0.19 1.34 1.83 2.54 11.65 ▇▂▁▁▁
Tuberculosis (%) 0 1.00 2.13 2.65 0.01 0.24 0.91 3.33 16.47 ▇▂▁▁▁
Kidney disease (%) 0 1.00 2.09 1.49 0.06 0.90 1.73 2.96 9.95 ▇▅▂▁▁
Digestive diseases (%) 0 1.00 1.97 0.68 0.31 1.51 1.93 2.29 5.16 ▂▇▃▁▁
HIV/AIDS (%) 0 1.00 3.35 7.81 0.00 0.08 0.44 2.33 62.19 ▇▁▁▁▁
Suicide (%) 0 1.00 1.39 1.11 0.10 0.69 1.18 1.80 15.41 ▇▁▁▁▁
Malaria (%) 0 1.00 1.80 4.21 0.00 0.00 0.00 0.48 24.43 ▇▁▁▁▁
Homicide (%) 0 1.00 0.98 1.45 0.05 0.24 0.52 1.00 14.23 ▇▁▁▁▁
Nutritional deficiencies (%) 0 1.00 1.10 1.87 0.00 0.09 0.40 1.34 35.55 ▇▁▁▁▁
Meningitis (%) 0 1.00 0.78 0.97 0.03 0.11 0.35 1.02 6.98 ▇▂▁▁▁
Protein-energy malnutrition (%) 0 1.00 1.00 1.81 0.00 0.06 0.32 1.22 35.52 ▇▁▁▁▁
Drowning (%) 0 1.00 0.71 0.51 0.05 0.35 0.61 0.96 4.51 ▇▂▁▁▁
Maternal deaths (%) 0 1.00 0.59 0.70 0.00 0.03 0.24 1.00 3.41 ▇▂▂▁▁
Parkinson disease (%) 0 1.00 0.29 0.26 0.00 0.07 0.21 0.43 1.59 ▇▃▁▁▁
Alcohol disorders (%) 0 1.00 0.32 0.41 0.01 0.08 0.16 0.39 3.08 ▇▁▁▁▁
Intestinal infectious diseases (%) 0 1.00 0.18 0.30 0.00 0.00 0.01 0.27 2.28 ▇▁▁▁▁
Drug disorders (%) 0 1.00 0.18 0.19 0.00 0.06 0.12 0.23 1.31 ▇▂▁▁▁
Hepatitis (%) 0 1.00 0.16 0.17 0.00 0.04 0.11 0.24 1.58 ▇▁▁▁▁
Fire (%) 0 1.00 0.33 0.18 0.06 0.19 0.32 0.43 1.34 ▇▇▁▁▁
Heat-related (hot and cold exposure) (%) 0 1.00 0.10 0.13 0.01 0.04 0.07 0.11 1.17 ▇▁▁▁▁
Natural disasters (%) 0 1.00 0.09 1.28 0.00 0.00 0.00 0.02 65.29 ▇▁▁▁▁
Conflict (%) 1398 0.77 0.29 2.40 0.00 0.00 0.00 0.02 82.32 ▇▁▁▁▁
Terrorism (%) 1398 0.77 0.04 0.23 0.00 0.00 0.00 0.00 5.88 ▇▁▁▁▁

That loads the data for the challenge.

Counts <- read.csv(url("https://github.com/robertwwalker/academic-mymod/raw/master/data/annual-number-of-deaths-by-cause.csv"))
Counts %>% filter(Year==2016) %>% select(-Code,-Entity,-Execution..deaths.,-Year) %>% apply(., 2, function(x) { sum(x, na.rm=TRUE)}) -> temp1
pp.df <- data.frame(Total.Deaths=temp1,name=names(temp1))
pp.df <- pp.df %>% arrange(Total.Deaths)
pp.df$name <- factor(pp.df$name, levels = pp.df$name)
cplot <- ggplot(pp.df, aes(name,Total.Deaths)) + geom_bar(stat="identity") + coord_flip() + scale_fill_gradientn(colours = terrain.colors(10)) + ggtitle("The Causes of Global Mortailty (2016)")
cplot

With a reenactment of the base target plots, I can turn to new visuals. I wanted to be able to develop a comparison of the various classified causes of death and to try out my nifty function for summarizing panel data. So here goes.

xtsum <- function(formula, data) {
  pform <- terms(formula, data=data)
  unit <- pform[[2]]
  vars <- attr(pform, "term.labels")
  cls <- sapply(data, class)
  data <- data %>% select(which(cls%in%c("numeric","integer")))
  varnames <- intersect(names(data),vars)
  sumfunc <- function(data=data, varname, unit) {
  loc.unit <- enquo(unit)
  varname <- ensym(varname)
    ores <- data %>% filter(!is.na(!! varname)==TRUE) %>% summarise(
    O.mean=round(mean(`$`(data, !! varname), na.rm=TRUE), digits=3),
    O.sd=round(sd(`$`(data, !! varname), na.rm=TRUE), digits=3), 
    O.min = min(`$`(data, !! varname), na.rm=TRUE), 
    O.max=max(`$`(data, !! varname), na.rm=TRUE), 
    O.SumSQ=round(sum(scale(`$`(data, !! varname), center=TRUE, scale=FALSE)^2, na.rm=TRUE), digits=3), 
    O.N=sum(as.numeric((!is.na(`$`(data, !! varname))))))
 bmeans <- data %>% filter(!is.na(!! varname)==TRUE) %>% group_by(!! loc.unit) %>% summarise(
   meanx=mean(`$`(.data, !! varname), na.rm=T), 
   t.count=sum(as.numeric(!is.na(`$`(.data, !! varname)))))
    bres <- bmeans %>% ungroup() %>% summarise(
    B.mean = round(mean(meanx, na.rm=TRUE), digits=3),
    B.sd = round(sd(meanx, na.rm=TRUE), digits=3),
    B.min = min(meanx, na.rm=TRUE), 
    B.max=max(meanx, na.rm=TRUE), 
    Units=sum(as.numeric(!is.na(t.count))), 
    t.bar=round(mean(t.count, na.rm=TRUE), digits=3))
  wdat <- data %>% filter(!is.na(!! varname)==TRUE) %>% group_by(!! loc.unit) %>% mutate(
    W.x = scale(`$`(.data,!! varname), scale=FALSE))
  wres <- wdat %>% ungroup() %>% summarise(
    W.sd=round(sd(W.x, na.rm=TRUE), digits=3), 
    W.min=min(W.x, na.rm=TRUE), 
    W.max=max(W.x, na.rm=TRUE), 
    W.SumSQ=round(sum(W.x^2, na.rm=TRUE), digits=3))
    W.Ratio <- round(wres$W.SumSQ/ores$O.SumSQ, digits=3)
  return(c(ores,bres,wres,Within.Ovr.Ratio=W.Ratio))
  }
res1 <- sapply(varnames, function(x) {sumfunc(data, !!x, !!unit)})
return(t(res1))
}  
global_mortality$countryF <- as.factor(global_mortality$country)
global_mortality$countryN <- as.numeric(as.factor(global_mortality$country))
names(global_mortality) <- gsub(" \\(%\\)","",names(global_mortality))
# For some reason, the xtsum function does not respond to the weird variable names but will accept them devoid of (%)
myxt.res <- xtsum(countryN~., data=global_mortality)
myxt.res
##              O.mean O.sd  O.min       O.max    O.SumSQ  O.N  B.mean B.sd 
## year         2003   7.79  1990        2016     373464   6156 2003   0    
## Cancers      14.387 8.154 0.5822726   33.6175  409267.1 6156 14.387 8.034
## Diabetes     6.286  4.436 0.3271329   35.81619 121099.2 6156 6.286  4.24 
## Dementia     3.221  2.746 0.04475178  16.67248 46417.87 6156 3.221  2.618
## Tuberculosis 2.133  2.648 0.01088091  16.46586 43159.87 6156 2.133  2.554
## Suicide      1.391  1.111 0.1016103   15.41202 7591.252 6156 1.391  1.086
## Malaria      1.801  4.213 0           24.42596 109251.4 6156 1.801  4.11 
## Homicide     0.983  1.454 0.04520209  14.22926 13007.95 6156 0.983  1.425
## Meningitis   0.783  0.967 0.02798604  6.981346 5760.664 6156 0.783  0.951
## Drowning     0.714  0.514 0.05330638  4.510948 1628.778 6156 0.714  0.487
## Hepatitis    0.161  0.165 0.004847886 1.583289 167.675  6156 0.161  0.158
## Fire         0.334  0.181 0.05691324  1.343686 201.769  6156 0.334  0.168
## Conflict     0.291  2.399 0           82.317   27372.61 4758 0.291  0.921
## Terrorism    0.037  0.229 0           5.877    250.161  4758 0.037  0.121
##              B.min       B.max     Units t.bar W.sd  W.min      W.max    
## year         2003        2003      228   27    7.79  -13        13       
## Cancers      3.109526    30.19716  228   27    1.489 -8.017757  5.449552 
## Diabetes     1.183787    27.70547  228   27    1.331 -9.750057  10.92867 
## Dementia     0.2734327   11.82133  228   27    0.847 -5.183193  5.145011 
## Tuberculosis 0.03205589  14.84698  228   27    0.72  -6.169898  7.156982 
## Suicide      0.2441106   12.16457  228   27    0.243 -2.210464  3.247445 
## Malaria      0           20.10408  228   27    0.963 -7.916651  6.903596 
## Homicide     0.05593046  11.06133  228   27    0.303 -3.487857  3.167933 
## Meningitis   0.04221932  5.315156  228   27    0.187 -1.984723  2.166714 
## Drowning     0.06142987  3.515331  228   27    0.168 -1.775808  1.274717 
## Hepatitis    0.008156611 0.9932521 228   27    0.048 -0.3946904 0.8763799
## Fire         0.07003296  0.9950066 228   27    0.069 -0.4329619 0.8721761
## Conflict     0           9.320885  183   26    2.216 -9.320885  78.38404 
## Terrorism    0           1.298962  183   26    0.195 -1.298962  4.578038 
##              W.SumSQ  Within.Ovr.Ratio
## year         373464   1               
## Cancers      13651.25 0.033           
## Diabetes     10897.8  0.09            
## Dementia     4420.371 0.095           
## Tuberculosis 3193.133 0.074           
## Suicide      364.808  0.048           
## Malaria      5711.216 0.052           
## Homicide     564.186  0.043           
## Meningitis   216.175  0.038           
## Drowning     174.405  0.107           
## Hepatitis    14.147   0.084           
## Fire         29.045   0.144           
## Conflict     23357.85 0.853           
## Terrorism    180.881  0.723

The function output can be read as follows. It needs better formatting as a table. For now, O. is an overall measure, overall mean, standard deviation, minimum, maximum, sum of squares and total observations. A between mean, standard deviation, minimum, and maximum with the number of units and the average number of time points. Finally, we have a within standard deviation, minimum, maximum, sum of squares, and a within proportion of the overall variance. In this case, terrorism and conflict are the two variables that vary almost entirely within and far less between countries. I suspect this is because they are rather high in a few places and consistently so.

ggplot.res <- data.frame(myxt.res)
ggplot.res <- ggplot.res[-1,]; ggplot.res <- ggplot.res[,17]
mydf <- t(data.frame(ggplot.res))
mydf <- data.frame(value=mydf,name=rownames(mydf))
mydf <- mydf %>% arrange(value)
mydf$name <- factor(mydf$name, levels = mydf$name)
mydf <- mydf %>% mutate(Emph=as.numeric(value>0.5))
mydf %>% ggplot(aes(name,value, fill=Emph)) + geom_bar(stat="identity") + coord_flip() + ylab("Within Percent of Total Variation") + xlab("Cause of Mortality") + ggtitle("Within-country variation in the Causes of Death") + guides(fill="none")

Mean.Homicide <- global_mortality %>% group_by(countryF) %>% summarise(mymean=mean(Homicide, na.rm=TRUE))
Mean.Homicide %>% ungroup() %>% arrange(mymean) -> mydf
mydf$countryF <- factor(mydf$countryF, levels = mydf$countryF)
mydf %>% ungroup() %>% top_n(30, mymean) %>% arrange(mymean) %>% mutate(Emph=c(rep(1,10),rep(2,10),rep(3,10))) -> mydf
mydf %>% ggplot(aes(countryF,mymean, fill=Emph)) + geom_bar(stat="identity") + coord_flip() + ylab("Homicides") + ggtitle("Top 30 Countries/Places in Homicide") + xlab("") + guides(fill="none") -> Homicideplot
Homicideplot

global_mortality %>% filter(year==2016) %>% ggplot(aes(`Drug disorders`,`Alcohol disorders`)) + geom_point() + ggtitle("Drugs and Alcohol in 2016") -> scatterDA
scatterDA