April 18, 2018

Global mortality tidyTuesday

tidyTuesday on Global Mortality

The three generic challenge graphics involve two global summaries, a raw count by type and a percentage by type. The individual county breakdowns are recorded for a predetermined year below. This can all be seen in the original. For whatever reason, I cannot open this data remotely.

Here is this week’s tidyTuesday.

library(skimr)
library(tidyverse)
library(rlang)
# global_mortality <- readRDS("../../data/global_mortality.rds")
global_mortality <- readRDS(url("https://github.com/robertwwalker/academic-mymod/raw/master/data/global_mortality.rds"))
skim(global_mortality)

Table 1: Data summary
Name	global_mortality
Number of rows	6156
Number of columns	35
_______________________
Column type frequency:
character	2
numeric	33
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1.00	4	32	0	228	0
country_code	864	0.86	3	8	0	196	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2003.00	7.79	1990.00	1996.00	2003.00	2010.00	2016.00	▇▇▇▇▇
Cardiovascular diseases (%)	0	1.00	29.93	14.02	1.43	18.74	30.65	38.45	67.39	▃▅▇▂▁
Cancers (%)	0	1.00	14.39	8.15	0.58	6.93	13.31	21.36	33.62	▇▇▆▆▃
Respiratory diseases (%)	0	1.00	4.10	2.35	0.30	2.26	3.63	5.38	16.29	▇▇▂▁▁
Diabetes (%)	0	1.00	6.29	4.44	0.33	3.20	4.99	7.93	35.82	▇▂▁▁▁
Dementia (%)	0	1.00	3.22	2.75	0.04	1.01	2.53	4.33	16.67	▇▃▁▁▁
Lower respiratory infections (%)	0	1.00	5.84	3.42	0.68	3.21	5.14	8.16	20.04	▇▆▃▁▁
Neonatal deaths (%)	0	1.00	4.57	3.85	0.04	0.69	3.89	7.74	17.81	▇▃▃▁▁
Diarrheal diseases (%)	0	1.00	3.20	4.36	0.01	0.18	0.77	5.29	25.18	▇▂▁▁▁
Road accidents (%)	0	1.00	2.53	2.19	0.28	1.36	1.93	2.90	20.90	▇▁▁▁▁
Liver disease (%)	0	1.00	2.12	1.24	0.19	1.34	1.83	2.54	11.65	▇▂▁▁▁
Tuberculosis (%)	0	1.00	2.13	2.65	0.01	0.24	0.91	3.33	16.47	▇▂▁▁▁
Kidney disease (%)	0	1.00	2.09	1.49	0.06	0.90	1.73	2.96	9.95	▇▅▂▁▁
Digestive diseases (%)	0	1.00	1.97	0.68	0.31	1.51	1.93	2.29	5.16	▂▇▃▁▁
HIV/AIDS (%)	0	1.00	3.35	7.81	0.00	0.08	0.44	2.33	62.19	▇▁▁▁▁
Suicide (%)	0	1.00	1.39	1.11	0.10	0.69	1.18	1.80	15.41	▇▁▁▁▁
Malaria (%)	0	1.00	1.80	4.21	0.00	0.00	0.00	0.48	24.43	▇▁▁▁▁
Homicide (%)	0	1.00	0.98	1.45	0.05	0.24	0.52	1.00	14.23	▇▁▁▁▁
Nutritional deficiencies (%)	0	1.00	1.10	1.87	0.00	0.09	0.40	1.34	35.55	▇▁▁▁▁
Meningitis (%)	0	1.00	0.78	0.97	0.03	0.11	0.35	1.02	6.98	▇▂▁▁▁
Protein-energy malnutrition (%)	0	1.00	1.00	1.81	0.00	0.06	0.32	1.22	35.52	▇▁▁▁▁
Drowning (%)	0	1.00	0.71	0.51	0.05	0.35	0.61	0.96	4.51	▇▂▁▁▁
Maternal deaths (%)	0	1.00	0.59	0.70	0.00	0.03	0.24	1.00	3.41	▇▂▂▁▁
Parkinson disease (%)	0	1.00	0.29	0.26	0.00	0.07	0.21	0.43	1.59	▇▃▁▁▁
Alcohol disorders (%)	0	1.00	0.32	0.41	0.01	0.08	0.16	0.39	3.08	▇▁▁▁▁
Intestinal infectious diseases (%)	0	1.00	0.18	0.30	0.00	0.00	0.01	0.27	2.28	▇▁▁▁▁
Drug disorders (%)	0	1.00	0.18	0.19	0.00	0.06	0.12	0.23	1.31	▇▂▁▁▁
Hepatitis (%)	0	1.00	0.16	0.17	0.00	0.04	0.11	0.24	1.58	▇▁▁▁▁
Fire (%)	0	1.00	0.33	0.18	0.06	0.19	0.32	0.43	1.34	▇▇▁▁▁
Heat-related (hot and cold exposure) (%)	0	1.00	0.10	0.13	0.01	0.04	0.07	0.11	1.17	▇▁▁▁▁
Natural disasters (%)	0	1.00	0.09	1.28	0.00	0.00	0.00	0.02	65.29	▇▁▁▁▁
Conflict (%)	1398	0.77	0.29	2.40	0.00	0.00	0.00	0.02	82.32	▇▁▁▁▁
Terrorism (%)	1398	0.77	0.04	0.23	0.00	0.00	0.00	0.00	5.88	▇▁▁▁▁

That loads the data for the challenge.

Counts <- read.csv(url("https://github.com/robertwwalker/academic-mymod/raw/master/data/annual-number-of-deaths-by-cause.csv"))
Counts %>% filter(Year==2016) %>% select(-Code,-Entity,-Execution..deaths.,-Year) %>% apply(., 2, function(x) { sum(x, na.rm=TRUE)}) -> temp1
pp.df <- data.frame(Total.Deaths=temp1,name=names(temp1))
pp.df <- pp.df %>% arrange(Total.Deaths)
pp.df$name <- factor(pp.df$name, levels = pp.df$name)
cplot <- ggplot(pp.df, aes(name,Total.Deaths)) + geom_bar(stat="identity") + coord_flip() + scale_fill_gradientn(colours = terrain.colors(10)) + ggtitle("The Causes of Global Mortailty (2016)")
cplot

With a reenactment of the base target plots, I can turn to new visuals. I wanted to be able to develop a comparison of the various classified causes of death and to try out my nifty function for summarizing panel data. So here goes.

xtsum <- function(formula, data) {
  pform <- terms(formula, data=data)
  unit <- pform[[2]]
  vars <- attr(pform, "term.labels")
  cls <- sapply(data, class)
  data <- data %>% select(which(cls%in%c("numeric","integer")))
  varnames <- intersect(names(data),vars)
  sumfunc <- function(data=data, varname, unit) {
  loc.unit <- enquo(unit)
  varname <- ensym(varname)
    ores <- data %>% filter(!is.na(!! varname)==TRUE) %>% summarise(
    O.mean=round(mean(`$`(data, !! varname), na.rm=TRUE), digits=3),
    O.sd=round(sd(`$`(data, !! varname), na.rm=TRUE), digits=3), 
    O.min = min(`$`(data, !! varname), na.rm=TRUE), 
    O.max=max(`$`(data, !! varname), na.rm=TRUE), 
    O.SumSQ=round(sum(scale(`$`(data, !! varname), center=TRUE, scale=FALSE)^2, na.rm=TRUE), digits=3), 
    O.N=sum(as.numeric((!is.na(`$`(data, !! varname))))))
 bmeans <- data %>% filter(!is.na(!! varname)==TRUE) %>% group_by(!! loc.unit) %>% summarise(
   meanx=mean(`$`(.data, !! varname), na.rm=T), 
   t.count=sum(as.numeric(!is.na(`$`(.data, !! varname)))))
    bres <- bmeans %>% ungroup() %>% summarise(
    B.mean = round(mean(meanx, na.rm=TRUE), digits=3),
    B.sd = round(sd(meanx, na.rm=TRUE), digits=3),
    B.min = min(meanx, na.rm=TRUE), 
    B.max=max(meanx, na.rm=TRUE), 
    Units=sum(as.numeric(!is.na(t.count))), 
    t.bar=round(mean(t.count, na.rm=TRUE), digits=3))
  wdat <- data %>% filter(!is.na(!! varname)==TRUE) %>% group_by(!! loc.unit) %>% mutate(
    W.x = scale(`$`(.data,!! varname), scale=FALSE))
  wres <- wdat %>% ungroup() %>% summarise(
    W.sd=round(sd(W.x, na.rm=TRUE), digits=3), 
    W.min=min(W.x, na.rm=TRUE), 
    W.max=max(W.x, na.rm=TRUE), 
    W.SumSQ=round(sum(W.x^2, na.rm=TRUE), digits=3))
    W.Ratio <- round(wres$W.SumSQ/ores$O.SumSQ, digits=3)
  return(c(ores,bres,wres,Within.Ovr.Ratio=W.Ratio))
  }
res1 <- sapply(varnames, function(x) {sumfunc(data, !!x, !!unit)})
return(t(res1))
}  
global_mortality$countryF <- as.factor(global_mortality$country)
global_mortality$countryN <- as.numeric(as.factor(global_mortality$country))
names(global_mortality) <- gsub(" \\(%\\)","",names(global_mortality))
# For some reason, the xtsum function does not respond to the weird variable names but will accept them devoid of (%)
myxt.res <- xtsum(countryN~., data=global_mortality)
myxt.res

##              O.mean O.sd  O.min       O.max    O.SumSQ  O.N  B.mean B.sd 
## year         2003   7.79  1990        2016     373464   6156 2003   0    
## Cancers      14.387 8.154 0.5822726   33.6175  409267.1 6156 14.387 8.034
## Diabetes     6.286  4.436 0.3271329   35.81619 121099.2 6156 6.286  4.24 
## Dementia     3.221  2.746 0.04475178  16.67248 46417.87 6156 3.221  2.618
## Tuberculosis 2.133  2.648 0.01088091  16.46586 43159.87 6156 2.133  2.554
## Suicide      1.391  1.111 0.1016103   15.41202 7591.252 6156 1.391  1.086
## Malaria      1.801  4.213 0           24.42596 109251.4 6156 1.801  4.11 
## Homicide     0.983  1.454 0.04520209  14.22926 13007.95 6156 0.983  1.425
## Meningitis   0.783  0.967 0.02798604  6.981346 5760.664 6156 0.783  0.951
## Drowning     0.714  0.514 0.05330638  4.510948 1628.778 6156 0.714  0.487
## Hepatitis    0.161  0.165 0.004847886 1.583289 167.675  6156 0.161  0.158
## Fire         0.334  0.181 0.05691324  1.343686 201.769  6156 0.334  0.168
## Conflict     0.291  2.399 0           82.317   27372.61 4758 0.291  0.921
## Terrorism    0.037  0.229 0           5.877    250.161  4758 0.037  0.121
##              B.min       B.max     Units t.bar W.sd  W.min      W.max    
## year         2003        2003      228   27    7.79  -13        13       
## Cancers      3.109526    30.19716  228   27    1.489 -8.017757  5.449552 
## Diabetes     1.183787    27.70547  228   27    1.331 -9.750057  10.92867 
## Dementia     0.2734327   11.82133  228   27    0.847 -5.183193  5.145011 
## Tuberculosis 0.03205589  14.84698  228   27    0.72  -6.169898  7.156982 
## Suicide      0.2441106   12.16457  228   27    0.243 -2.210464  3.247445 
## Malaria      0           20.10408  228   27    0.963 -7.916651  6.903596 
## Homicide     0.05593046  11.06133  228   27    0.303 -3.487857  3.167933 
## Meningitis   0.04221932  5.315156  228   27    0.187 -1.984723  2.166714 
## Drowning     0.06142987  3.515331  228   27    0.168 -1.775808  1.274717 
## Hepatitis    0.008156611 0.9932521 228   27    0.048 -0.3946904 0.8763799
## Fire         0.07003296  0.9950066 228   27    0.069 -0.4329619 0.8721761
## Conflict     0           9.320885  183   26    2.216 -9.320885  78.38404 
## Terrorism    0           1.298962  183   26    0.195 -1.298962  4.578038 
##              W.SumSQ  Within.Ovr.Ratio
## year         373464   1               
## Cancers      13651.25 0.033           
## Diabetes     10897.8  0.09            
## Dementia     4420.371 0.095           
## Tuberculosis 3193.133 0.074           
## Suicide      364.808  0.048           
## Malaria      5711.216 0.052           
## Homicide     564.186  0.043           
## Meningitis   216.175  0.038           
## Drowning     174.405  0.107           
## Hepatitis    14.147   0.084           
## Fire         29.045   0.144           
## Conflict     23357.85 0.853           
## Terrorism    180.881  0.723

The function output can be read as follows. It needs better formatting as a table. For now, O. is an overall measure, overall mean, standard deviation, minimum, maximum, sum of squares and total observations. A between mean, standard deviation, minimum, and maximum with the number of units and the average number of time points. Finally, we have a within standard deviation, minimum, maximum, sum of squares, and a within proportion of the overall variance. In this case, terrorism and conflict are the two variables that vary almost entirely within and far less between countries. I suspect this is because they are rather high in a few places and consistently so.

ggplot.res <- data.frame(myxt.res)
ggplot.res <- ggplot.res[-1,]; ggplot.res <- ggplot.res[,17]
mydf <- t(data.frame(ggplot.res))
mydf <- data.frame(value=mydf,name=rownames(mydf))
mydf <- mydf %>% arrange(value)
mydf$name <- factor(mydf$name, levels = mydf$name)
mydf <- mydf %>% mutate(Emph=as.numeric(value>0.5))
mydf %>% ggplot(aes(name,value, fill=Emph)) + geom_bar(stat="identity") + coord_flip() + ylab("Within Percent of Total Variation") + xlab("Cause of Mortality") + ggtitle("Within-country variation in the Causes of Death") + guides(fill="none")

Mean.Homicide <- global_mortality %>% group_by(countryF) %>% summarise(mymean=mean(Homicide, na.rm=TRUE))
Mean.Homicide %>% ungroup() %>% arrange(mymean) -> mydf
mydf$countryF <- factor(mydf$countryF, levels = mydf$countryF)
mydf %>% ungroup() %>% top_n(30, mymean) %>% arrange(mymean) %>% mutate(Emph=c(rep(1,10),rep(2,10),rep(3,10))) -> mydf
mydf %>% ggplot(aes(countryF,mymean, fill=Emph)) + geom_bar(stat="identity") + coord_flip() + ylab("Homicides") + ggtitle("Top 30 Countries/Places in Homicide") + xlab("") + guides(fill="none") -> Homicideplot
Homicideplot

global_mortality %>% filter(year==2016) %>% ggplot(aes(`Drug disorders`,`Alcohol disorders`)) + geom_point() + ggtitle("Drugs and Alcohol in 2016") -> scatterDA
scatterDA