October 19, 2020

Datasaurus Dozen

The `datasaurus dozen`

The datasaurus dozen is a fantastic teaching resource for examining the importance of data visualization. Let’s have a look. The basic idea is that all thirteen (datasaurus plus 12) contain nearly identical means and standard deviations though they do vary if the five number summaries are deployed. The scatterplots that are derived from data with similar x-y summaries is a useful reminder that data science is about patterns, not just statistics.

datasaurus <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-13/datasaurus.csv')

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   dataset = col_character(),
##   x = col_double(),
##   y = col_double()
## )

Two libraries to make our work easy.

library(tidyverse)
library(skimr)

First, the summary statistics. Summary statistics are great but they are no substitute for basic data familiarity. Notice, we have nearly identical means and standard deviations though the five number summaries do vary.

datasaurus %>% group_by(dataset) %>% skim_to_wide(x,y) %>% knitr::kable("html", 2) %>% scroll_box(width="100%", height="500px")

skim_type	skim_variable	dataset	complete_rate	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
numeric	x	away	1	54.27	16.77	15.56	39.72	53.34	69.15	91.64	▁▇▃▇▁
numeric	x	bullseye	1	54.27	16.77	19.29	41.63	53.84	64.80	91.74	▂▆▇▅▂
numeric	x	circle	1	54.27	16.76	21.86	43.38	54.02	64.97	85.66	▅▃▇▅▃
numeric	x	dino	1	54.26	16.77	22.31	44.10	53.33	64.74	98.21	▅▇▇▅▂
numeric	x	dots	1	54.26	16.77	25.44	50.36	50.98	75.20	77.95	▂▁▇▁▅
numeric	x	h_lines	1	54.26	16.77	22.00	42.29	53.07	66.77	98.29	▅▇▇▅▁
numeric	x	high_lines	1	54.27	16.77	17.89	41.54	54.17	63.95	96.08	▂▅▇▃▁
numeric	x	slant_down	1	54.27	16.77	18.11	42.89	53.14	64.47	95.59	▂▅▇▃▁
numeric	x	slant_up	1	54.27	16.77	20.21	42.81	54.26	64.49	95.26	▃▆▇▃▂
numeric	x	star	1	54.27	16.77	27.02	41.03	56.53	68.71	86.44	▅▇▇▃▆
numeric	x	v_lines	1	54.27	16.77	30.45	49.96	50.36	69.50	89.50	▃▇▁▅▁
numeric	x	wide_lines	1	54.27	16.77	27.44	35.52	64.55	67.45	77.92	▇▂▁▇▅
numeric	x	x_shape	1	54.26	16.77	31.11	40.09	47.14	71.86	85.45	▇▆▁▃▅
numeric	y	away	1	47.83	26.94	0.02	24.63	47.54	71.80	97.48	▅▆▃▇▃
numeric	y	bullseye	1	47.83	26.94	9.69	26.24	47.38	72.53	85.88	▇▆▃▅▇
numeric	y	circle	1	47.84	26.93	16.33	18.35	51.03	77.78	85.58	▇▁▁▂▆
numeric	y	dino	1	47.83	26.94	2.95	25.29	46.03	68.53	99.49	▇▇▇▅▆
numeric	y	dots	1	47.84	26.93	15.77	17.11	51.30	82.88	94.25	▇▁▇▁▆
numeric	y	h_lines	1	47.83	26.94	10.46	30.48	50.47	70.35	90.46	▆▇▇▅▅
numeric	y	high_lines	1	47.84	26.94	14.91	22.92	32.50	75.94	87.15	▇▁▁▃▅
numeric	y	slant_down	1	47.84	26.94	0.30	27.84	46.40	68.44	99.64	▆▇▇▅▆
numeric	y	slant_up	1	47.83	26.94	5.65	24.76	45.29	70.86	99.58	▇▇▇▅▅
numeric	y	star	1	47.84	26.93	14.37	20.37	50.11	63.55	92.21	▇▂▂▅▅
numeric	y	v_lines	1	47.84	26.94	2.73	22.75	47.11	65.85	99.69	▇▆▇▃▅
numeric	y	wide_lines	1	47.83	26.94	0.22	24.35	46.28	67.57	99.28	▇▇▇▅▆
numeric	y	x_shape	1	47.84	26.93	4.58	23.47	39.88	73.61	97.84	▇▇▂▆▅

Notice that all of the datasets are nearly identical. But have a look at them.

DP <- datasaurus %>% ggplot() + aes(x=x, y=y, color=dataset, group=dataset) + geom_point() + guides(color=FALSE) + facet_wrap(vars(dataset))
DP