Longitudinal Panel Data R Packages
Longitudinal and Panel Data Analysis in R
Goal: A CRAN task view for panel/longitudinal data analysis in R.
What is Panel Data?
Panel data are variously called longitudinal data, panel data, cross-sectional time series, and pooled time series. The most precise definition is two-dimensional data; invariably one of the dimensions is time. A general depiction of a model with linear coefficients for such data structures, though ridiculously overparameterized, looks like so:
\[ y_{it} = \alpha_{it} + X_{it}\beta_{it} + \epsilon_{it} \]
For example, suppose that we have a set of countries, or US states, municipalities in a state, or even individual people, indexed by \(i \in N\). We observe those data at \(t \in T\) points in time. To be clear, a standard cross-section of data in this notation suppresses \(t\) because there is no observed over-time variation. A single time series suppresses the \(i\). Of course, \(N\) and \(T\) are greater than 1.
Common transformations: ANOVA
There are common transformations undertaken on the data prior to analysis. For example, the within transformation that underlies a fixed effects regression model transforms each vector of data (for \(i\)) as follows:
\[ x^{W}_{it} = x_{it} - \overline{x}_{i} \]
The within-transformed data are of identical dimensions to the untransformed data. The other common transformation isolates the \(N\)-vector of unit means – the so-called between data. Because the two pieces are non-overlapping/orthogonal, the key result from applied statistics is that the total variation in a variable \(x\) is the sum of the within variance and the between variance.
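Written out for a balanced panel, the between data and the variance decomposition are as follows (a sketch in the notation above, where \(\overline{x}\) denotes the grand mean):

\[ x^{B}_{i} = \overline{x}_{i} = \frac{1}{T}\sum_{t=1}^{T} x_{it} \]

\[ \sum_{i}\sum_{t}\left(x_{it} - \overline{x}\right)^2 = \sum_{i}\sum_{t}\left(x_{it} - \overline{x}_{i}\right)^2 + T\sum_{i}\left(\overline{x}_{i} - \overline{x}\right)^2 \]

The first term on the right is the within sum of squares and the second is the between sum of squares; the cross terms vanish because the within deviations sum to zero for each unit.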
Some Basic Summaries
dplyr makes the basic summary and presentation quite straightforward. plm
contains a few datasets; I will choose an excerpt from Summers and Heston’s Penn World Tables – SumHes.
library(plm)        # panel data models and example datasets
library(tidyverse)  # loads dplyr for the grouped summaries below
data(SumHes)        # Summers-Heston Penn World Tables excerpt
summary(SumHes)
## year country opec com pop
## Min. :1960 ALGERIA : 26 no :3146 no :3120 Min. : 42
## 1st Qu.:1966 ANGOLA : 26 yes: 104 yes: 130 1st Qu.: 2266
## Median :1972 BENIN : 26 Median : 5919
## Mean :1972 BOTSWANA : 26 Mean : 29436
## 3rd Qu.:1979 BURKINA FASO: 26 3rd Qu.: 18529
## Max. :1985 BURUNDI : 26 Max. :1051013
## (Other) :3094
## gdp sr
## Min. : 257 Min. :-4.50
## 1st Qu.: 942 1st Qu.: 8.90
## Median : 1957 Median :16.30
## Mean : 3400 Mean :16.91
## 3rd Qu.: 4640 3rd Qu.:24.10
## Max. :16570 Max. :45.50
##
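With the data in hand, dplyr makes unit-level aggregates a one-liner. For example, a grouped summary of mean GDP per capita and saving rates by country (a minimal sketch; the variable names come from SumHes above):

# One row per country: average GDP per capita and saving rate over the sample.
SumHes %>% group_by(country) %>% summarise(mean.gdp = mean(gdp), mean.sr = mean(sr)) %>% head()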
To summarize the data in a fashion that respects the panel structure, decomposing each variable's variation into within and between components as Stata's xtsum does, takes a bit more work with dplyr's group_by and summarise. I will illustrate with the NLS young women extract (nlswork) from the Stata press site; the panel unit is idcode.
library(tidyverse)  # dplyr and friends
library(rlang)      # quasiquotation tools: enquo(), ensym(), !!
library(haven)      # read Stata .dta files
nlswork <- read_stata("http://www.stata-press.com/data/r13/nlswork.dta")
nlswork2 <- nlswork
# Convert race to character so the numeric-only filter below drops it.
nlswork2$race <- as.character(nlswork$race)
xtsum <- function(formula, data) {
  require(rlang)
  require(tidyverse)
  # The left-hand side of the formula names the panel unit; the right-hand
  # side (e.g., ".") names the variables to be summarised.
  pform <- terms(formula, data = data)
  unit <- pform[[2]]
  vars <- attr(pform, "term.labels")
  # Keep only numeric/integer columns; characters and factors are dropped.
  cls <- sapply(data, class)
  data <- data %>% select(which(cls %in% c("numeric", "integer")))
  varnames <- intersect(names(data), vars)
  # Overall, between, and within summaries for a single variable.
  sumfunc <- function(data, varname, unit) {
    loc.unit <- enquo(unit)
    varname <- ensym(varname)
    # Overall moments on the non-missing observations.
    ores <- data %>% filter(!is.na(!!varname)) %>% summarise(
      O.mean = round(mean(`$`(data, !!varname), na.rm = TRUE), digits = 3),
      O.sd = round(sd(`$`(data, !!varname), na.rm = TRUE), digits = 3),
      O.min = min(`$`(data, !!varname), na.rm = TRUE),
      O.max = max(`$`(data, !!varname), na.rm = TRUE),
      O.SumSQ = round(sum(scale(`$`(data, !!varname), center = TRUE, scale = FALSE)^2, na.rm = TRUE), digits = 3),
      O.N = sum(as.numeric(!is.na(`$`(data, !!varname)))))
    # Between moments: one mean per unit, then summarise the unit means.
    bmeans <- data %>% filter(!is.na(!!varname)) %>% group_by(!!loc.unit) %>% summarise(
      meanx = mean(`$`(.data, !!varname), na.rm = TRUE),
      t.count = sum(as.numeric(!is.na(`$`(.data, !!varname)))))
    bres <- bmeans %>% ungroup() %>% summarise(
      B.sd = round(sd(meanx, na.rm = TRUE), digits = 3),
      B.min = min(meanx, na.rm = TRUE),
      B.max = max(meanx, na.rm = TRUE),
      Units = sum(as.numeric(!is.na(t.count))),
      t.bar = round(mean(t.count, na.rm = TRUE), digits = 3))
    # Within moments: apply the within transformation (demean by unit).
    wdat <- data %>% filter(!is.na(!!varname)) %>% group_by(!!loc.unit) %>% mutate(
      W.x = scale(`$`(.data, !!varname), scale = FALSE))
    wres <- wdat %>% ungroup() %>% summarise(
      W.sd = round(sd(W.x, na.rm = TRUE), digits = 3),
      W.min = min(W.x, na.rm = TRUE),
      W.max = max(W.x, na.rm = TRUE),
      W.SumSQ = round(sum(W.x^2, na.rm = TRUE), digits = 3))
    # Share of the total sum of squares that is within variation.
    W.Ratio <- round(wres$W.SumSQ / ores$O.SumSQ, digits = 3)
    return(c(ores, bres, wres, Within.Ovr.Ratio = W.Ratio))
  }
  # Apply to every numeric variable on the right-hand side of the formula.
  res1 <- sapply(varnames, function(x) { sumfunc(data, !!x, !!unit) })
  return(t(res1))
}
xtsum(idcode~., data=nlswork2)
## O.mean O.sd O.min O.max O.SumSQ O.N B.sd B.min B.max
## year 77.959 6.384 68 88 1162831 28534 5.157 68 88
## birth_yr 48.085 3.013 41 54 258999.4 28534 3.052 41 54
## age 29.045 6.701 14 46 1279992 28510 5.486 14 45
## msp 0.603 0.489 0 1 6827.437 28518 0.398 0 1
## nev_mar 0.23 0.421 0 1 5045.599 28518 0.368 0 1
## grade 12.533 2.324 0 18 154082.7 28532 2.567 0 18
## collgrad 0.168 0.374 0 1 3989.224 28534 0.405 0 1
## not_smsa 0.282 0.45 0 1 5781.348 28526 0.411 0 1
## c_city 0.357 0.479 0 1 6549.949 28526 0.427 0 1
## south 0.41 0.492 0 1 6898.155 28526 0.467 0 1
## ind_code 7.693 2.994 1 12 252718.4 28193 2.543 1 12
## occ_code 4.778 3.065 1 13 266984.6 28413 2.865 1 13
## union 0.234 0.424 0 1 3452.712 19238 0.334 0 1
## wks_ue 2.548 7.294 0 76 1214713 22830 5.181 0 76
## ttl_exp 6.215 4.652 0 28.88461 617516.8 28534 3.724 0 24.7062
## tenure 3.124 3.751 0 25.91667 395453.2 28101 2.797 0 21.16667
## hours 36.56 9.87 1 168 2772858 28467 7.847 1 83.5
## wks_work 53.989 29.032 0 104 23457236 27831 20.645 0 104
## ln_wage 1.675 0.478 0 5.263916 6521.884 28534 0.425 0 3.912023
## Units t.bar W.sd W.min W.max W.SumSQ Within.Ovr.Ratio
## year 4711 6.057 5.138 -14.16667 14.75 753323.2 0.648
## birth_yr 4711 6.057 0 0 0 0 0
## age 4710 6.053 5.169 -14.25 14.75 761852.2 0.595
## msp 4711 6.053 0.324 -0.9333333 0.9333333 2991.619 0.438
## nev_mar 4711 6.053 0.246 -0.9333333 0.9333333 1720.909 0.341
## grade 4709 6.059 0 0 0 0 0
## collgrad 4711 6.057 0 0 0 0 0
## not_smsa 4711 6.055 0.183 -0.9285714 0.9333333 959.921 0.166
## c_city 4711 6.055 0.249 -0.9333333 0.9333333 1768.61 0.27
## south 4711 6.055 0.16 -0.9333333 0.9333333 728.353 0.106
## ind_code 4695 6.005 1.708 -9.2 9.428571 82284.83 0.326
## occ_code 4699 6.047 1.65 -10.3 10.66667 77374.88 0.29
## union 4150 4.636 0.267 -0.9166667 0.9166667 1369.971 0.397
## wks_ue 4645 4.915 6.054 -36.5 61.83333 836703.7 0.689
## ttl_exp 4711 6.057 3.484 -15.85799 14.1656 346367.2 0.561
## tenure 4699 5.98 2.66 -17.40278 12.5 198792.1 0.503
## hours 4710 6.044 7.521 -38.71429 93.5 1610069 0.581
## wks_work 4686 5.939 23.97 -72.42857 77.16667 15990016 0.682
## ln_wage 4711 6.057 0.293 -2.082629 3.108762 2443.848 0.375
The previous command takes a panel dataset organized by unit and decomposes each variable's variation into its within and between components (standard deviations, extrema, and sums of squares). To summarize an individual variable in the same fashion, the function XTSUM below should accomplish the task.
XTSUM <- function(data, varname, unit) {
  varname <- enquo(varname)
  unit <- enquo(unit)
  # Overall moments across all observations.
  ores <- data %>% summarise(ovr.mean = mean(!!varname, na.rm = TRUE), ovr.sd = sd(!!varname, na.rm = TRUE), ovr.min = min(!!varname, na.rm = TRUE), ovr.max = max(!!varname, na.rm = TRUE))
  # Between moments: one mean per unit, then summarise the unit means.
  bmeans <- data %>% group_by(!!unit) %>% summarise(meanx = mean(!!varname, na.rm = TRUE))
  bres <- bmeans %>% summarise(between.sd = sd(meanx, na.rm = TRUE), between.min = min(meanx, na.rm = TRUE), between.max = max(meanx, na.rm = TRUE), Groups = n())
  # Within moments: demean by unit (the within transformation).
  wdat <- data %>% group_by(!!unit) %>% mutate(W.x = scale(!!varname, scale = FALSE))
  wres <- wdat %>% ungroup() %>% summarise(within.sd = sd(W.x, na.rm = TRUE), within.min = min(W.x, na.rm = TRUE), within.max = max(W.x, na.rm = TRUE))
  return(list(ores = ores, bres = bres, wres = wres))
}
XTSUM(nlswork, varname=hours, unit=idcode)
## $ores
## # A tibble: 1 x 4
## ovr.mean ovr.sd ovr.min ovr.max
## <dbl> <dbl> <dbl> <dbl>
## 1 36.6 9.87 1 168
##
## $bres
## # A tibble: 1 x 4
## between.sd between.min between.max Groups
## <dbl> <dbl> <dbl> <int>
## 1 7.85 1 83.5 4711
##
## $wres
## # A tibble: 1 x 3
## within.sd within.min within.max
## <dbl> <dbl> <dbl>
## 1 7.52 -38.7 93.5
Models for panel data: Standard models in plm()
The workhorse package for the basic panel data models in R is plm. The JStatSoft article was adapted into the package vignette and is quite detailed, though dated. The ridiculously overparameterized equation above can be restricted in various ways to derive the commonly deployed models.
A Pooled Regression
A pooled regression simply suppresses all subscripts on \(\alpha\) and \(\beta\):
\[ y_{it} = \alpha + X_{it}\beta + \epsilon_{it} \]
A Within Regression
Applies the within transformation defined above to both sides; the unit means, and with them any time-constant regressors, are swept out:
\[ y^{W}_{it} = X^{W}_{it}\beta^{W} + \epsilon^{W}_{it} \]
A Between Regression
Regresses the unit means of \(y\) on the unit means of \(X\):
\[ \overline{y}_{i} = \alpha + \overline{X}_{i}\beta^{B} + \epsilon^{B}_{i} \]
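Each of these can be requested through plm's model argument. A minimal sketch, assuming we relate per-capita GDP (gdp) to the saving rate (sr) in the SumHes excerpt loaded earlier:

# index names the unit and time dimensions of the panel.
pool.fit <- plm(gdp ~ sr, data = SumHes, index = c("country", "year"), model = "pooling")
fe.fit <- plm(gdp ~ sr, data = SumHes, index = c("country", "year"), model = "within")
be.fit <- plm(gdp ~ sr, data = SumHes, index = c("country", "year"), model = "between")
summary(fe.fit)  # the within (fixed effects) estimates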
Models with heterogeneous time trends: phtt()
The phtt package implements panel data models with heterogeneous time trends; the article describing it appears in JStatSoft.
Multi-state models in continuous time: msm()
The msm package fits time-homogeneous and time-inhomogeneous continuous-time Markov multi-state models for categorical outcomes.
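A minimal sketch following the cav (heart transplant) example shipped with msm: states 1 through 3 are increasing disease severity, state 4 is death, and the qmatrix supplies the allowed transitions.

library(msm)
data(cav)
# Off-diagonal entries are starting values for the allowed transition
# intensities; structural zeros forbid a transition. Row 4 (death) is absorbing.
Q <- rbind(c(0, 0.25, 0, 0.25),
           c(0.166, 0, 0.166, 0.166),
           c(0, 0.25, 0, 0.25),
           c(0, 0, 0, 0))
cav.msm <- msm(state ~ years, subject = PTNUM, data = cav, qmatrix = Q)
cav.msm  # prints the estimated transition intensity matrix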
Latent Markov models with LMest()
LM models are designed to deal with univariate and multivariate longitudinal data based on the repeated observation of a panel of subjects across time. In more detail, LM models are specially tailored to study the evolution of an individual characteristic of interest that is not directly observable. This characteristic is represented by a latent process following a Markov chain, as in a hidden Markov (HM) model (Zucchini and MacDonald, 2009). These models also allow us to account for time-varying unobserved heterogeneity, in addition to the effect of observable covariates on the response variables.
The preprint is here.
cquad
In non-linear panel data models, unobserved heterogeneity and state dependence are hardest to disentangle where the data carry the least information: binary outcomes. The cquad package for Stata and R implements an approach by Bartolucci and Nigro, described in Bartolucci and Pigini.