Longitudinal Panel Data R Packages
Longitudinal and Panel Data Analysis in R
Goal: A CRAN task view for panel/longitudinal data analysis in R.
What is Panel Data?
Panel data are variously called longitudinal data, panel data, cross-sectional time series, and pooled time series. The most precise definition is two-dimensional data; invariably one of the dimensions is time. A general depiction of a model with linear coefficients for such data structures, though ridiculously overparameterized, looks like so:
\[ y_{it} = \alpha_{it} + X_{it}\beta_{it} + \epsilon_{it} \]
For example, suppose that we have a set of countries, or US states, municipalities in a state, or even individual people, indexed by \(i \in N\). We observe those data at \(t \in T\) points in time. To be clear, a standard cross-section of data in this notation suppresses \(t\) because there is no observed over-time variation. A single time series suppresses the \(i\). Of course, \(N\) and \(T\) are greater than 1.
Common transformations: ANOVA
There are common transformations undertaken on the data prior to analysis. For example, the within transformation that underlies a fixed effects regression model transforms each vector of data (for \(i\)) as follows:
\[ x^{W}_{it} = x_{it} - \overline{x}_{i} \]
The within-transformed data are of identical dimensions to the untransformed data. The other common transformation isolates the \(N\)-vector of unit means – the so-called between data. Because the two pieces are non-overlapping/orthogonal, the key result from applied statistics is that the total variation in a variable \(x\) is the sum of the within variance and the between variance.
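Written out for a balanced panel, the between data and the variance decomposition are as follows (a sketch in the notation above, where \(\overline{x}\) denotes the grand mean):

\[ x^{B}_{i} = \overline{x}_{i} = \frac{1}{T}\sum_{t=1}^{T} x_{it} \]

\[ \sum_{i}\sum_{t}\left(x_{it} - \overline{x}\right)^2 = \sum_{i}\sum_{t}\left(x_{it} - \overline{x}_{i}\right)^2 + T\sum_{i}\left(\overline{x}_{i} - \overline{x}\right)^2 \]

The first term on the right is the within sum of squares and the second is the between sum of squares; the cross terms vanish because the within deviations sum to zero for each unit.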
Some Basic Summaries
dplyr makes the basic summary and presentation quite straightforward. plm
contains a few datasets; I will choose an excerpt from Summers and Heston’s Penn World Tables – SumHes.
library(plm)        # panel data models and example datasets
library(tidyverse)  # loads dplyr for the grouped summaries below
data(SumHes)        # Summers-Heston Penn World Tables excerpt
summary(SumHes)
## year country opec com pop
## Min. :1960 ALGERIA : 26 no :3146 no :3120 Min. : 42
## 1st Qu.:1966 ANGOLA : 26 yes: 104 yes: 130 1st Qu.: 2266
## Median :1972 BENIN : 26 Median : 5919
## Mean :1972 BOTSWANA : 26 Mean : 29436
## 3rd Qu.:1979 BURKINA FASO: 26 3rd Qu.: 18529
## Max. :1985 BURUNDI : 26 Max. :1051013
## (Other) :3094
## gdp sr
## Min. : 257 Min. :-4.50
## 1st Qu.: 942 1st Qu.: 8.90
## Median : 1957 Median :16.30
## Mean : 3400 Mean :16.91
## 3rd Qu.: 4640 3rd Qu.:24.10
## Max. :16570 Max. :45.50
##
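With the data in hand, dplyr makes unit-level aggregates a one-liner. For example, a grouped summary of mean GDP per capita and saving rates by country (a minimal sketch; the variable names come from SumHes above):

# One row per country: average GDP per capita and saving rate over the sample.
SumHes %>% group_by(country) %>% summarise(mean.gdp = mean(gdp), mean.sr = mean(sr)) %>% head()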
To summarize the data in a fashion that respects the panel structure, decomposing each variable's variation into within and between components as Stata's xtsum does, takes a bit more work with dplyr's group_by and summarise. I will illustrate with the NLS young women extract (nlswork) from the Stata press site; the panel unit is idcode.
library(tidyverse)  # dplyr and friends
library(rlang)      # quasiquotation tools: enquo(), ensym(), !!
library(haven)      # read Stata .dta files
nlswork <- read_stata("http://www.stata-press.com/data/r13/nlswork.dta")
nlswork2 <- nlswork
# Convert race to character so the numeric-only filter below drops it.
nlswork2$race <- as.character(nlswork$race)
xtsum <- function(formula, data) {
  require(rlang)
  require(tidyverse)
  # The left-hand side of the formula names the panel unit; the right-hand
  # side (e.g., ".") names the variables to be summarised.
  pform <- terms(formula, data = data)
  unit <- pform[[2]]
  vars <- attr(pform, "term.labels")
  # Keep only numeric/integer columns; characters and factors are dropped.
  cls <- sapply(data, class)
  data <- data %>% select(which(cls %in% c("numeric", "integer")))
  varnames <- intersect(names(data), vars)
  # Overall, between, and within summaries for a single variable.
  sumfunc <- function(data, varname, unit) {
    loc.unit <- enquo(unit)
    varname <- ensym(varname)
    # Overall moments on the non-missing observations.
    ores <- data %>% filter(!is.na(!!varname)) %>% summarise(
      O.mean = round(mean(`$`(data, !!varname), na.rm = TRUE), digits = 3),
      O.sd = round(sd(`$`(data, !!varname), na.rm = TRUE), digits = 3),
      O.min = min(`$`(data, !!varname), na.rm = TRUE),
      O.max = max(`$`(data, !!varname), na.rm = TRUE),
      O.SumSQ = round(sum(scale(`$`(data, !!varname), center = TRUE, scale = FALSE)^2, na.rm = TRUE), digits = 3),
      O.N = sum(as.numeric(!is.na(`$`(data, !!varname)))))
    # Between moments: one mean per unit, then summarise the unit means.
    bmeans <- data %>% filter(!is.na(!!varname)) %>% group_by(!!loc.unit) %>% summarise(
      meanx = mean(`$`(.data, !!varname), na.rm = TRUE),
      t.count = sum(as.numeric(!is.na(`$`(.data, !!varname)))))
    bres <- bmeans %>% ungroup() %>% summarise(
      B.sd = round(sd(meanx, na.rm = TRUE), digits = 3),
      B.min = min(meanx, na.rm = TRUE),
      B.max = max(meanx, na.rm = TRUE),
      Units = sum(as.numeric(!is.na(t.count))),
      t.bar = round(mean(t.count, na.rm = TRUE), digits = 3))
    # Within moments: apply the within transformation (demean by unit).
    wdat <- data %>% filter(!is.na(!!varname)) %>% group_by(!!loc.unit) %>% mutate(
      W.x = scale(`$`(.data, !!varname), scale = FALSE))
    wres <- wdat %>% ungroup() %>% summarise(
      W.sd = round(sd(W.x, na.rm = TRUE), digits = 3),
      W.min = min(W.x, na.rm = TRUE),
      W.max = max(W.x, na.rm = TRUE),
      W.SumSQ = round(sum(W.x^2, na.rm = TRUE), digits = 3))
    # Share of the total sum of squares that is within variation.
    W.Ratio <- round(wres$W.SumSQ / ores$O.SumSQ, digits = 3)
    return(c(ores, bres, wres, Within.Ovr.Ratio = W.Ratio))
  }
  # Apply to every numeric variable on the right-hand side of the formula.
  res1 <- sapply(varnames, function(x) { sumfunc(data, !!x, !!unit) })
  return(t(res1))
}
xtsum(idcode~., data=nlswork2)
## O.mean O.sd O.min O.max O.SumSQ O.N B.sd B.min B.max
## year 77.959 6.384 68 88 1162831 28534 5.157 68 88
## birth_yr 48.085 3.013 41 54 258999.4 28534 3.052 41 54
## age 29.045 6.701 14 46 1279992 28510 5.486 14 45
## msp 0.603 0.489 0 1 6827.437 28518 0.398 0 1
## nev_mar 0.23 0.421 0 1 5045.599 28518 0.368 0 1
## grade 12.533 2.324 0 18 154082.7 28532 2.567 0 18
## collgrad 0.168 0.374 0 1 3989.224 28534 0.405 0 1
## not_smsa 0.282 0.45 0 1 5781.348 28526 0.411 0 1
## c_city 0.357 0.479 0 1 6549.949 28526 0.427 0 1
## south 0.41 0.492 0 1 6898.155 28526 0.467 0 1
## ind_code 7.693 2.994 1 12 252718.4 28193 2.543 1 12
## occ_code 4.778 3.065 1 13 266984.6 28413 2.865 1 13
## union 0.234 0.424 0 1 3452.712 19238 0.334 0 1
## wks_ue 2.548 7.294 0 76 1214713 22830 5.181 0 76
## ttl_exp 6.215 4.652 0 28.88461 617516.8 28534 3.724 0 24.7062
## tenure 3.124 3.751 0 25.91667 395453.2 28101 2.797 0 21.16667
## hours 36.56 9.87 1 168 2772858 28467 7.847 1 83.5
## wks_work 53.989 29.032 0 104 23457236 27831 20.645 0 104
## ln_wage 1.675 0.478 0 5.263916 6521.884 28534 0.425 0 3.912023
## Units t.bar W.sd W.min W.max W.SumSQ Within.Ovr.Ratio
## year 4711 6.057 5.138 -14.16667 14.75 753323.2 0.648
## birth_yr 4711 6.057 0 0 0 0 0
## age 4710 6.053 5.169 -14.25 14.75 761852.2 0.595
## msp 4711 6.053 0.324 -0.9333333 0.9333333 2991.619 0.438
## nev_mar 4711 6.053 0.246 -0.9333333 0.9333333 1720.909 0.341
## grade 4709 6.059 0 0 0 0 0
## collgrad 4711 6.057 0 0 0 0 0
## not_smsa 4711 6.055 0.183 -0.9285714 0.9333333 959.921 0.166
## c_city 4711 6.055 0.249 -0.9333333 0.9333333 1768.61 0.27
## south 4711 6.055 0.16 -0.9333333 0.9333333 728.353 0.106
## ind_code 4695 6.005 1.708 -9.2 9.428571 82284.83 0.326
## occ_code 4699 6.047 1.65 -10.3 10.66667 77374.88 0.29
## union 4150 4.636 0.267 -0.9166667 0.9166667 1369.971 0.397
## wks_ue 4645 4.915 6.054 -36.5 61.83333 836703.7 0.689
## ttl_exp 4711 6.057 3.484 -15.85799 14.1656 346367.2 0.561
## tenure 4699 5.98 2.66 -17.40278 12.5 198792.1 0.503
## hours 4710 6.044 7.521 -38.71429 93.5 1610069 0.581
## wks_work 4686 5.939 23.97 -72.42857 77.16667 15990016 0.682
## ln_wage 4711 6.057 0.293 -2.082629 3.108762 2443.848 0.375
The previous command takes a panel dataset organized by unit and decomposes each variable's variation into its within and between components (standard deviations, extrema, and sums of squares). To summarize an individual variable in the same fashion, the function XTSUM below should accomplish the task.
XTSUM <- function(data, varname, unit) {
  varname <- enquo(varname)
  unit <- enquo(unit)
  # Overall moments across all observations.
  ores <- data %>% summarise(ovr.mean = mean(!!varname, na.rm = TRUE), ovr.sd = sd(!!varname, na.rm = TRUE), ovr.min = min(!!varname, na.rm = TRUE), ovr.max = max(!!varname, na.rm = TRUE))
  # Between moments: one mean per unit, then summarise the unit means.
  bmeans <- data %>% group_by(!!unit) %>% summarise(meanx = mean(!!varname, na.rm = TRUE))
  bres <- bmeans %>% summarise(between.sd = sd(meanx, na.rm = TRUE), between.min = min(meanx, na.rm = TRUE), between.max = max(meanx, na.rm = TRUE), Groups = n())
  # Within moments: demean by unit (the within transformation).
  wdat <- data %>% group_by(!!unit) %>% mutate(W.x = scale(!!varname, scale = FALSE))
  wres <- wdat %>% ungroup() %>% summarise(within.sd = sd(W.x, na.rm = TRUE), within.min = min(W.x, na.rm = TRUE), within.max = max(W.x, na.rm = TRUE))
  return(list(ores = ores, bres = bres, wres = wres))
}
XTSUM(nlswork, varname=hours, unit=idcode)
## $ores
## # A tibble: 1 x 4
## ovr.mean ovr.sd ovr.min ovr.max
## <dbl> <dbl> <dbl> <dbl>
## 1 36.6 9.87 1 168
##
## $bres
## # A tibble: 1 x 4
## between.sd between.min between.max Groups
## <dbl> <dbl> <dbl> <int>
## 1 7.85 1 83.5 4711
##
## $wres
## # A tibble: 1 x 3
## within.sd within.min within.max
## <dbl> <dbl> <dbl>
## 1 7.52 -38.7 93.5
Models for panel data: Standard models in plm()
The workhorse package for the basic panel data models in R is plm. The JStatSoft article was adapted into the package vignette and is quite detailed, though dated. The ridiculously overparameterized equation above can be restricted in various ways to derive the commonly deployed models.
A Pooled Regression
A pooled regression simply suppresses all subscripts on \(\alpha\) and \(\beta\):
\[ y_{it} = \alpha + X_{it}\beta + \epsilon_{it} \]
A Within Regression
Applies the within transformation defined above to both sides; the unit means, and with them any time-constant regressors, are swept out:
\[ y^{W}_{it} = X^{W}_{it}\beta^{W} + \epsilon^{W}_{it} \]
A Between Regression
Regresses the unit means of \(y\) on the unit means of \(X\):
\[ \overline{y}_{i} = \alpha + \overline{X}_{i}\beta^{B} + \epsilon^{B}_{i} \]
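Each of these can be requested through plm's model argument. A minimal sketch, assuming we relate per-capita GDP (gdp) to the saving rate (sr) in the SumHes excerpt loaded earlier:

# index names the unit and time dimensions of the panel.
pool.fit <- plm(gdp ~ sr, data = SumHes, index = c("country", "year"), model = "pooling")
fe.fit <- plm(gdp ~ sr, data = SumHes, index = c("country", "year"), model = "within")
be.fit <- plm(gdp ~ sr, data = SumHes, index = c("country", "year"), model = "between")
summary(fe.fit)  # the within (fixed effects) estimates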
Models with heterogeneous time trends: phtt()
The phtt package implements panel data models with heterogeneous time trends; the article describing it appears in JStatSoft.
Multi-state models in continuous time: msm()
The msm package fits time-homogeneous and time-inhomogeneous continuous-time Markov multi-state models for categorical outcomes.
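A minimal sketch following the cav (heart transplant) example shipped with msm: states 1 through 3 are increasing disease severity, state 4 is death, and the qmatrix supplies the allowed transitions.

library(msm)
data(cav)
# Off-diagonal entries are starting values for the allowed transition
# intensities; structural zeros forbid a transition. Row 4 (death) is absorbing.
Q <- rbind(c(0, 0.25, 0, 0.25),
           c(0.166, 0, 0.166, 0.166),
           c(0, 0.25, 0, 0.25),
           c(0, 0, 0, 0))
cav.msm <- msm(state ~ years, subject = PTNUM, data = cav, qmatrix = Q)
cav.msm  # prints the estimated transition intensity matrix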
Latent Markov models with LMest()
LM models are designed to deal with univariate and multivariate longitudinal data based on the repeated observation of a panel of subjects across time. In more detail, LM models are specially tailored to study the evolution of an individual characteristic of interest that is not directly observable. This characteristic is represented by a latent process following a Markov chain, as in a hidden Markov (HM) model (Zucchini and MacDonald, 2009). These models also allow us to account for time-varying unobserved heterogeneity, in addition to the effect of observable covariates on the response variables.
The preprint is here.
cquad
In non-linear panel data models, unobserved heterogeneity and state dependence are hardest to disentangle where the data carry the least information: binary outcomes. The cquad package for Stata and R implements an approach by Bartolucci and Nigro, described in Bartolucci and Pigini.