6  Longitudinal Data

Although the term longitudinal naturally suggests that data are collected over time, the models and methods we will discuss broadly apply to any kind of repeated measurement data. That is, although repeated measurements most often take place over time, this is not the only way that measures may be taken repeatedly on the same unit. For example, measurements might be taken on the same subject under several different experimental conditions, or at several different locations on the same unit.

Thus, the methods will apply more broadly than the strict definition of the term longitudinal data indicates – the term will mean, to us, data in the form of repeated measurements that may well be over time but may also be over some other set of conditions. Because time is most often the measurement condition, however, many of our examples will involve repeated measurement over time. We will use the term outcome to denote the measurement of interest. Because units are often human or animal subjects, we use the terms unit, individual, and subject interchangeably.

6.1 Sources of Variation

We now consider why the values of a characteristic that we might observe vary over time.

  • Biological variation: It is well known that biological entities differ. Living things of the same type tend to be similar in their characteristics, but they are not identical (except perhaps in the case of genetically identical clones). Thus, even if we focus on rats of the same genetic strain, age, and gender, we expect variation in the possible weights of such rats that we might observe due to inherent, natural biological variation. This is variation due to genetic, environmental, and behavioral factors. This variation is at the unit level, so we see between-unit variation.

  • Variation due to condition or time: Entities change over time and under different conditions. If we consider rats over time or under various dietary conditions, we expect variation in the possible weights of such rats that we might observe due to condition or time. This variation is at the observation level, so we see within-unit variation.

  • Measurement error: We have discussed rat weight as though once we have a rat in hand, we may know its weight exactly. However, a scale must be used. Ideally, a scale should register the true weight of an item each time it is weighed, but because such devices are imperfect, scale measurements on the same item may vary time after time. The amount by which the measurement differs from the truth may be considered an error, i.e., a deviation up or down from the true value that could be observed with a perfect device. A fair or unbiased device does not systematically register high or low most of the time; rather, the errors may go in either direction with no pattern. This variation is typically at the observation level, so we see within-unit variation.

There are still further sources of variation that we could consider. They could be at the unit level or the observation level. For now, the important message is that, in considering statistical models, it is critical to be aware of the different sources of variation that cause observations to vary.

With cross-sectional data (one observed point in time in units sampled from across the population - no repeated measures), we

  • cannot distinguish between the types of variation
  • use explanatory variables to try and explain any sources of variation (biological and other)
  • use model error as a catch-all to account for any leftover variation (measurement or other)

With longitudinal data, since we have repeated measurements on units, we

  • separate variation within units (measurement and others) from variation between units (biological and others)
  • use time-varying explanatory variables to try and explain within-unit variation
  • use time-invariant explanatory variables to try and explain between-unit variation
  • use probability models to account for leftover within-unit or between-unit variation

6.2 Data Examples

We can consider several real datasets from various applications to put things into a firmer perspective. These will not only provide us with concrete examples of longitudinal data situations but also illustrate the range of ways that data may be collected and the types of measurements that may be of interest.

6.2.1 Example 1: The orthodontic study data of Potthoff and Roy (1964).

6.2.1.1 Data Context

A study was conducted involving 27 children, 16 boys and 11 girls. On each child, the distance (mm) from the center of the pituitary to the pterygomaxillary fissure was measured at ages 8, 10, 12, and 14 years. In the plot below, the distance measurements are plotted against the age of each child. The trajectory for each child is connected by a solid line so that individual child patterns may be seen, and the color of the lines denotes girls (0) and boys (1).

dat <- read.table("./data/dental.dat", col.names=c("obsno", "id", "age", "distance", "gender"))

dat %>%
  ggplot(aes(x = age, y = distance, col = factor(gender))) +
  geom_line(aes(group = id)) +
  geom_smooth(method='lm',se=FALSE,lwd=3) +
  theme_minimal() + 
  scale_color_discrete('Gender')
`geom_smooth()` using formula = 'y ~ x'

Plots like this are often called spaghetti plots, for obvious reasons! Each set of connected points represents one child and their outcome measurements over time.

6.2.1.2 Research Questions

The objectives of the study were to

  • Determine whether distances over time are larger for boys than for girls
  • Determine whether the rate of change (i.e., slope) for distance is similar between boys and girls.

Several features are notable from the spaghetti plot of the data:

  • Each child appears to have his/her own trajectory of distance as a function of age. For any given child, the trajectory looks roughly like a straight line, with some random fluctuations. But from child to child, the features of the trajectory (e.g., its steepness) vary. Thus, the trajectories are all similar in form but vary in their specific characteristics among children. Also, note any unusual trajectories. In particular, there is one boy whose pattern fluctuates more profoundly than those of the other children and one girl whose distance is much “lower” than the others at all time points.

  • The overall trend is for the distance measurement to increase with age. The trajectories for some children exhibit strict increases with age. In contrast, others show some intermittent decreases (biologically, is this possible? or just due to measurement error?), but still with an overall increasing trend across the entire six-year period.

  • The distance trajectories for boys seem, for the most part, to be “higher” than those for girls – most of the boy profiles involve larger distance measurements than those for girls. However, this is not uniformly true: some girls have larger distance measurements than boys at some ages.

  • Although boys seem to have larger distance measurements, the rate of change of the measurements with increasing age seems similar. More precisely, the slope of the increasing (approximate straight-line) relationship with age seems roughly similar for boys and girls. However, for any individual boy or girl, the rate of change (slope) may be steeper or shallower than the evident “typical” rate of change.

To address the questions of interest, we need a formal way of representing the fact that each child has an individual-specific trajectory.

6.2.2 Example 2: Vitamin E diet supplement and growth of guinea pigs

6.2.2.1 Data Context

The following data are reported by Crowder and Hand (1990, p. 27). The study concerned the effect of a vitamin E diet supplement on the growth of guinea pigs. Fifteen guinea pigs were given a growth-inhibiting substance at the beginning of week 1 of the study (time 0, before the first measurement), and body weight was measured at the ends of weeks 1, 3, and 4. At the beginning of week 5, the pigs were randomized into three groups of 5, and vitamin E therapy was started. One group received zero doses of vitamin E, another received a low dose, and the third received a high dose. Each guinea pig’s body weight (g) was measured at the end of weeks 5, 6, and 7. The data for the three dose groups are plotted below, with each line representing the body weight of one pig.

dat <- read.table("./data/diet.dat", col.names=c("id", paste("bw.",c(1,3,4,5,6,7),sep=""), "dose" ))

dat <- dat %>%
  gather(Tmp,bw, bw.1:bw.7) %>%
  separate(Tmp,into=c('Var','time'), remove=TRUE) 

dat$time <- as.numeric(dat$time)
dat$dose <- factor(dat$dose)
levels(dat$dose) <- c("Zero dose",' Low dose', 'High dose')
  

dat %>%
  ggplot(aes(x = time, y = bw, col = dose)) +
  geom_line(aes( group = id)) + 
  xlab('Weeks') +
  ylab('Body Weight (g)') +
  geom_smooth(lwd=2, se=FALSE) +
  theme_minimal()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

6.2.2.2 Research Questions

The primary objective of the study was to

  • Determine whether the growth patterns differed among the three groups.

As with the dental data, several features of the spaghetti plot are evident:

  • For the most part, the trajectories for individual guinea pigs seem to increase overall over the study period (although note pig 1 in the zero dose group). Guinea pigs in the same dose group have different trajectories, some of which look like a straight line and others of which seem to have a “dip” at the beginning of week 5, the time at which vitamin E was added in the low and high-dose groups.

  • The body weight for the zero dose group seems somewhat “lower” than those in the other dose groups.

  • It is unclear whether the rate of change in body weight on average is similar or different across dose groups. It is unclear that the pattern for individual pigs or “on average” is a straight line, so the rate of change may not be constant. Because vitamin E therapy was not administered until the beginning of week 5, we might expect two “phases” in the growth pattern, before and after vitamin E, making it possibly non-linear.

6.2.3 Example 3: Epileptic seizures and chemotherapy

A common situation is where the measurements are in the form of counts. A response in the form of a count is by nature discrete–counts (usually) take only nonnegative integer values (0, 1, 2, 3,…).

6.2.3.1 Data Context

The following data were first reported by Thall and Vail (1990). A clinical trial was conducted in which 59 people with epilepsy suffering from simple or partial seizures were assigned at random to receive either the anti-epileptic drug progabide (subjects 29-59) or an inert substance (a placebo, subjects 1-28) in addition to a standard chemotherapy regimen all were taking. Because each individual might be prone to different rates of experiencing seizures, the investigators first tried to get a sense of this by recording the number of seizures suffered by each subject over the 8 weeks before administering the assigned treatment. It is common in such studies to record such baseline measurements so that the effect of treatment for each subject may be measured relative to how that subject behaved before treatment.

Following the commencement of treatment, each subject’s seizures were counted for four two-week consecutive periods. The age of each subject at the start of the study was also recorded, as it was suspected that the subject’s age might be associated with the effect of the treatment.

The first 10 rows of the long format data for the study are shown below.

require(MASS)
head(epil,10)
   y     trt base age V4 subject period      lbase        lage
1  5 placebo   11  31  0       1      1 -0.7563538  0.11420370
2  3 placebo   11  31  0       1      2 -0.7563538  0.11420370
3  3 placebo   11  31  0       1      3 -0.7563538  0.11420370
4  3 placebo   11  31  1       1      4 -0.7563538  0.11420370
5  3 placebo   11  30  0       2      1 -0.7563538  0.08141387
6  5 placebo   11  30  0       2      2 -0.7563538  0.08141387
7  3 placebo   11  30  0       2      3 -0.7563538  0.08141387
8  3 placebo   11  30  1       2      4 -0.7563538  0.08141387
9  2 placebo    6  25  0       3      1 -1.3624896 -0.10090768
10 4 placebo    6  25  0       3      2 -1.3624896 -0.10090768

6.2.3.2 Research Questions

The primary objective of the study was to

  • Determine whether progabide reduces the rate of seizures in subjects like those in the trial.

Here, we have repeated measurements (counts) over four consecutive observation periods for each subject. We would like to compare the baseline seizure counts to post-treatment counts, where the latter are observed repeatedly over time following the initiation of treatment. An appropriate analysis would best use this data feature in addressing the main objective. Below is a boxplot of the change from baseline, separated by treatment group (ignoring the repeated measures).

epil %>%
  mutate(change = y - base) %>%
  ggplot(aes(x = trt, y = change)) +
  geom_boxplot() + 
  theme_classic()

Moreover, some counts are quite small; for some subjects, 0 seizures (none) were experienced in some periods. For example, subject 31 in the treatment group experienced only 0, 3, or 4 seizures over the four observation periods. Pretending that the response is continuous would be a lousy approximation of the true nature of the data! Thus, methods suitable for handling continuous data problems like the first two examples would not be appropriate for data like these.

A common approach to handling data in the form of counts is to transform them to some other scale. The motivation is to make them seem more “normally distributed” with constant variance, and the logarithm transformation is used to (hopefully) accomplish this. The desired result is that methods usually used to analyze continuous measurements may be applied.

However, the drawback of this approach is that one is no longer working with the data on the original scale of measurement, the number of seizures in this case. The statistical model then describes the “log number of seizures,” which is not particularly interesting or intuitive. New statistical methods have recently been developed to analyze discrete repeated measurements like counts on the original measurement scale.

6.2.4 Example 4: Maternal smoking and child respiratory health

Another common discrete data situation is where the response is binary; that is, the response may take on only two possible values, which usually correspond to things like

  • success or failure of a treatment to elicit a desired response
  • presence or absence of some condition

It would be foolish to even try and pretend such data are approximately continuous!

6.2.4.1 Data Context

The following data come from a very large public health study, the Six Cities Study, undertaken in six small American cities to investigate various public health issues. The full situation is reported in Lipsitz, Laird, and Harrington (1992). The analysis here focuses on the association between maternal smoking and child respiratory health. Each of the 300 children was examined once a year at ages 9–12. The response of interest was “wheezing status,” a measure of the child’s respiratory health, coded as either “no” (0) or “yes” (1), where “yes” corresponds to respiratory problems. Also recorded at each examination was a code indicating the mother’s current level of smoking: 0 = none, 1 = moderate, 2 = heavy. The data for the first 5 subjects are summarized below. Missing data are denoted by a “.”.

                    Smoking at age     Wheezing at age
Subject  City        9  10  11  12      9  10  11  12
1        Portage     2   2   1   1      1   0   0   0
2        Kingston    0   0   0   0      0   0   0   0
3        Portage     1   0   0   .      0   0   0   .
4        Portage     .   1   1   1      .   1   0   0
5        Kingston    1   .   1   2      0   .   0   1

A simplified version of this data is available in R.

require(geepack)
data(ohio)
head(ohio,10)
   resp id age smoke
1     0  0  -2     0
2     0  0  -1     0
3     0  0   0     0
4     0  0   1     0
5     0  1  -2     0
6     0  1  -1     0
7     0  1   0     0
8     0  1   1     0
9     0  2  -2     0
10    0  2  -1     0

6.2.4.2 Research Questions

The objective of an analysis of these data was to

  • Determine how the typical “wheezing” response pattern changes with age
  • Determine whether there is an association between maternal smoking severity and child respiratory status (as measured by “wheezing”).

It would be pretty pointless to plot the responses as a function of age as we did in the continuous data cases – here, the only responses are 0 or 1! Inspection of subject data suggests something is happening here; for example, subject 5 did not exhibit positive wheezing until his/her mother’s smoking increased in severity.

This highlights that this situation is complex: over time (measured here by the child’s age), an important characteristic, maternal smoking, changes. Contrast this with the previous situations, where the main focus is to compare groups whose membership stays constant over time.

Thus, we have repeated measurements, which are binary to further complicate matters! As with the count data, one might first think about summarizing and transforming the data to allow methods for continuous data to be used; however, this would be inappropriate. As we will see later, methods for dealing with repeated binary responses and scientific questions like those above have been developed.

Another feature of these data is that some measurements are missing for some subjects. Specifically, although the intention was to collect data for each of the four ages, this information is not available for some children and their mothers at some ages; for example, subject 3 has both the mother’s smoking status and wheezing indicator missing at age 12. This pattern would suggest that the mother may have failed to appear with the child for this intended examination.

A final note: In the guinea pig and epilepsy examples, units were assigned to treatments; thus, those may be regarded as controlled experiments, where the investigator has some control over how the factors of interest are “applied” to the units (through randomization). In contrast, in this study, the investigators did not decide which children would have mothers who smoke; instead, they could only observe the smoking behavior of the mothers and the wheezing status of their children. That is, this is an example of an observational study. Because it may be impossible or unethical to randomize subjects to potentially hazardous circumstances, studies of issues in public health and the social sciences are often observational.

As in many observational studies, an additional difficulty is the fact that the thing of interest, in this case, maternal smoking, also changes with the response over time. This leads to complicated issues of interpretation in statistical modeling that are a matter of some debate.

6.3 R: Wide vs. Long Format

In R, longitudinal data could be formatted in two ways.

Wide Format: When the observations times are the same for every unit (balanced data), then every unit has a vector of observations of length \(m\), and we can organize the unit’s data into one row per unit, resulting in a data set with \(n\) rows and \(m\) columns that correspond to outcome values (plus other columns that will correspond to an identifier variable and other explanatory variables). See the simulated example below. The first column is an id column to identify the units from each other, the next five columns correspond to the \(m=5\) observations over time, and the next five columns correspond to a time-varying explanatory variable.

n = 10
m = 5

(wideD = data.frame(id = 1:n, y.1 = rnorm(n), y.2 = rnorm(n), y.3 = rnorm(n), y.4 = rnorm(n), y.5 = rnorm(n),x.1 = rnorm(n), x.2 = rnorm(n), x.3 = rnorm(n), x.4 = rnorm(n), x.5 = rnorm(n))) #Example with 5 repeated observations for 10 units
   id        y.1         y.2         y.3          y.4        y.5         x.1
1   1  0.2229190 -0.40044017  1.04638018  0.951190268  0.2529054  0.23509587
2   2 -1.7574881  0.29048990  0.40700765  0.851799656  0.6963213  0.61780031
3   3  0.9101764 -0.04812071 -0.36169694 -0.017877035  2.3546848 -0.06112846
4   4 -0.8761556  0.54827634 -1.64282030  0.005730991 -0.3832106  0.10388365
5   5  0.3789327  0.67360216 -0.07102116  1.416815710 -0.3180320 -1.01445579
6   6 -0.4156297  0.95479619 -0.13246836 -1.422699903  0.7574937  0.94883833
7   7 -1.2117860 -0.18627411  0.91737807  0.753403943 -1.6817573 -0.32346929
8   8  0.6907564 -0.40625606  0.83640191 -1.033449971 -0.2821755 -2.12912086
9   9 -1.4548177  1.91416608 -0.39439335  1.819379402 -0.6446930  1.35694217
10 10 -2.8261005  1.81688850 -0.55797786  0.658739791 -1.0445874 -0.34820681
           x.2        x.3         x.4         x.5
1  -1.29526633  0.6891336  0.37959480  0.19682381
2  -1.01496311 -0.5206796  0.20125502 -0.93514929
3   1.07485761  0.2030646  0.26247952 -0.07891467
4  -0.31060443  0.2019785 -0.19075250 -0.44468166
5  -0.09328145  1.2249928 -1.53911299  1.53611618
6   1.29625614  1.0926679 -0.99434301  0.99971318
7   0.14059143  0.4700600  0.09758027  0.45289454
8   1.36626416 -0.2044600  0.75692999 -1.06383405
9   0.17185316  0.4349997 -0.61865890  1.19560854
10 -1.13162510 -0.4767187  0.53168931 -0.89197931

Long Format: We’ll need the data in long format for most data analysis. We must use a long format when every unit’s observation times differ. Imagine stacking each unit’s observed outcome vectors on top of each other. Similarly, we want to stack the observed vectors of any other explanatory variable we might have on the individuals. In contrast to wide format, it is necessary to have a variable to identify the unit and the observation time.

See below for the R code to convert a data set in wide format to one in long format. Ideally, the variable names encode both the variable and the time order of the values, such as y.1, y.2, ..., y.5, x.1, x.2, ..., x.5.

In the tidyr package, pivot_longer() takes the wide data, the columns (cols) you want to make longer, and the names of two new variables you want to create. The first (we use Tmp here) will contain the names of the columns being gathered. The second (called Value) will collect the values from those columns. See below.

require(tidyr)

pivot_longer(wideD, cols = y.1:x.5, names_to = 'Tmp', values_to = 'Value') %>% head()
# A tibble: 6 × 3
     id Tmp    Value
  <int> <chr>  <dbl>
1     1 y.1    0.223
2     1 y.2   -0.400
3     1 y.3    1.05 
4     1 y.4    0.951
5     1 y.5    0.253
6     1 x.1    0.235

Then, we want to separate the Tmp variable into two variables because it contains information about both the characteristic and the time. We can use separate() to split Tmp into Var and Time; it automatically detects the . as a separator.

pivot_longer(wideD, cols = y.1:x.5, names_to = 'Tmp', values_to = 'Value') %>%
  separate(Tmp,into=c('Var','Time'), remove=TRUE)  %>%
  head()
# A tibble: 6 × 4
     id Var   Time   Value
  <int> <chr> <chr>  <dbl>
1     1 y     1      0.223
2     1 y     2     -0.400
3     1 y     3      1.05 
4     1 y     4      0.951
5     1 y     5      0.253
6     1 x     1      0.235

Lastly, we want to have one variable called x and one variable called y, and we can get that by pivoting the variable Var wider into two columns with the values that come from Value.

pivot_longer(wideD, cols = y.1:x.5, names_to = 'Tmp', values_to = 'Value') %>%
  separate(Tmp,into=c('Var','Time'), remove=TRUE) %>%
  pivot_wider(names_from = 'Var', values_from = 'Value') %>%
  head()
# A tibble: 6 × 4
     id Time       y      x
  <int> <chr>  <dbl>  <dbl>
1     1 1      0.223  0.235
2     1 2     -0.400 -1.30 
3     1 3      1.05   0.689
4     1 4      0.951  0.380
5     1 5      0.253  0.197
6     2 1     -1.76   0.618

6.4 Notation

In order to

  • elucidate the assumptions made under different models and methods, and
  • describe the models and methods more easily,

it is convenient to think of all outcomes collected on the same unit over time or another set of conditions together so that complex relationships among them may be summarized.

Consider the random variable,

\[ Y_{ij} = \text{the }j\text{th measurement taken on unit }i,\quad\quad i= 1,...,n, j =1,...,m\]

Consider the dental study data (Example 1) to make this more concrete. Each child was measured 4 times at ages 8, 10, 12, and 14 years. Thus, we let \(j = 1,..., 4\) index the order of the measurements on a child. To summarize the information on when these times occur, we might further define

\[ t_{ij} = \text{the time at which the }j\text{th measurement on unit }i\text{ was taken.}\]

Here, \(t_{i1} = 8, t_{i2} = 10\), and so on, for all children in the study (\(i=1,...,27\)). Thus, \(t_{ij} = t_j\) for all \(i=1,...,n\). If we ignore the gender of the children for the moment, the outcomes for the \(i\)th child, where \(i\) ranges from 1 to 27, are \(Y_{i1},..., Y_{i4}\), taken at times \(t_{i1},...,t_{i4}\). We may summarize the measurements for the \(i\)th child even more succinctly: define the (4 × 1) random vector,

\[ \mathbf{Y}_i = \left(\begin{array}{c}Y_{i1}\\ Y_{i2}\\ Y_{i3}\\ Y_{i4} \end{array}\right).\]

The vector elements are random variables representing the outcomes that might be observed for child \(i\) at each time point. For this data set, the data are balanced because the observation times are the same for each unit, and the data are regular because the observation times are equally spaced apart. Most observational data is not balanced and is often irregular. We can generalize our notation to allow for this type of data by changing \(m\) to \(m_i\), which captures the total number of measurements for the \(i\)th unit,

\[ Y_{ij} = \text{the }j\text{th measurement taken on unit }i, \quad\quad i= 1,...,n, j =1,...,m_i\]

The important message is that it is possible to represent the outcomes for the \(i\)th child in a very streamlined and convenient way. Each child \(i\) has its outcome vector, \(\mathbf{Y}_i\). It often makes sense to think of the data not just as individual outcomes \(Y_{ij}\), some from one child, some from another according to the indices, but rather as vectors corresponding to children, the units–each unit has associated with it an entire outcome vector.
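
To make this notation concrete in R, here is a minimal sketch that builds the outcome vector \(\mathbf{Y}_i\) for each child from the dental data in long format. It assumes the same file path used in Example 1; the object name dental is ours.

# Build the outcome vector Y_i for each child (assumes the dental data file
# from Example 1 is available at this path)
dental <- read.table("./data/dental.dat",
                     col.names = c("obsno", "id", "age", "distance", "gender"))

# split() returns a list; element i is the (4 x 1) outcome vector Y_i,
# ordered by the observation index j = 1,...,4
Y <- split(dental$distance, dental$id)
Y[[1]]             # outcomes Y_11,...,Y_14 for child i = 1
length(Y)          # n = 27 children
sapply(Y, length)  # m_i = 4 for every child (balanced data)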

We can also consider explanatory variables. If we have \(p\) explanatory variables, we’ll let the value for the 1st variable, for the ith unit at the jth time, be represented as \(x_{ij1}\). That means that for the ith unit, we can collect all of their values of explanatory variables (across time) in a matrix,

\[\mathbf{X}_i = \left(\begin{array}{cccc} x_{i11}&x_{i12}&\cdots&x_{i1p}\\ x_{i21}&x_{i22}&\cdots&x_{i2p}\\ \vdots&\vdots&\ddots&\vdots\\ x_{im1}&x_{im2}&\cdots&x_{imp}\\ \end{array}\right)\]

These explanatory variables might be time-invariant, such as treatment group. Alternatively, we might have explanatory variables that are time-varying, such as age, in which the values observed at each time point may change.

6.4.1 Multivariate Normal Probability Model

We first discussed this in Chapter 2, so feel free to return to [Random Vectors and Matrices].

When we represent the outcomes for the \(i\)th unit as a random vector, \(\mathbf{Y}_i\), it is useful to consider a multivariate model such as the multivariate normal probability model.

The joint probability distribution that is the extension of the univariate version to a (\(m \times 1\)) random vector \(\mathbf{Y}\), each of whose components is normally distributed, is given by \[ f(\mathbf{y}) = \frac{1}{(2\pi)^{m/2}}|\Sigma|^{-1/2}\exp\{ -(\mathbf{y} - \mu)^T\Sigma^{-1}(\mathbf{y} - \mu)/2\}\]

  • This probability density function describes the probabilities with which the random vector \(\mathbf{Y}\) takes on values jointly in its \(m\) elements.

  • The form is determined by the mean vector \(\mu\) and covariance matrix \(\Sigma\).

The form of \(f(\mathbf{y})\) depends critically on the term \[(\mathbf{y} -{\mu})^T\Sigma^{-1} (\mathbf{y} -{\mu})\]

This quadratic form pops up in many common methods. You’ll see it in generalized least squares (GLS) and in the Mahalanobis distance as a standardized sum of squares. Read the next section if you’d like to think more deeply about this quadratic form.
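
As a small illustration, the quadratic form can be computed directly with matrix operations, and base R’s mahalanobis() returns the same quantity. The mean vector, covariance matrix, and observed vector below are made-up values.

# Compute (y - mu)^T Sigma^{-1} (y - mu) for made-up values with m = 3
mu    <- c(21, 23, 25)
Sigma <- matrix(c(4.0, 2.0, 1.0,
                  2.0, 4.0, 2.0,
                  1.0, 2.0, 4.0), nrow = 3)
y     <- c(22, 21, 27)

t(y - mu) %*% solve(Sigma) %*% (y - mu)   # matrix computation of the quadratic form
mahalanobis(y, center = mu, cov = Sigma)  # base R: the same squared Mahalanobis distance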

6.4.1.1 Quadratic Form (Optional)

The pdf of the multivariate normal distribution depends critically on the term \[(\mathbf{y} -{\mu})^T\Sigma^{-1} (\mathbf{y} -{\mu})\]

This is a quadratic form (in linear algebra), so it is a scalar function of the elements of \((\mathbf{y} -\mu)\) and \(\Sigma^{-1}\).

If we refer to the elements of \(\Sigma^{-1}\) as \(\sigma^{jk}\), i.e. \[\Sigma^{-1}=\left(\begin{array}{ccc} \sigma^{11}&\cdots&\sigma^{1m}\\ \vdots&\ddots&\vdots\\ \sigma^{m1}&\cdots&\sigma^{mm}\end{array} \right)\] then we may write \[(\mathbf{y} -{\mu})^T\Sigma^{-1} (\mathbf{y} -{\mu}) = \sum^m_{j=1}\sum^m_{k=1}\sigma^{jk}(y_j-\mu_j)(y_k-\mu_k).\]

Of course, the elements \(\sigma^{jk}\) will be complicated functions of the elements \(\sigma^2_j\), \(\sigma_{jk}\) of \(\Sigma\), i.e. the variances of the \(Y_j\) and the covariances among them.

  • This term thus depends on not only the squared deviations \((y_j - \mu_j)^2\) for each element in \(\mathbf{y}\) (which arise in the double sum when \(j = k\)), but also on the crossproducts \((y_j - \mu_j)(y_k - \mu_k)\). Each contribution of these squares and cross products is standardized by values \(\sigma^{jk}\) that involve the variances and covariances.

  • Thus, although it is quite complicated, one gets the suspicion that the quadratic form has an interpretation, albeit more complex, as a distance measure, just as in the univariate case.

To better understand the multivariate distribution, it is instructive to consider the special case \(m = 2\), the simplest example of a multivariate normal distribution (hence the name bivariate).

Here, \[\mathbf{Y} = \left(\begin{array}{c} Y_1\\ Y_2\end{array} \right), {\mu} = \left(\begin{array}{c} \mu_1\\\mu_2\end{array} \right), \Sigma = \left(\begin{array}{cc} \sigma^2_1&\sigma_{12}\\\sigma_{12}&\sigma^2_2\end{array} \right)\]

Using the inversion formula for a (\(2 \times 2\)) matrix,

\[\Sigma^{-1} = \frac{1}{\sigma^2_1\sigma^2_2 - \sigma^2_{12}}\left(\begin{array}{cc} \sigma^2_2&-\sigma_{12}\\-\sigma_{12}&\sigma^2_1 \end{array}\right)\]

We also have that the correlation between \(Y_1\) and \(Y_2\) is given by \[\rho_{12} = \frac{\sigma_{12}}{\sigma_1\sigma_2}.\]

Using these results, it is an algebraic exercise to show that (try it!) \[ (\mathbf{y} - \mu)^T\Sigma^{-1}(\mathbf{y} - \mu) = \frac{1}{1-\rho^2_{12}}\left\{ \frac{(y_1-\mu_1)^2}{\sigma^2_1}+\frac{(y_2-\mu_2)^2}{\sigma^2_2}-2\rho_{12}\frac{(y_1-\mu_1)}{\sigma_1}\frac{(y_2-\mu_2)}{\sigma_2}\right\}\]

  • One component is the sum of squared standardized values (z-scores) \[ \frac{(y_1-\mu_1)^2}{\sigma^2_1}+\frac{(y_2-\mu_2)^2}{\sigma^2_2}\]

This sum is similar to the sum of squared deviations in least squares, with the difference that each squared deviation is now weighted by the reciprocal of its variance. This makes sense: because the variances of \(Y_1\) and \(Y_2\) differ, information on the population of \(Y_1\) values is of a different quality than that on the population of \(Y_2\) values. If the variance is large, the quality of information is poorer; thus, the larger the variance, the smaller the weight, so that information of higher quality receives more weight in the overall measure. Indeed, this is like a distance measure, where each contribution receives an appropriate weight.

  • In addition, there is an extra term that makes it have a different form than just a sum of weighted squared deviations: \[-2\rho_{12}\frac{(y_1-\mu_1)}{\sigma_1}\frac{(y_2-\mu_2)}{\sigma_2}\]

This term depends on the cross-product, where each deviation is again standardized by its standard deviation. This term modifies the distance measure in a way connected with the association between \(Y_1\) and \(Y_2\) through their cross-product and correlation \(\rho_{12}\). Note that the larger this correlation in magnitude (either positive or negative), the more we modify the usual sum of squared deviations.

  • Note that the entire quadratic form also involves the multiplicative factor \(1/(1 -\rho^2_{12})\), which is greater than 1 if \(|\rho_{12}| > 0\). This factor scales the overall distance measure by the magnitude of the association.
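
Here is a quick numeric check of the bivariate identity displayed above (the algebraic exercise), using made-up values for the means, standard deviations, and correlation. The matrix quadratic form and the expanded expression agree.

# Numeric check of the bivariate identity, with made-up values
mu1 <- 0; mu2 <- 0
s1  <- 2; s2  <- 3           # standard deviations sigma_1, sigma_2
rho <- 0.6                   # correlation rho_12
s12 <- rho * s1 * s2         # covariance sigma_12

Sigma <- matrix(c(s1^2, s12,
                  s12,  s2^2), nrow = 2)
y  <- c(1.5, -2)
mu <- c(mu1, mu2)

# left-hand side: matrix quadratic form
lhs <- as.numeric(t(y - mu) %*% solve(Sigma) %*% (y - mu))

# right-hand side: expansion in terms of z-scores and the correlation
z1 <- (y[1] - mu1) / s1
z2 <- (y[2] - mu2) / s2
rhs <- (z1^2 + z2^2 - 2 * rho * z1 * z2) / (1 - rho^2)

all.equal(lhs, rhs)   # TRUE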

6.5 Failure of Standard Estimation Methods

6.5.1 Ordinary Least Squares

Imagine we have \(n\) units/subjects, and each unit has \(m\) observations over time. Conditional on \(p\) explanatory variables, we typically assume a linear model for the relationship between the explanatory variables and the outcome,

\[Y_{ij} = \beta_0 + \beta_1x_{ij1}+ \cdots+ \beta_px_{ijp} + \epsilon_{ij}\]

If we are using ordinary least squares to estimate the slope parameters, we assume the errors represent independent measurement error, \(\epsilon_{ij} \stackrel{iid}{\sim} \mathcal{N}(0,\sigma^2)\).

This can be written using vector and matrix notation with each unit having its outcome vector, \(\mathbf{Y}_i\),

\[\mathbf{Y}_i = \mathbf{X}_i\boldsymbol{\beta} + \boldsymbol\epsilon_i\text{ for each unit } i\] where \(\boldsymbol\epsilon_i \sim \mathcal{N}(0,\sigma^2I)\) and \(\mathbf{X}_i\) is the matrix of observations of the explanatory variables for subject \(i\), with \(m\) rows and \(p+1\) columns (a column of 1’s for the intercept plus \(p\) columns that correspond to the \(p\) variables).

If we stack all of our outcome vectors \(\mathbf{Y}\) (and covariate matrices \(\mathbf{X}\)) on top of each other (think Long Format in R),

\[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\]

where \(\boldsymbol{\epsilon} \sim \mathcal{N}(0,\sigma^2I)\) so we are assuming that our errors and thus our data are independent within and between each unit and have constant variance.

6.5.1.1 Estimation

To find the Ordinary Least Squares (OLS) estimate \(\widehat{\boldsymbol{\beta}}_{OLS}\), we need to minimize the sum of squared residuals, which can be written with vector and matrix notation,

\[(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})\]

Taking the derivative with respect to \((\beta_0,\beta_1,...,\beta_p)\), we get a set of simultaneous equations expressed in matrix notation as

\[-2\mathbf{X}^T\mathbf{Y} +2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}\]

As long as \(\mathbf{X}^T\mathbf{X}\) has an inverse, then the unique solution and our estimated parameters can be found using

\[\widehat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\]

This estimator is unbiased in that it estimates the true value on average, \(E(\widehat{\boldsymbol{\beta}}_{OLS}) = \boldsymbol{\beta}\).

Derivation:

\[E(\widehat{\boldsymbol{\beta}}_{OLS}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^TE(\mathbf{Y})\] \[= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\] \[= \boldsymbol{\beta}\]

It also has the minimum variance among all linear unbiased estimators IF the errors are independent and have constant variance.

Derivation:

\[Cov(\widehat{\boldsymbol{\beta}}_{OLS}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T Cov(\mathbf{Y})\{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\}^T\] \[= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\] \[= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\]

using \(Cov(\mathbf{A}\mathbf{Y}) = \mathbf{A}Cov(\mathbf{Y})\mathbf{A}^T\), which you can verify.

The OLS estimates of our coefficients are fairly good (unbiased), but the OLS standard errors are wrong unless our data are independent. Since we have correlated repeated measures, we don’t have as much “information” as if they were independent observations.
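
As a quick sketch, we can verify the OLS formula numerically: compute \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\) by hand and compare it to lm() on a small simulated longitudinal data set. The data and variable names below are made up for illustration.

# Verify the matrix formula for beta_hat_OLS against lm() (simulated data)
set.seed(1)
n <- 10; m <- 5
longD <- data.frame(id   = rep(1:n, each = m),
                    time = rep(1:m, times = n),
                    x    = rnorm(n * m))
longD$y <- 1 + 2 * longD$x + rnorm(n * m)

X <- model.matrix(~ x, data = longD)            # design matrix with intercept column
beta_ols <- solve(t(X) %*% X) %*% t(X) %*% longD$y
cbind(matrix_formula = as.vector(beta_ols),
      lm             = coef(lm(y ~ x, data = longD)))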

6.5.2 Generalized Least Squares

If you relax the assumption of constant variance, then the covariance matrix of \(\boldsymbol\epsilon_i\) and thus \(\mathbf{Y}_i\) is

\[\boldsymbol{\Sigma}_i = \left(\begin{array}{cccc} \sigma^2_1&0&\cdots&0\\ 0&\sigma^2_2&\cdots &0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\sigma^2_m\end{array}\right)\]

For longitudinal data, this would allow the variability to change over time.

If you relax both the assumptions of constant variance and independence, then the covariance matrix of \(\boldsymbol\epsilon_i\) and thus \(\mathbf{Y}_i\) is

\[\boldsymbol{\Sigma}_i = \left(\begin{array}{cccc} \sigma^2_1&\sigma_{12}&\cdots&\sigma_{1m}\\ \sigma_{21}&\sigma^2_2&\cdots &\sigma_{2m}\\ \vdots&\vdots&\ddots&\vdots\\ \sigma_{m1}&\sigma_{m2}&\cdots&\sigma^2_m\end{array}\right)\]

For longitudinal data, this would allow the variability to change over time, and you can account for dependence between repeated measures.

If we assume individual units are independent of each other (which is typically a fair assumption), the covariance matrix for the entire vector of outcomes (for all units and their observations over time) is written as a block diagonal matrix,

\[\boldsymbol\Sigma = \left(\begin{array}{cccc} \boldsymbol\Sigma_1&\mathbf{0}&\cdots&\mathbf{0}\\ \mathbf{0}&\boldsymbol\Sigma_2&\cdots &\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{0}&\mathbf{0}&\cdots&\boldsymbol\Sigma_n\end{array}\right)\]

6.5.2.1 Estimation

To find the Generalized Least Squares (GLS) estimator \(\widehat{\boldsymbol{\beta}}\), we need to minimize the standardized sum of squared residuals,

\[(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T\boldsymbol{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})\]

If \(\boldsymbol{\Sigma}\) is known, taking the derivative with respect to \(\beta_0,\beta_1,...,\beta_p\), we get a set of simultaneous equations expressed in matrix notation as

\[ -2\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{Y} +2\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X}\boldsymbol{\beta} = \mathbf{0}\]

As long as \(\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X}\) has an inverse, then the generalized least squares (GLS) estimator is

\[\widehat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{Y}\]

6.5.2.1.1 Alternative Derivation of GLS

An alternative approach is to consider transforming \(\mathbf{Y}\) and \(\mathbf{X}\) by pre-multiplying by the inverse of the lower triangular matrix \(\mathbf{L}\) from the Cholesky decomposition of the covariance matrix, \(\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^T\), and then finding the OLS estimator.

We start by transforming our data so that \(\mathbf{Y}^* = \mathbf{L}^{-1}\mathbf{Y}\), \(\mathbf{X}^* = \mathbf{L}^{-1}\mathbf{X}\), and \(\boldsymbol\epsilon^* = \mathbf{L}^{-1}\boldsymbol\epsilon\).

We can rewrite the model as,

\[\mathbf{L}^{-1}\mathbf{Y} = \mathbf{L}^{-1}\mathbf{X}\boldsymbol{\beta} + \mathbf{L}^{-1}\boldsymbol\epsilon\] \[\implies \mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta} + \boldsymbol\epsilon^*\] Then, using OLS on these transformed vectors, we get the generalized least squares estimator,

\[\implies \widehat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}^{*T}\mathbf{X}^*)^{-1}\mathbf{X}^{*T}\mathbf{Y}^*\] \[= ((\mathbf{L}^{-1}\mathbf{X})^T\mathbf{L}^{-1}\mathbf{X})^{-1}(\mathbf{L}^{-1}\mathbf{X})^T\mathbf{L}^{-1}\mathbf{Y}\] \[= (\mathbf{X}^T(\mathbf{L}^{-1})^T\mathbf{L}^{-1}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{L}^{-1})^T\mathbf{L}^{-1}\mathbf{Y}\] \[= (\mathbf{X}^T(\mathbf{L}^{T})^{-1}\mathbf{L}^{-1}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{L}^{T})^{-1}\mathbf{L}^{-1}\mathbf{Y}\] \[= (\mathbf{X}^T(\mathbf{L}\mathbf{L}^T)^{-1}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{L}\mathbf{L}^T)^{-1}\mathbf{Y}\] \[= (\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{Y}\]
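
The equivalence of the two routes to the GLS estimator can be checked numerically. The sketch below treats \(\boldsymbol\Sigma\) as known, uses a made-up AR(1)-type within-unit covariance, and compares the direct formula with the Cholesky-transform-then-OLS route.

# Sketch: beta_hat_GLS via the direct formula and via the Cholesky transformation
set.seed(2)
n <- 10; m <- 5
x <- rnorm(n * m)
R <- 0.7^abs(outer(1:m, 1:m, "-"))          # within-unit AR(1)-type correlation
Sigma_i <- 2 * R                            # within-unit covariance (made-up)
Sigma <- kronecker(diag(n), Sigma_i)        # block-diagonal covariance for all units

X <- cbind(1, x)
L <- t(chol(Sigma))                         # lower triangular, Sigma = L L^T
y <- as.vector(X %*% c(1, 2) + as.vector(L %*% rnorm(n * m)))  # correlated errors

# direct formula
Sinv <- solve(Sigma)
beta_gls <- solve(t(X) %*% Sinv %*% X) %*% t(X) %*% Sinv %*% y

# Cholesky route: transform, then OLS
Xs <- solve(L) %*% X
ys <- solve(L) %*% y
beta_chol <- solve(t(Xs) %*% Xs) %*% t(Xs) %*% ys

cbind(direct = as.vector(beta_gls), cholesky = as.vector(beta_chol))  # identical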

6.5.2.2 Estimation Issues

Main Issue: \(\boldsymbol{\Sigma}\) is unknown!

The solution to this problem is to iterate. Estimate \(\boldsymbol{\beta}\) and use that to estimate \(\boldsymbol{\Sigma}\). Repeat.

Other Issues:

  • If the longitudinal data is balanced, there are \(m(m+1)/2\) parameters in \(\boldsymbol\Sigma_i\) to estimate. That is a lot of parameters (!)
  • If the longitudinal data is unbalanced, there is no common \(\boldsymbol{\Sigma}_i\) for every unit

The solution to these problems involves simplifying assumptions such as

  • assuming the errors are independent (not a great solution)
  • assuming the errors have constant variance (maybe not too bad, depending on the data)
  • assuming there is some restriction/structure on the covariance function (best option)

We approximate the covariance matrix to get better estimates of \(\boldsymbol\beta\).

If we know the correlation structure of our data, we could use generalized least squares to get better (better than OLS) estimates of our coefficients, and the estimated standard errors would be more accurate.
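
For a continuous outcome, one practical way to impose such structure is nlme’s gls() function, which fits a marginal linear model with a chosen correlation structure. The sketch below assumes the dental data file from Example 1 and an exchangeable (compound symmetry) working correlation; it is meant only as an illustration of the idea, with the correlation parameter estimated by the software rather than by the hand-rolled iteration described above.

# Sketch: structured covariance via nlme::gls() on the dental data
library(nlme)
dental <- read.table("./data/dental.dat",
                     col.names = c("obsno", "id", "age", "distance", "gender"))
gls_fit <- gls(distance ~ age + factor(gender),
               data = dental,
               correlation = corCompSymm(form = ~ 1 | id))  # exchangeable within child
summary(gls_fit)   # coefficients plus the estimated correlation parameter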

The two main methods for longitudinal data that we’ll learn are similar to generalized least squares in flavor.

However, we need models that are not only applicable to continuous outcomes but also binary or count outcomes. Let’s first learn about the standard cross-sectional approaches to allow for binary and count outcomes (generalized linear models). Then, we’ll be able to talk about marginal models and mixed effects models.

6.6 Generalized Linear Models

When the outcome data are not continuous, we can’t use a least squares approach, as it is only appropriate for continuous outcomes. However, we can generalize the idea of a linear model to allow for binary or count outcomes. This is called a generalized linear model (GLM). GLMs extend regression to situations beyond continuous outcomes with Normal errors (Nelder and Wedderburn 1972), and they are, in fact, a broad class of models for outcomes that are continuous, discrete, binary, etc.

GLM’s requires a three-part specification:

  1. Distributional assumption for \(Y\)
  2. Systematic component with \(\mathbf{X}\)
  3. Link function to relate \(E(Y)\) with systematic component

6.6.1 Distributional Assumption

The first assumption you need to make to fit a GLM is to assume a distribution for the outcome \(Y\).

Many distributions you have learned in probability (normal, Bernoulli, binomial, Poisson) belong to the exponential family of distributions, which share a general form and statistical properties. GLMs are limited to this family of distributions.

One important statistical property of the exponential family is that the variance can be written as a scaled function of the mean,

\[Var(Y) = \phi v(\mu)\quad \text{ where } E(Y) = \mu\]

where \(\phi>0\) is a dispersion or scale parameter and \(v(\mu)\) is a variance function of the mean.
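
For reference, R’s family objects store exactly this variance function \(v(\mu)\); a few examples are printed below.

# The variance function v(mu) stored in R's family objects
binomial()$variance   # function (mu) mu * (1 - mu)
poisson()$variance    # function (mu) mu
gaussian()$variance   # function (mu) rep.int(1, length(mu)), i.e. constant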

6.6.2 Systematic Component

For a GLM, the mean or a transformed mean can be expressed as a linear combination of explanatory variables, which we’ll notate as \(\eta\):

\[\eta = \beta_0 + \beta_1X_{1}+\beta_2X_{2}+\cdots+\beta_pX_{p}\]

You’ll need to decide which explanatory variables should be used to model the mean. This may include a time variable (e.g., age, time since baseline, etc.) and other unit characteristics that are time-varying or time-invariant. We’ll refer to this as the mean model.
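
As a point of comparison for what follows, here is a standard (cross-sectional) GLM fit to the ohio data from Example 4, ignoring the repeated measures for the moment. The mean model (age and smoke) mirrors the GEE fits later in this chapter; the choice of these two explanatory variables is ours.

# Cross-sectional GLM for the ohio data (ignores the repeated measures)
require(geepack)   # for the ohio data
glm_fit <- glm(resp ~ age + smoke, family = binomial(link = "logit"), data = ohio)
summary(glm_fit)   # standard errors here assume independent observations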

6.7 Marginal Models

A marginal model for longitudinal data has a four-part specification, and the first three parts are similar to a GLM.

6.7.1 Model Specification

  1. A distributional assumption about \(Y_{ij}\). We assume a distribution from the exponential family. We will only use this assumption to determine the mean and variance relationship.

  2. A systematic component, which is the mean model, \(\eta_{ij} = \mathbf{x}_{ij}^T\boldsymbol\beta\), which gets connected to the mean through

  3. a link function: The conditional expectation \(E(Y_{ij}|x_{ij}) = \mu_{ij}\) is assumed to depend on explanatory variables through a given link function (similar to GLM),

\[g(\mu_{ij}) = \eta_{ij} = \mathbf{x}_{ij}^T\boldsymbol\beta\]

Note: This implies that given \(\mathbf{x}_{ij}\), there is no dependence of \(Y_{ij}\) on \(\mathbf{x}_{ik}\) for \(j\not=k\) (warning: this might not hold if \(Y_{ij}\) predicts \(\mathbf{x}_{i,j+1}\)).

The conditional variance of each \(Y_{ij}\), given explanatory variables, depends on the mean according to

\[Var(Y_{ij}|\mathbf{x}_{ij}) = \phi v(\mu_{ij})\]

Note: The \(\phi\) (phi parameter) is a dispersion parameter and scales variance up or down.

  4. Covariance Structure: The conditional within-subject association among repeated measures is assumed to be a function of a set of association parameters \(\boldsymbol\alpha\) (and the means \(\mu_{ij}\)). We choose a working correlation structure for \(Cor(\mathbf{Y}_i)\) from our familiar structures: Compound Symmetry/Exchangeable, Exponential/AR1, Gaussian, Spherical, etc. See [Autocorrelation Function] in Chapter 3 for examples.

Note: We may use the word “association” rather than correlation so that it applies in cases where the outcome is not continuous.

Based on the model specification, the covariance matrix for \(\mathbf{Y}_i\) is

\[\mathbf{V}_i = \mathbf{A}_i^{1/2}Cor(\mathbf{Y}_i)\mathbf{A}_i^{1/2}\]

where \(\mathbf{A}_i\) is a diagonal matrix with \(Var(Y_{ij}|X_{ij}) = \phi v(\mu_{ij})\) along the diagonal. We use the term working correlation structure to acknowledge our uncertainty about the assumed model for the within-subject associations.
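
As a small sketch, here is what \(\mathbf{V}_i\) looks like for one subject with four binary responses under an exchangeable working correlation; the fitted means, \(\alpha\), and \(\phi\) below are made-up values for illustration.

# Working covariance V_i = A_i^{1/2} Cor(Y_i) A_i^{1/2} for one subject (made-up values)
mu_i  <- c(0.15, 0.18, 0.20, 0.25)   # fitted means mu_ij
phi   <- 1                           # dispersion
alpha <- 0.35                        # exchangeable working correlation

A_half <- diag(sqrt(phi * mu_i * (1 - mu_i)))       # A_i^{1/2}, binomial variance function
R_work <- matrix(alpha, 4, 4); diag(R_work) <- 1    # exchangeable Cor(Y_i)
V_i <- A_half %*% R_work %*% A_half
round(V_i, 4)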

6.7.2 Interpretation

We typically discuss how a 1 unit increase in an explanatory variable impacts the mean, keeping all other variables constant.

In a marginal model, we explicitly model the overall or marginal mean, \(E(Y_{ij}|X_{ij})\).

For a continuous outcome (with identity link function), the expected response for the value of \(X_{ij} = x\) is

\[E(Y_{ij}|X_{ij}=x) = \beta_0 +\beta_1x \]

The expected response for value of \(X_{ij} = x+1\) is

\[E(Y_{ij}|X_{ij}=x+1) = \beta_0 +\beta_1(x+1) \]

The difference is \[E(Y_{ij}| X_{ij}=x+1) - E(Y_{ij}| X_{ij}=x)\] \[= \beta_0 +\beta_1(x+1) - (\beta_0 +\beta_1x) = \beta_1 \]

Therefore, in the context of a marginal model, \(\beta_1\) has the interpretation of a population mean change.

A 1-unit increase in \(X\) leads to a \(\beta_1\) increase in the overall mean response, keeping all other variables constant.

6.7.3 Estimation

To estimate the parameters, we will use generalized estimating equations (GEE). We want to minimize the following objective function (a standardized sum of squared residuals),

\[\sum^n_{i=1}(\mathbf{Y}_i - \boldsymbol\mu_i)^T\mathbf{V}_i^{-1}(\mathbf{Y}_i - \boldsymbol\mu_i)\]

treating \(\mathbf{V}_i\) as known and where \(\boldsymbol\mu_i\) is a vector of means with elements \(\mu_{ij} = g^{-1}(\mathbf{x}_{ij}^T\boldsymbol\beta)\).

Using calculus, it can be shown that if a minimum exists, it must solve the following generalized estimating equations,

\[\sum^n_{i=1}\mathbf{D}^T_i\mathbf{V}_i^{-1}(\mathbf{Y}_i - \boldsymbol\mu_i) = 0\]

where \(\mathbf{D}_i = \partial\boldsymbol\mu_i/\partial\boldsymbol\beta\) is the gradient matrix.

Depending on the link function, there may or may not be a closed-form solution. If not, the solution requires an iterative algorithm.

Because the GEE depend on \(\boldsymbol\beta\), \(\boldsymbol\alpha\) (parameters for the working correlation matrix), and \(\phi\) (variance scalar), the following iterative two-stage estimation procedure is required:

Step 1. Given current estimates of \(\boldsymbol\alpha\) and \(\phi\), \(V_i\) is estimated, and an updated estimate of \(\boldsymbol\beta\) is obtained as the solution to the generalized estimating equations.

Step 2. Given the current estimate of \(\beta\), updated estimates of \(\boldsymbol\alpha\) and \(\phi\) are obtained from the standardized residuals

\[e_{ij} = \frac{Y_{ij} - \hat{\mu}_{ij}}{\sqrt{v(\hat{\mu}_{ij})}}\]

Step 3. We iterate between Steps 1 and 2 until convergence has been achieved (estimates for \(\boldsymbol\beta\), \(\boldsymbol\alpha\), and \(\phi\) don’t change).

Starting or initial estimates of \(\boldsymbol\beta\) can be obtained assuming independence.
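
The sketch below illustrates one pass of Step 2 on the ohio data: starting from an independence (ordinary glm) fit, it computes the standardized residuals and then moment estimates of \(\phi\) and an exchangeable \(\alpha\). The divisor conventions shown are one common choice; GEE software handles these details internally.

# Rough sketch of Step 2: estimate phi and an exchangeable alpha from
# standardized residuals of an independence fit (assumes geepack's ohio data)
fit0 <- glm(resp ~ age + smoke, family = binomial, data = ohio)
mu   <- fitted(fit0)
e    <- (ohio$resp - mu) / sqrt(mu * (1 - mu))   # e_ij, since v(mu) = mu(1 - mu)

N <- length(e); p <- length(coef(fit0))
phi <- sum(e^2) / (N - p)                        # moment estimate of the dispersion

# exchangeable alpha: average product of residual pairs within a subject
# (ohio is balanced, with 4 observations per child)
alpha_num <- 0; n_pairs <- 0
for (ei in split(e, ohio$id)) {
  prs <- combn(ei, 2)                            # all within-subject pairs
  alpha_num <- alpha_num + sum(prs[1, ] * prs[2, ])
  n_pairs   <- n_pairs + ncol(prs)
}
alpha <- alpha_num / (phi * (n_pairs - p))
c(phi = phi, alpha = alpha)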

Note: In estimating the model, we only use our assumption about the mean model,

\[g(\mu_{ij}) = \eta_{ij} = \mathbf{x}_{ij}^T\boldsymbol\beta\]

and the covariance model of the observations,

\[\mathbf{V}_i = \mathbf{A}_i^{1/2}Cor(\mathbf{Y}_i)\mathbf{A}_i^{1/2}\]

where \(\mathbf{A}_i\) is a diagonal matrix with \(Var(Y_{ij}|X_{ij}) = \phi v(\mu_{ij})\) along the diagonal.

6.7.3.1 R: Three GEE R Packages

There are three packages to do this in R, gee, geepack, and geeM.

We will use geeM when we have large data sets because it is optimized for large data. The geepack package is nice as it has an anova method to help compare models. The gee package has some nice output.

The syntax for the function geem() in geeM package is

geem(formula, id, waves = NULL, data, family = gaussian, corstr = "independence", Mv = 1)
  • formula: Symbolic description of the model to be fitted
  • id: Vector that identifies the clusters
  • data: Optional dataframe
  • family: Description of the error distribution and link function
  • corstr: A character string specifying the correlation structure. Allowed structures are: “independence”, “exchangeable” (equal correlation), “ar1” (exponential decay), “m-dependent” (m-diagonal), “unstructured”, “fixed”, and “userdefined”.
  • Mv: for “m-dependent”, the value for m.
require(geeM)
Loading required package: geeM
Loading required package: Matrix

Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
summary(geem(resp ~ age + smoke, id = id, data = ohio, family = binomial(link = 'logit'), corstr = 'ar1'))
            Estimates Model SE Robust SE    wald       p
(Intercept)   -1.8980  0.10960   0.11470 -16.550 0.00000
age           -0.1148  0.05586   0.04494  -2.554 0.01066
smoke          0.2438  0.16620   0.17980   1.356 0.17520

 Estimated Correlation Parameter:  0.3992 
 Correlation Structure:  ar1 
 Est. Scale Parameter:  1.017 

 Number of GEE iterations: 3 
 Number of Clusters:  537    Maximum Cluster Size:  4 
 Number of observations with nonzero weight:  2148 

The syntax for the function geeglm() in geepack package is

geeglm(formula, family = gaussian, data, id, zcor = NULL, corstr, std.err = 'san.se')
  • formula: Symbolic description of the model to be fitted
  • family: Description of the error distribution and link function
  • data: Optional dataframe
  • id: Vector that identifies the clusters
  • corstr: A character string specifying the correlation structure. The following are permitted: “independence”, “exchangeable”, “ar1”, “unstructured” and “userdefined”
  • zcor: Enter a user defined correlation structure
require(geepack)
summary(geeglm(resp ~ age + smoke, id = id, data = ohio, family = binomial(link = 'logit'), corstr = 'ar1'))

Call:
geeglm(formula = resp ~ age + smoke, family = binomial(link = "logit"), 
    data = ohio, id = id, corstr = "ar1")

 Coefficients:
            Estimate  Std.err    Wald Pr(>|W|)    
(Intercept) -1.90218  0.11525 272.409   <2e-16 ***
age         -0.11489  0.04539   6.407   0.0114 *  
smoke        0.23448  0.18119   1.675   0.1956    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = ar1 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    1.021  0.1232
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.491 0.06733
Number of clusters:   537  Maximum cluster size: 4 

The syntax for the function gee() in gee package is

gee(formula, id, data, family = gaussian, corstr = 'independence', Mv)
  • formula: Symbolic description of the model to be fitted
  • family: Description of the error distribution and link function
  • data: Optional dataframe
  • id: Vector that identifies the clusters
  • corstr: Working correlation structure: “independence”, “exchangeable”, “AR-M”, “unstructured”
  • Mv: order of AR correlation (AR1: Mv = 1)
require(gee)
Loading required package: gee
summary(gee(resp ~ age + smoke, id = id, data = ohio, family = binomial(link = 'logit'), corstr = 'AR-M', Mv = 1))
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept)         age       smoke 
    -1.8837     -0.1134      0.2721 

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logit 
 Variance to Mean Relation: Binomial 
 Correlation Structure:     AR-M , M = 1 

Call:
gee(formula = resp ~ age + smoke, id = id, data = ohio, family = binomial(link = "logit"), 
    corstr = "AR-M", Mv = 1)

Summary of Residuals:
    Min      1Q  Median      3Q     Max 
-0.1939 -0.1586 -0.1439 -0.1179  0.8821 


Coefficients:
            Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)  -1.8982    0.10962 -17.316     0.11468  -16.552
age          -0.1148    0.05586  -2.054     0.04494   -2.554
smoke         0.2438    0.16620   1.467     0.17983    1.356

Estimated Scale Parameter:  1.017
Number of Iterations:  3

Working Correlation
        [,1]   [,2]   [,3]    [,4]
[1,] 1.00000 0.3990 0.1592 0.06352
[2,] 0.39900 1.0000 0.3990 0.15920
[3,] 0.15920 0.3990 1.0000 0.39900
[4,] 0.06352 0.1592 0.3990 1.00000

6.7.3.2 Properties of Estimators

We acknowledge that our working covariance matrix \(\mathbf{V}_i\) only approximates the true underlying covariance matrix for \(\mathbf{Y}_i\); it may well be wrong.

Despite incorrectly specifying the correlation structure,

  • \(\hat{\boldsymbol\beta}\) is consistent (meaning that with high probability \(\hat{\boldsymbol\beta}\) is close to the true \(\boldsymbol\beta\) for very large samples) whether or not the within-subject associations are correctly modeled.

  • In large samples, the sampling distribution of \(\hat{\boldsymbol\beta}\) is multivariate normal with mean \(\boldsymbol\beta\) and the covariance is a sandwich, \(Cov(\hat{\boldsymbol\beta}) = B^{-1}MB^{-1}\) where the bread \(B\) is defined as \[B = \sum^n_{i=1} \mathbf{D}_i^T\mathbf{V}_i^{-1}\mathbf{D}_i\] and the middle (meat or cheese or filling) \(M\) is defined as \[M = \sum^n_{i=1} \mathbf{D}_i^T\mathbf{V}_i^{-1}Cov(\mathbf{Y}_i)\mathbf{V}_i^{-1}\mathbf{D}_i\]

We can plug in estimates of the quantities and get

\[\widehat{Cov}(\hat{\boldsymbol\beta}) = \hat{B}^{-1}\hat{M}\hat{B}^{-1}\] where \[\hat{B} = \sum^n_{i=1} \hat{\mathbf{D}}_i^T\hat{\mathbf{V}}_i^{-1}\hat{\mathbf{D}}_i\] and \[\hat{M} = \sum^n_{i=1} \hat{\mathbf{D}}_i^T\hat{\mathbf{V}}_i^{-1}\widehat{Cov}(\mathbf{Y}_i)\hat{\mathbf{V}}_i^{-1}\hat{\mathbf{D}}_i\] and \(\widehat{Cov}(\mathbf{Y}_i) = (\mathbf{Y}_i - \hat{\boldsymbol\mu}_i)(\mathbf{Y}_i-\hat{\boldsymbol\mu}_i)^T\)

This is the empirical or sandwich robust estimator of the standard errors. It is valid even if we are wrong about the correlation structure.

Fun fact: If we happen to choose the right correlation/covariance structure and \(\mathbf{V}_i = \boldsymbol\Sigma_i\), where \(\boldsymbol\Sigma_i = Cov(\mathbf{Y}_i)\), then \(Cov(\hat{\boldsymbol\beta}) = B^{-1}\).

Since we model the conditional expectation \(E(Y_{ij}|\mathbf{x}_{ij}) = \mu_{ij}\) with

\[g(\mu_{ij}) = \mathbf{x}_{ij}^T\beta,\]

we can only make overall population interpretations regarding mean response at a given value of \(\mathbf{x}_{ij}\).

We can compare two lists of \(\mathbf{x}\) values, \(\mathbf{x}^*\) and \(\mathbf{x}\) and see how that impacts the population mean via the link function \(g(\mu)\),

\[g(\mu^*) - g(\mu) = \mathbf{x}^{*T}\beta - \mathbf{x}^{T}\beta\]

6.7.4 Model Selection Tools and Diagnostics

The model selection tools we use will be similar to those used in Stat 155. To refresh your memory of how hypothesis testing can be used for model selection, please see my Introduction to Statistical Models Notes.

6.7.4.1 Hypothesis Testing for Coefficients (one at a time)

This section is similar to the t-tests in Stat 155.

To decide which explanatory variables should be in the model, we can test whether \(\beta_k=0\) for some \(k\). If the slope coefficient were equal to zero, the associated variable is essentially not contributing to the model.

In marginal models (GEE), the estimated covariance matrix for \(\hat{\boldsymbol\beta}\) is given by the robust sandwich estimator. The standard error for \(\hat{\beta}_k\) is the square rooted values of the diagonal of this matrix corresponding to \(\hat{\beta}_k\). This is reported in the summary output for Robust SE.

To test the hypothesis \(H_0: \beta_k=0\) vs. \(H_A: \beta_k\not=0\), we can calculate a z-statistic (referred to as a Wald statistic),

\[z = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)}\]

With GEE, the geem() function provides Wald statistics and associated p-values for testing \(H_0: \beta_k= 0\). These p-values come from a Normal distribution because we know that is the asymptotic distribution for these estimators.

geemod <- geem(resp ~ age + smoke, id = id, data = ohio, family = binomial(link = 'logit'), corstr = 'exchangeable')
summary(geemod) #look for Robust SE and wald
            Estimates Model SE Robust SE    wald        p
(Intercept)   -1.8800  0.11480   0.11390 -16.510 0.000000
age           -0.1134  0.04354   0.04386  -2.585 0.009726
smoke          0.2651  0.17700   0.17770   1.491 0.135900

 Estimated Correlation Parameter:  0.3541 
 Correlation Structure:  exchangeable 
 Est. Scale Parameter:  0.9999 

 Number of GEE iterations: 3 
 Number of Clusters:  537    Maximum Cluster Size:  4 
 Number of observations with nonzero weight:  2148 
geemod$beta/sqrt(diag(geemod$var)) #wald statistics by hand
(Intercept)         age       smoke 
    -16.510      -2.585       1.491 
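The reported p-values can be reproduced from the Normal approximation; a short sketch using the geemod object above:

z <- geemod$beta / sqrt(diag(geemod$var))  # Wald z-statistics using the robust SEs
2 * pnorm(abs(z), lower.tail = FALSE)      # two-sided p-values from the Normal approximation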

6.7.4.2 Hypothesis Testing for Coefficients (many at one time)

This section is similar to the nested F-tests in Stat 155.

To test a more involved hypothesis \(H_0: \mathbf{L}\boldsymbol\beta=0\) (to test whether multiple slopes are zero or a linear combo of slopes is zero), we can calculate a Wald statistic (a squared z statistic),

\[W^2 = (\mathbf{L}\hat{\boldsymbol\beta})^T(\mathbf{L}\widehat{Cov}(\hat{\boldsymbol\beta})\mathbf{L}^T)^{-1}(\mathbf{L}\hat{\boldsymbol\beta})\]

We assume the sampling distribution is approximately \(\chi^2\) with df = # of rows of \(\mathbf{L}\) to calculate p-values (as long as \(n\) is large).

This model selection tool is useful when you have a categorical variable with more than two categories: we may want to check whether the entire variable should be excluded from the model, not just whether a particular category has an impact.

With the data example, let's look at smoking. It is only a binary categorical variable, but let's run the hypothesis test with the example code from above. In this case, we let the matrix \(\mathbf{L}\) be a \(1\times 3\) matrix with 0's everywhere except in the 3rd column, so that we test \(H_0: \beta_2 = 0\). Even though we use a different test statistic, we end up with the same p-value as the approach above.

summary(geemod)
            Estimates Model SE Robust SE    wald        p
(Intercept)   -1.8800  0.11480   0.11390 -16.510 0.000000
age           -0.1134  0.04354   0.04386  -2.585 0.009726
smoke          0.2651  0.17700   0.17770   1.491 0.135900

 Estimated Correlation Parameter:  0.3541 
 Correlation Structure:  exchangeable 
 Est. Scale Parameter:  0.9999 

 Number of GEE iterations: 3 
 Number of Clusters:  537    Maximum Cluster Size:  4 
 Number of observations with nonzero weight:  2148 
b = geemod$beta
W = geemod$var

(L = matrix(c(0,0,1),nrow=1))
     [,1] [,2] [,3]
[1,]    0    0    1
L%*%b # Lb estimate
       [,1]
[1,] 0.2651
(se = sqrt(diag(L%*%W%*%t(L)))) # Robust SE for Lb
[1] 0.1777
# 95% Confidence Interval (using Asymptotic Normality)
L%*%b - 1.96*se
        [,1]
[1,] -0.0833
L%*%b + 1.96*se
       [,1]
[1,] 0.6135
# Hypothesis Testing

w2 <- as.numeric( t(L%*%b) %*% solve(L %*% W %*% t(L))%*% (L%*%b)) ## should be approximately chi squared
1 - pchisq(w2, df = nrow(L)) # p-value
[1] 0.1359

So the p-value, the probability of getting a test statistic as or more extreme than the one observed assuming the null hypothesis is true, is around 13%. We do not have enough evidence to reject the null hypothesis, which suggests that smoking could be removed from the model.

To test \(H_0: \beta_1 = \beta_2 = 0\), we let the matrix \(\mathbf{L}\) be a \(2\times 3\) matrix with 0's everywhere except in the 2nd column of row 1 and the 3rd column of row 2.

summary(geemod)
            Estimates Model SE Robust SE    wald        p
(Intercept)   -1.8800  0.11480   0.11390 -16.510 0.000000
age           -0.1134  0.04354   0.04386  -2.585 0.009726
smoke          0.2651  0.17700   0.17770   1.491 0.135900

 Estimated Correlation Parameter:  0.3541 
 Correlation Structure:  exchangeable 
 Est. Scale Parameter:  0.9999 

 Number of GEE iterations: 3 
 Number of Clusters:  537    Maximum Cluster Size:  4 
 Number of observations with nonzero weight:  2148 
b = geemod$beta
W = geemod$var

(L = matrix(c(0,1,0,0,0,1),nrow=2,byrow=TRUE))
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    0    0    1
L%*%b #Lb estimate
        [,1]
[1,] -0.1134
[2,]  0.2651
(se = sqrt(diag(L%*%W%*%t(L)))) # Robust SE for Lb
[1] 0.04386 0.17775
# 95% Confidence Interval (using Asymptotic Normality)
L%*%b - 1.96*se
        [,1]
[1,] -0.1993
[2,] -0.0833
L%*%b + 1.96*se
         [,1]
[1,] -0.02743
[2,]  0.61346
# Hypothesis Testing

w2 <- as.numeric( t(L%*%b) %*% solve(L %*% W %*% t(L))%*% (L%*%b)) ## should be approximately chi squared
1 - pchisq(w2, df = nrow(L)) #p-value
[1] 0.01092

We have evidence to suggest that at least one of the variables, age or smoking, has a statistically significant relationship with the outcome. As we saw above, it must be age, but the model with age and smoking is better than a model with no explanatory variables.

6.7.4.3 Diagnostics

For any linear model with a quantitative outcome and at least one quantitative explanatory variable, we should check to see if the residuals have a pattern.

Ideally, a residual plot, with the residuals on the y-axis and the fitted values (or a quantitative explanatory variable) on the x-axis, should show no obvious pattern. See some example code below.

resid = mod$y - predict(mod)  # residuals: observed outcome minus fitted values
plot(predict(mod), resid)     # plot residuals against fitted values and look for patterns

If there is a pattern, then we are systematically over- or under-predicting, and we can use that information to incorporate non-linearity and improve the model.
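For instance, a sketch along the lines of the code above (still with a hypothetical fitted model object mod), adding a loess smoother so any systematic over- or under-prediction stands out:

resid <- mod$y - predict(mod)                                   # observed minus fitted
scatter.smooth(predict(mod), resid, lpars = list(col = "red"))  # residuals vs. fitted with a smoother
abline(h = 0, lty = 2)                                          # reference line at zero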

6.8 Mixed Effects

Let’s return to a continuous outcome setting. Consider the mean response for a specific individual (\(i\)th individual/unit) at a specific observation time (\(j\)th time).

\[ \mu_{ij} = E(Y_{ij}|\text{ Explanatory Variables, Time})\]

  • If the mean is constant across observation times (horizontal trend line) and other variables (no X’s), and the same across all individuals, \[\mu_{ij} = \mu\]

  • If the mean changes linearly across observation times and no other variables impact the mean, and it is the same across all individuals, \[\mu_{ij} = \beta_0 + \beta_1 t_{ij}\]

  • If the mean increases linearly with a time-varying explanatory variable and is the same across all individuals, \[\mu_{ij} = \beta_0 + \beta_1 x_{ij}\]

6.8.1 Individual Intercepts

In any of the situations listed above, individuals may have their own intercept, their own starting level. We can denote that by adding \(b_i\) as an individual deviation from the mean intercept,

\[\mu_{ij} = \mu + b_i\]

\[\mu_{ij} = \beta_0 + b_i + \beta_1 t_{ij}\]

\[\mu_{ij} = \beta_0 + b_i + \beta_1 x_{ij}\]

We could estimate the individual intercepts to find \(b_i\), the difference between an individual's intercept and the overall mean intercept. If \(b_i = -2\), individual \(i\) starts 2 units lower in their outcome responses than the average.

6.8.1.1 First Approach: Separate Models

Let’s use the dental data to figure out the overall mean intercept and estimate each deviation from that intercept. The first approach is to fit a separate linear model for each person and look at the intercepts.

dental <- read.table("./data/dental.dat", col.names=c("obsno", "id", "age", "distance", "gender"))
int <- sapply(unique(dental$id),function(i) lm(distance~age, data = dental[dental$id == i,])$coef[1])
mean(int) #mean intercept
[1] 16.76
int-mean(int) #estimated b_i
(Intercept) (Intercept) (Intercept) (Intercept) (Intercept) (Intercept) 
     0.4889     -2.5611     -2.3611      2.8889      2.8389      0.2389 
(Intercept) (Intercept) (Intercept) (Intercept) (Intercept) (Intercept) 
     0.1889      4.6889      1.3389     -3.2111      2.1889      0.5389 
(Intercept) (Intercept) (Intercept) (Intercept) (Intercept) (Intercept) 
    -1.9111     -0.7611      7.9389     -3.1111      2.1889     -1.8111 
(Intercept) (Intercept) (Intercept) (Intercept) (Intercept) (Intercept) 
     2.9889     -2.3611      4.4889      3.2889     -3.5111    -13.9611 
(Intercept) (Intercept) (Intercept) 
     2.3389     -3.2611      0.1889 

However, this stratification approach also allows everyone to have their own slope for age, which we may not want to do. If there is some common growth across persons, we’d like to borrow information across all persons and fit only one model.

6.8.1.2 Second Approach: Fixed Effects Models

To estimate a model with one fixed slope for everyone that allows for individual intercepts, we fit a model often termed a fixed effects model. In this model, we include an indicator variable for every individual id to allow for individual intercepts. This can be quite difficult if we have a large sample size because we are estimating one parameter per person or unit.

lm(distance ~ age + factor(id), data = dental)

Call:
lm(formula = distance ~ age + factor(id), data = dental)

Coefficients:
 (Intercept)           age   factor(id)2   factor(id)3   factor(id)4  
       14.11          0.66          1.62          2.37          3.50  
 factor(id)5   factor(id)6   factor(id)7   factor(id)8   factor(id)9  
        1.25         -0.25          1.62          2.00         -0.25  
factor(id)10  factor(id)11  factor(id)12  factor(id)13  factor(id)14  
       -2.88          5.00          6.37          2.00          2.87  
factor(id)15  factor(id)16  factor(id)17  factor(id)18  factor(id)19  
        5.25          1.62          5.00          2.37          2.50  
factor(id)20  factor(id)21  factor(id)22  factor(id)23  factor(id)24  
        3.75          8.12          2.25          2.87          2.87  
factor(id)25  factor(id)26  factor(id)27  
        3.50          4.50          1.62  

6.8.1.3 Third Approach: Random Effects Models

An alternative way for individuals to have their own intercept is to assume a probability distribution for the \(b_i\)'s, such as \(N(0,\sigma_b^2)\), and estimate the variability \(\sigma^2_b\). This is a random intercept model,

\[Y_{ij} = \beta_0 + b_i + \beta_1x_{ij} + \epsilon_{ij},\quad \quad b_i\stackrel{iid}{\sim} N(0,\sigma^2_b), \epsilon_{ij}\stackrel{iid}{\sim} N(0,\sigma^2_e),\] where \(\epsilon_{ij}\) and \(b_i\) are independent for any \(i\) and \(j\).

library(lme4)
summary(lmer(distance ~ age + (1|id), data = dental))
Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ age + (1 | id)
   Data: dental

REML criterion at convergence: 447

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-3.665 -0.535 -0.013  0.487  3.722 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 4.47     2.11    
 Residual             2.05     1.43    
Number of obs: 108, groups:  id, 27

Fixed effects:
            Estimate Std. Error t value
(Intercept)  16.7611     0.8024    20.9
age           0.6602     0.0616    10.7

Correlation of Fixed Effects:
    (Intr)
age -0.845

In the output above, you’ll see information about the standardized or scaled residuals.

Below it, you'll see information on the random effects. In our case, we have a random intercept for each individual, which we allowed by adding + (1 | id) to the formula. The 1 indicates the intercept, | indicates conditional on, and id refers to the variable called id. We assumed that \(b_i \sim N(0,\sigma^2_b)\), and the estimates are \(\hat{\sigma}^2_b = 4.472\) and \(\hat{\sigma}_b = 2.115\). Additionally, we have assumed that our errors are independent with constant variance, with estimates \(\hat{\sigma}^2_e = 2.049\) and \(\hat{\sigma}_e = 1.432\).

Lastly, we have our fixed effects, the parameters we don’t assume are random, \(\beta_0\) and \(\beta_1\). Those estimates are \(\hat{\beta}_0 = 16.76\) and \(\hat{\beta}_1 = 0.66\).

Now, with this random effects model, the addition of a random intercept has induced correlation between observations within an individual because they now share an intercept.

In fact, assuming the model is true,

\[Cov(Y_{ij},Y_{il}) = Cov(\beta_0+b_i +\beta_1x_{ij}+\epsilon_{ij},\beta_0+b_i +\beta_1x_{il}+\epsilon_{il})\] \[= Cov(b_i +\epsilon_{ij},b_i + \epsilon_{il})\] \[= Cov(b_i, b_i) + Cov(b_i,\epsilon_{ij}) + Cov(b_i, \epsilon_{il}) + Cov(\epsilon_{ij},\epsilon_{il})\] \[= \sigma^2_b + 0+0+0\]

\[Cor(Y_{ij},Y_{il}) = \frac{Cov(Y_{ij},Y_{il}) }{\sqrt{Var(Y_{ij})Var(Y_{il})}}\] \[ = \frac{\sigma^2_b }{\sqrt{Var(\beta_0+b_i +\beta_1X_{ij}+\epsilon_{ij})Var(\beta_0+b_i +\beta_1X_{il}+\epsilon_{il})}}\] \[ = \frac{\sigma^2_b }{\sqrt{Var(b_i +\epsilon_{ij})Var(b_i +\epsilon_{il})}}\] \[ = \frac{\sigma^2_b }{\sqrt{(\sigma^2_b + \sigma^2_e)^2}}\]

\[ = \frac{\sigma^2_b }{(\sigma^2_b + \sigma^2_e)}\]

for \(j\not=l\) and this does not depend on \(j\) or \(l\) or \(j-l\). The correlation is constant for any time lags (exchangeable/compound symmetric).
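As a quick check against the output above, we can compute this implied within-person correlation from the fitted random intercept model (refit here under a hypothetical object name):

mod.randomint <- lmer(distance ~ age + (1|id), data = dental)
vc <- as.data.frame(VarCorr(mod.randomint))
sigma2_b <- vc$vcov[vc$grp == "id"]        # estimated between-person variance
sigma2_e <- vc$vcov[vc$grp == "Residual"]  # estimated within-person (error) variance
sigma2_b / (sigma2_b + sigma2_e)           # roughly 4.47/(4.47 + 2.05), about 0.69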

6.8.2 Individual Slopes

Since we have longitudinal data, we can observe each individual's trajectory over time; everyone has their own unique growth rate. If that growth rate can be explained with an explanatory variable, we could use an interaction term to allow for that unique growth. Otherwise, we could allow each individual to have their own slope.

6.8.2.1 Second Approach: Fixed Effects Models

One could use a fixed effects model with an interaction term involving the id variable to estimate each individual's slope, but that is not sustainable for larger data sets.

lm(distance ~ age*factor(id), data = dental)

Call:
lm(formula = distance ~ age * factor(id), data = dental)

Coefficients:
     (Intercept)               age       factor(id)2       factor(id)3  
        1.73e+01          3.75e-01         -3.05e+00         -2.85e+00  
     factor(id)4       factor(id)5       factor(id)6       factor(id)7  
        2.40e+00          2.35e+00         -2.50e-01         -3.00e-01  
     factor(id)8       factor(id)9      factor(id)10      factor(id)11  
        4.20e+00          8.50e-01         -3.70e+00          1.70e+00  
    factor(id)12      factor(id)13      factor(id)14      factor(id)15  
        5.00e-02         -2.40e+00         -1.25e+00          7.45e+00  
    factor(id)16      factor(id)17      factor(id)18      factor(id)19  
       -3.60e+00          1.70e+00         -2.30e+00          2.50e+00  
    factor(id)20      factor(id)21      factor(id)22      factor(id)23  
       -2.85e+00          4.00e+00          2.80e+00         -4.00e+00  
    factor(id)24      factor(id)25      factor(id)26      factor(id)27  
       -1.45e+01          1.85e+00         -3.75e+00         -3.00e-01  
 age:factor(id)2   age:factor(id)3   age:factor(id)4   age:factor(id)5  
        4.25e-01          4.75e-01          1.00e-01         -1.00e-01  
 age:factor(id)6   age:factor(id)7   age:factor(id)8   age:factor(id)9  
       -1.32e-15          1.75e-01         -2.00e-01         -1.00e-01  
age:factor(id)10  age:factor(id)11  age:factor(id)12  age:factor(id)13  
        7.50e-02          3.00e-01          5.75e-01          4.00e-01  
age:factor(id)14  age:factor(id)15  age:factor(id)16  age:factor(id)17  
        3.75e-01         -2.00e-01          4.75e-01          3.00e-01  
age:factor(id)18  age:factor(id)19  age:factor(id)20  age:factor(id)21  
        4.25e-01          1.47e-15          6.00e-01          3.75e-01  
age:factor(id)22  age:factor(id)23  age:factor(id)24  age:factor(id)25  
       -5.00e-02          6.25e-01          1.58e+00          1.50e-01  
age:factor(id)26  age:factor(id)27  
        7.50e-01          1.75e-01  

6.8.2.2 Third Approach: Random Effects Models

A more efficient way to allow individuals to have their own slopes is to assume a probability distribution for the slopes (and typically intercepts) by assuming the vector of individual intercept and slope is bivariate Normal with covariance matrix \(G\), \((b_{i0},b_{i1}) \sim N(0,G)\). This is a random slope and intercept model,

\[Y_{ij} = (\beta_0 + b_{i0}) + (\beta_1 + b_{i1})x_{ij} + \epsilon_{ij},\quad \quad (b_{i0},b_{i1}) \sim N(0,G), \epsilon_{ij}\stackrel{iid}{\sim} N(0,\sigma^2_e),\] where \(\epsilon_{ij}\) and \((b_{i0}, b_{i1})\) are independent of each other for any \(i\) and \(j\).

summary(lmer(distance ~ age + (age|id), data = dental))
Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ age + (age | id)
   Data: dental

REML criterion at convergence: 442.6

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-3.223 -0.494  0.007  0.472  3.916 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 id       (Intercept) 5.4166   2.327         
          age         0.0513   0.226    -0.61
 Residual             1.7162   1.310         
Number of obs: 108, groups:  id, 27

Fixed effects:
            Estimate Std. Error t value
(Intercept)  16.7611     0.7753   21.62
age           0.6602     0.0713    9.27

Correlation of Fixed Effects:
    (Intr)
age -0.848

In the output above, you’ll see information about the standardized or scaled residuals.

Below it, you'll see information on the random effects. In our case, we have a random intercept and a random slope for age for each individual, which we allowed by adding + (age | id) to the formula. We can read (age | id) as: estimate an individual intercept and an individual slope for age, conditional on (|) the grouping variable id. We assumed that \((b_{i0}, b_{i1}) \sim N(0,G)\). The estimated covariance matrix \(G\) is

\[\hat{G} = \left(\begin{array}{cc} 5.41 & -0.61\cdot 2.32\cdot 0.22\\ -0.61\cdot 2.32\cdot 0.22 & 0.05 \end{array}\right)\]

Additionally, we have assumed that our errors are independent with constant variance, \(\hat{\sigma}^2_e = 1.72\), \(\hat{\sigma}_e = 1.31\).

Lastly, we have our fixed effects, the parameters we don't assume are random, \(\beta_0\) and \(\beta_1\). Those estimates are \(\hat{\beta}_0 = 16.76\) and \(\hat{\beta}_1 = 0.66\). You'll notice that these estimates changed very little when we added the random slope. This will more often happen when modeling a continuous outcome (not when modeling a binary or count outcome).
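To pull the estimated covariance matrix \(G\) directly from the fit, we can refit the model under a name (mod.randomslope, which is also used later in these notes) and use VarCorr():

mod.randomslope <- lmer(distance ~ age + (age|id), data = dental)
VarCorr(mod.randomslope)$id  # 2 x 2 estimated G for the (intercept, age slope) random effects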

6.8.3 Multi-level or Hierarchical Model

We can write the random intercept and slope model by creating layers of models. We can write the model for the mean as the level 1 model

\[\mu_{ij} = \beta_{i0} + \beta_{i1} X_{ij}\quad\text{ Level 1}\]

where the individual intercepts and slopes can be written as a level 2 model,

\[\beta_{i0} = \beta_{0} + b_{i0}\quad\text{ Level 2}\] and/or \[\beta_{i1} = \beta_{1} + b_{i1} \quad\text{ Level 2}\]

where \(b_{i0}\sim N(0,\tau^2_0)\), \(b_{i1}\sim N(0,\tau^2_1)\), and \(Cov(b_{i0},b_{i1})=\tau_{01}\).

This is often called a multi-level or hierarchical model in that the model is written with more than one level of models.

Combine the levels by plugging the level 2 models into the level 1 model (and adding the observation-level error \(\epsilon_{ij}\)).

\[Y_{ij} = \beta_{0} + b_{i0} +(\beta_{1} + b_{i1} ) x_{ij} + \epsilon_{ij}\]

\[ = (\beta_{0} + \beta_{1}x_{ij}) + (b_{i1} x_{ij} +b_{i0})+ \epsilon_{ij}\]

\[ = \text{Fixed effects} + \text{Random effects} + \text{Error}\]

This composite model is often called a mixed effects model because there are fixed-parameter effects and random parameter effects.

6.8.4 Mixed Effects Model

In general, we can write a mixed-effects model for our outcome vector,

\[\mathbf{Y}_{i} = \mathbf{X}_{i}\boldsymbol\beta + \mathbf{Z}_i\mathbf{b}_i + \boldsymbol\epsilon_i\]

\[ = \underbrace{\text{Fixed effects}}_{\text{mean model}} + \underbrace{\text{Random effects}}_{\text{between unit variation}} + \underbrace{\text{Error}}_{\text{within unit variation}}\]

where \(\mathbf{X}_i\) is the design matrix for the fixed effects, \(\mathbf{Z}_i\) is the design matrix for the random effects (a subset of columns of \(\mathbf{X}_i\)), \(\boldsymbol\epsilon_i\sim N(0,\boldsymbol\Sigma)\), and \(\mathbf{b}_i\sim N(0,\mathbf{G})\).

\[E(\mathbf{Y}_{i}) = E(\mathbf{X}_{i}\boldsymbol\beta + \mathbf{Z}_i\mathbf{b}_i + \boldsymbol\epsilon_i)\] \[= \mathbf{X}_{i}\boldsymbol\beta\]

\[Cov(\mathbf{Y}_{i}) = Cov(\mathbf{X}_{i}\boldsymbol\beta + \mathbf{Z}_i\mathbf{b}_i + \boldsymbol\epsilon_i)\] \[= Cov(\mathbf{Z}_i\mathbf{b}_i + \boldsymbol\epsilon_i)\]

Assuming \(\mathbf{b}_i\) and \(\boldsymbol\epsilon_i\) are independent, then

\[\mathbf{V}_i = Cov(\mathbf{Y}_{i}) = Cov(\mathbf{Z}_i\mathbf{b}_i) + Cov(\boldsymbol\epsilon_i)\] \[= \mathbf{Z}_iCov(\mathbf{b}_i)\mathbf{Z}_i^T + \boldsymbol\Sigma\] \[= \mathbf{Z}_i\mathbf{G}\mathbf{Z}_i^T + \boldsymbol\Sigma\]

Since \(\mathbf{b}_i\) and \(\boldsymbol\epsilon_i\) are almost always assumed Normal, then

\[\mathbf{Y}_{i}\sim \mathcal{N}(\mathbf{X}_{i}\boldsymbol\beta,\mathbf{Z}_i\mathbf{G}\mathbf{Z}_i^T + \boldsymbol\Sigma)\]
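As a small sketch of this calculation, here is the implied marginal covariance and correlation for a single child in the dental data (four measurements, assumed to be taken at ages 8, 10, 12, and 14), using mod.randomslope from above:

Zi <- cbind(1, c(8, 10, 12, 14))         # random effects design: intercept and age
G  <- VarCorr(mod.randomslope)$id        # estimated covariance of (b_i0, b_i1)
s2 <- sigma(mod.randomslope)^2           # estimated residual variance sigma^2_e
Vi <- Zi %*% G %*% t(Zi) + s2 * diag(4)  # Z_i G Z_i^T + Sigma, with Sigma = sigma^2_e I
cov2cor(Vi)                              # implied marginal correlations among the 4 measurements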

6.8.5 History

The origins of mixed effects models go back to R.A. Fisher's work within the analysis of variance (ANOVA) paradigm in the 1910s through the 1930s.

  • Earliest mixed effects model: random intercept model
  • Work continued in the ANOVA framework until 1970’s
  • Mixed effects models in a linear model framework are based on the ANOVA paradigm
  • Seminal paper by Harville in 1977
  • Highly cited mixed models paper by Laird and Ware in 1982

6.8.6 Interpretation

We typically discuss how a 1 unit increase in a variable in X impacts the mean, keeping all other variables constant.

With random effects, this means interpreting the change within a unit or subject.

The expected response, given a subject's random effects, is

\[E(\mathbf{Y}_{i}|\mathbf{b}_i) = \mathbf{X}_i\boldsymbol\beta + \mathbf{Z}_i\mathbf{b}_i \]

For the sake of simplicity, let’s consider the random intercept model, \[E(Y_{ij}|b_i) = \beta_0 +b_i+\beta_1X_{ij} \]

The expected response for value of \(X_{ij} = x\) is

\[E(Y_{ij}|b_i, X_{ij}=x) = \beta_0 +b_i+\beta_1x \]

The expected response for value of \(X_{ij} = x+1\) is

\[E(Y_{ij}|b_i, X_{ij}=x+1) = \beta_0 +b_i+\beta_1(x+1) \]

The difference in the expected response for individual \(i\) with random effect \(b_i\) is

\[E(Y_{ij}|b_i, X_{ij}=x+1) - E(Y_{ij}|b_i, X_{ij}=x)\] \[= \beta_0 +b_i+\beta_1(x+1) - (\beta_0 +b_i+\beta_1x) = \beta_1 \]

This means we can give a subject-specific interpretation!

We can also consider the overall or marginal mean, \(E(Y_{ij})\).

First, recall the definition of conditional expectation in the discrete setting, \[E(Y|B = b) = \sum_y y \cdot P(Y = y|B = b)\]

Then we can show that

\[E(Y) = E\left(E(Y|B)\right)\]

The iterated expectation also holds for continuous random variables.

Applying this to the random intercept model, the overall or marginal mean \(E(Y_{ij})\) is

\[E(Y_{ij}) = E(E(Y_{ij}|b_i)) = E(\beta_0 +b_i+\beta_1X_{ij}) = \beta_0 +\beta_1X_{ij} \] since \(E(b_i) = 0\).

The expected response for value of \(X_{ij} = x\) is

\[E(Y_{ij}|X_{ij}=x) = \beta_0 +\beta_1x \]

The expected response for value of \(X_{ij} = x+1\) is

\[E(Y_{ij}|X_{ij}=x+1) = \beta_0 +\beta_1(x+1) \]

The difference is \[E(Y_{ij}| X_{ij}=x+1) -E(Y_{ij}| X_{ij}=x)\] \[= \beta_0 +\beta_1(x+1) - (\beta_0 +\beta_1x) = \beta_1 \]

Therefore, in the context of a linear mixed effects model, \(\beta_1\) has the interpretation of the subject-specific change as well as the population mean change.

A 1-unit increase in \(X\) leads to a \(\beta_1\) increase in the subject’s mean response, keeping all other variables constant.

A 1-unit increase in \(X\) leads to a \(\beta_1\) increase in the overall mean response, keeping all other variables constant.

6.8.7 Estimation

6.8.7.1 Maximum Likelihood Estimation

Mixed Effects Models require specifying the entire distribution. We assume probability models for the random effects and the errors, and thus we could use maximum likelihood estimation to find estimates for our slopes.

If we specify the REML argument in lmer as REML = FALSE, then the lmer function will maximize the log-likelihood function (based on Normal assumptions for the errors and random effects),

\[l = -\frac{mn}{2}\log(2\pi) - \frac{1}{2}\sum^n_{i=1} \log|\mathbf{V}_i|\] \[-\frac{1}{2}\left\{\sum^n_{i=1}(\mathbf{y}_i - \mathbf{X}_i\boldsymbol\beta)^T\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{X}_i\boldsymbol\beta) \right\}\]

where \(\mathbf{V}_i\) is the composite covariance matrix.

  • Estimates of the \(\boldsymbol\beta\)'s are unbiased if the model is correct (random effects and mean model).
  • All parameter estimates are consistent if the model is correct.
  • All parameter estimates are asymptotically unbiased; as \(n\rightarrow\infty\), they become unbiased (if model is correct).
  • Estimates are asymptotically Normal; as \(n\rightarrow\infty\), the sampling distribution becomes more Normal.
  • Estimates of variance of random effects are biased with finite \(n\).

6.8.7.2 Restricted Maximum Likelihood Estimation

Patterson and Thompson (1971) and Harville (1974) provided technical solutions to the issues of maximum likelihood. They suggested maximizing the likelihood of observing the sample residuals rather than the sample data. This is referred to as restricted maximum likelihood or REML and is implemented in lmer by default when REML = TRUE.
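Before getting into the algorithm, here is a quick sketch (dental data from above, hypothetical object names) comparing the variance component estimates under the two approaches:

mod.ml   <- lmer(distance ~ age + (1|id), data = dental, REML = FALSE)  # maximum likelihood
mod.reml <- lmer(distance ~ age + (1|id), data = dental, REML = TRUE)   # REML (the default)
VarCorr(mod.ml)    # ML variance components are biased downward in small samples
VarCorr(mod.reml)  # REML variance components are less biased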

REML Algorithm:

  • Estimate fixed effects using OLS.
  • Write down the likelihood of residuals in terms of residuals and variance parameters.
  • Then maximize likelihood with respect to variance parameters.

Another REML Algorithm:

  • Split the likelihood into one part about the mean and one part about the variance.
  • First, maximize the variance part to get estimates of the variance parameters
  • Then maximize the part about the mean using the estimated variance.

After some complicated mathematics, you ultimately end up maximizing the following with respect to the parameters of \(\mathbf{V}_i\),

\[l = -\frac{mn}{2}\log(2\pi) - \frac{1}{2}\sum^n_{i=1} \log|\mathbf{V}_i|\] \[-\frac{1}{2}\left\{\sum^n_{i=1}(\mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol\beta})^T\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol\beta})\right\}\] \[- \frac{1}{2}\log \left|\sum^n_{i=1}\mathbf{X}_i^T\mathbf{V}_i^{-1}\mathbf{X}_i\right|\]

and \(\hat{\boldsymbol\beta}= \left(\sum^n_{i=1}\mathbf{X}_i^T\mathbf{V}_i^{-1}\mathbf{X}_i\right)^{-1}\left(\sum^n_{i=1}\mathbf{X}_i^T\mathbf{V}_i^{-1}\mathbf{Y}_i\right)\) are the generalized least squares (GLS) estimates.

Advantages of REML

  • Less bias in variance estimators.
  • \(\hat{\boldsymbol\beta}\) is still unbiased as long as the model is correct.

Disadvantages of REML

  • You can only compare models that differ in their variance components with REML.
  • You cannot compare models with differing fixed effects because you didn’t maximize the full information (use ML for this).

6.8.8 Model Selection

6.8.8.1 Hypothesis Testing for Fixed Effects (one at a time)

To decide which fixed effects should be in the model, we can test whether \(\beta_k=0\).

The estimated covariance matrix for \(\hat{\boldsymbol\beta}\) is \[\widehat{Cov}(\hat{\boldsymbol\beta}) = \left\{ \sum^n_{i=1} \mathbf{X}_i^T\hat{\mathbf{V}}_i^{-1}\mathbf{X}_i \right\}^{-1}\]

The standard error for \(\hat{\beta}_k\) is the square root of the diagonal entry of this matrix corresponding to \(\hat{\beta}_k\). This is what is reported in the summary output.

mod.randomslope <- lmer(distance~age + (age|id), data = dental)
lme4::fixef(mod.randomslope)
(Intercept)         age 
    16.7611      0.6602 
vcov(mod.randomslope)
2 x 2 Matrix of class "dpoMatrix"
            (Intercept)       age
(Intercept)     0.60105 -0.046855
age            -0.04685  0.005077
summary(mod.randomslope)
Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ age + (age | id)
   Data: dental

REML criterion at convergence: 442.6

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-3.223 -0.494  0.007  0.472  3.916 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 id       (Intercept) 5.4166   2.327         
          age         0.0513   0.226    -0.61
 Residual             1.7162   1.310         
Number of obs: 108, groups:  id, 27

Fixed effects:
            Estimate Std. Error t value
(Intercept)  16.7611     0.7753   21.62
age           0.6602     0.0713    9.27

Correlation of Fixed Effects:
    (Intr)
age -0.848

To test the hypothesis \(H_0: \beta_k=0\) vs. \(H_A: \beta_k\not=0\), we can calculate a z-statistic,

\[z = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)}\]

There is debate about the appropriate distribution of this statistic, which is why p-values are not reported in the output. However, if you have an estimate that is large relative to the standard error, that indicates that it is significantly different from zero.
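If you want rough p-values anyway, one common shortcut is to treat the reported t values as approximately standard Normal (reasonable with a reasonably large sample); a sketch using mod.randomslope from above:

tvals <- coef(summary(mod.randomslope))[, "t value"]  # Wald statistics for the fixed effects
2 * pnorm(abs(tvals), lower.tail = FALSE)             # approximate two-sided p-values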

To test a more involved hypothesis \(H_0: \mathbf{L}\boldsymbol\beta=0\) (for multiple rows), we can calculate a Wald statistic,

\[W^2 = (\mathbf{L}\hat{\boldsymbol\beta})^T(\mathbf{L}\widehat{Cov}(\hat{\boldsymbol\beta})\mathbf{L}^T)^{-1}(\mathbf{L}\hat{\boldsymbol\beta})\]

Then we assume the sampling distribution is approximately \(\chi^2\) with df = # of rows of \(\mathbf{L}\) to calculate p-values (as long as \(n\) is large). Below is some example R code.

b = fixef(mod1)
W = vcov(mod1)

L = matrix(c(0,0,0,1),nrow=1)

L%*%b
(se = sqrt(diag(L%*%W%*%t(L)))) ## SEs for Lb (model-based, from vcov)

##95% Confidence Interval (using Asymptotic Normality)
L%*%b - 1.96*se
L%*%b + 1.96*se

##Hypothesis Testing
w2 <- as.numeric( t(L%*%b) %*% solve(L %*% W %*% t(L))%*% (L%*%b)) ## should be approximately chi-squared
1 - pchisq(w2, df = nrow(L)) #p-value

Let's consider the data on the guinea pigs. In particular, we fit a model with a random intercept and fixed effects for dose, time, and the interaction between dose and time. If we want to know whether the interaction term is necessary, we need to test whether both of its coefficients are equal to 0.

pigs <- read.table("./data/diet.dat", col.names=c("id", paste("bw.",c(1,3,4,5,6,7),sep=""), "dose" ))
pigs <- pigs %>%
  gather(Tmp,bw, bw.1:bw.7) %>%
  separate(Tmp,into=c('Var','time'), remove=TRUE) 

pigs$time <- as.numeric(pigs$time)
pigs$dose <- factor(pigs$dose)
levels(pigs$dose) <- c("Zero dose",' Low dose', 'High dose')

mod.pigs <- lmer(bw ~ dose*time + (1|id), data = pigs)
summary(mod.pigs)
Linear mixed model fit by REML ['lmerMod']
Formula: bw ~ dose * time + (1 | id)
   Data: pigs

REML criterion at convergence: 843.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.5621 -0.4107  0.0169  0.6843  2.1099 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 1347     36.7    
 Residual              703     26.5    
Number of obs: 90, groups:  id, 15

Fixed effects:
                   Estimate Std. Error t value
(Intercept)          469.69      20.15   23.31
dose Low dose          4.84      28.50    0.17
doseHigh dose         11.15      28.50    0.39
time                  16.03       2.45    6.53
dose Low dose:time     6.53       3.47    1.88
doseHigh dose:time     3.60       3.47    1.04

Correlation of Fixed Effects:
            (Intr) dsLwds dsHghd time   dsLds:
doseLowdose -0.707                            
doseHighdos -0.707  0.500                     
time        -0.528  0.373  0.373              
dosLowds:tm  0.373 -0.528 -0.264 -0.707       
dosHghds:tm  0.373 -0.264 -0.528 -0.707  0.500
b = lme4::fixef(mod.pigs)
W = vcov(mod.pigs)
(L <- matrix(c(rep(0,4),1,0,rep(0,5),1),nrow=2,byrow=TRUE))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    0    1    0
[2,]    0    0    0    0    0    1
w2 <-as.numeric( t(L%*%b) %*% solve(L %*% W %*% t(L))%*% (L%*%b))
1- pchisq(w2, df = nrow(L)) #We don't have enough evidence to reject null (thus, don't need interaction term)
[1] 0.1695

Lastly, another option is a likelihood ratio test. Suppose we have two models with the same random effects and covariance models.

  • Full model: \(M_f\) has \(p\) columns in \(\mathbf{X}_i\) and thus \(p\) fixed effects \(\beta_0,...,\beta_{p-1}\).
  • Nested model: \(M_n\) has \(k < p\) columns, so that the remaining \(p-k\) fixed effects satisfy \(\beta_l = 0\).

If we use maximum likelihood estimation, it makes sense to compare the likelihood of two models.

\(H_0:\) nested model is true, \(H_A:\) full model is true

If we take the ratio of the likelihoods from the nested model and full model and plug in the maximum likelihood estimators, then we have another statistic.

\[D = -2\log\left(\frac{L_n(\hat{\boldsymbol\beta},\hat{\mathbf{V}})}{L_f(\hat{\boldsymbol\beta},\hat{\mathbf{V}})}\right) = -2 \log(L_n(\hat{\boldsymbol\beta},\hat{\mathbf{V}})) + 2\log(L_f(\hat{\boldsymbol\beta},\hat{\mathbf{V}}))\]

The sampling distribution of this statistic is approximately chi-squared with degrees of freedom equal to the difference in the number of parameters between the two models.

mod.pigsML <- lmer(bw ~ dose*time + (1|id), data = pigs, REML=FALSE)
mod.pigsML2 <- lmer(bw ~ dose+time + (1|id), data = pigs, REML=FALSE)

anova(mod.pigsML,mod.pigsML2)
Data: pigs
Models:
mod.pigsML2: bw ~ dose + time + (1 | id)
mod.pigsML: bw ~ dose * time + (1 | id)
            npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
mod.pigsML2    6 892 907   -440       880                    
mod.pigsML     8 893 913   -438       877  3.61  2       0.16
(Dstat <- -2*logLik(mod.pigsML2)+2*logLik(mod.pigsML))
'log Lik.' 3.609 (df=6)
1-pchisq(Dstat,df = 2) #df = difference in number of parameters
'log Lik.' 0.1645 (df=6)

6.8.8.2 Information Criteria for Choosing Fixed Effects

To use BIC to choose models, you must use maximum likelihood estimation instead of REML.

mod.pigsML <- lmer(bw ~ dose*time + (1|id), data = pigs, REML=FALSE)
summary(mod.pigsML)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: bw ~ dose * time + (1 | id)
   Data: pigs

      AIC       BIC    logLik -2*log(L)  df.resid 
    892.9     912.9    -438.4     876.9        82 

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-2.656 -0.405  0.034  0.684  2.167 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 1059     32.5    
 Residual              675     26.0    
Number of obs: 90, groups:  id, 15

Fixed effects:
                   Estimate Std. Error t value
(Intercept)          469.69      18.52   25.36
dose Low dose          4.84      26.19    0.18
doseHigh dose         11.15      26.19    0.43
time                  16.03       2.40    6.66
dose Low dose:time     6.53       3.40    1.92
doseHigh dose:time     3.60       3.40    1.06

Correlation of Fixed Effects:
            (Intr) dsLwds dsHghd time   dsLds:
doseLowdose -0.707                            
doseHighdos -0.707  0.500                     
time        -0.563  0.398  0.398              
dosLowds:tm  0.398 -0.563 -0.281 -0.707       
dosHghds:tm  0.398 -0.281 -0.563 -0.707  0.500
BIC(mod.pigsML)
[1] 912.9
mod.pigsML2 <- lmer(bw ~ dose+time + (1|id), data = pigs, REML=FALSE)
summary(mod.pigsML2)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: bw ~ dose + time + (1 | id)
   Data: pigs

      AIC       BIC    logLik -2*log(L)  df.resid 
    892.5     907.5    -440.2     880.5        84 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8167 -0.4878 -0.0196  0.6569  2.1617 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 1053     32.5    
 Residual              708     26.6    
Number of obs: 90, groups:  id, 15

Fixed effects:
              Estimate Std. Error t value
(Intercept)     455.05      16.50   27.58
dose Low dose    33.13      21.65    1.53
doseHigh dose    26.77      21.65    1.24
time             19.40       1.42   13.64

Correlation of Fixed Effects:
            (Intr) dsLwds dsHghd
doseLowdose -0.656              
doseHighdos -0.656  0.500       
time        -0.374  0.000  0.000
BIC(mod.pigsML2) #This model has a lower BIC
[1] 907.5

6.8.8.3 Hypothesis Testing for Random Effects

The main way to compare models with different random effects is to fit two models with the same fixed effects and use a likelihood ratio test with ML to compare the two models' fit. Again, the null hypothesis is that the smaller model (less complex model – fewer parameters) is the true model. If the more complex model has a noticeably higher log-likelihood, we may get a small p-value, suggesting that we reject the null hypothesis in favor of the more complex model.

Below, I compare two models fit to the guinea pig data. They have the same fixed effects, but I allowed one to have a random slope and intercept while the other only has a random intercept. By comparing these two models, we find that the one with a random slope and intercept better fits the data. Thus, we reject the null hypothesis that the model with only the random intercept is true in favor of the model with the random slope and intercept.

mod.pigsML2 <- lmer(bw ~ dose+time + (1|id), data = pigs, REML=FALSE)
mod.pigsML3 <- lmer(bw ~ dose+time + (time|id), data = pigs, REML=FALSE)

anova(mod.pigsML2, mod.pigsML3)
Data: pigs
Models:
mod.pigsML2: bw ~ dose + time + (1 | id)
mod.pigsML3: bw ~ dose + time + (time | id)
            npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)    
mod.pigsML2    6 892 907   -440       880                        
mod.pigsML3    8 879 899   -432       863  17.3  2    0.00017 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.8.9 Predicting Random Effects

The best predictor of the random effects is

\[E(\mathbf{b}_i|\mathbf{Y}_i) = \mathbf{G}\mathbf{Z}_i^T\mathbf{V}_i^{-1}(\mathbf{Y}_i - \mathbf{X}_i\hat{\boldsymbol\beta})\]

This is the best linear unbiased predictor (BLUP).

Therefore,

\[\hat{\mathbf{b}}_i = \hat{\mathbf{G}}\mathbf{Z}_i^T\hat{\mathbf{V}_i}^{-1}(\mathbf{Y}_i - \mathbf{X}_i\hat{\boldsymbol\beta})\]

which is called the empirical BLUP. We can get this with R’s ranef() function.

lme4::ranef(mod.randomslope)$id
   (Intercept)       age
1      -0.4859 -0.178215
2      -1.0120  0.009873
3      -0.7730  0.050657
4       1.0694 -0.029879
5       0.5170 -0.167976
6      -0.6205 -0.186561
7      -0.1877 -0.068856
8       1.2507 -0.174430
9      -0.2908 -0.218052
10     -2.2816 -0.250575
11      1.2178  0.083180
12      1.0516  0.215684
13     -0.7276  0.014519
14     -0.1740  0.035857
15      3.0010 -0.065931
16     -1.1769  0.025619
17      1.2178  0.083180
18     -0.6081  0.034911
19      0.8605 -0.094755
20     -0.4446  0.135924
21      2.6535  0.211123
22      0.8907 -0.118846
23     -0.9982  0.114586
24     -4.1305  0.413754
25      0.9045 -0.014133
26     -0.5352  0.208199
27     -0.1877 -0.068856
hist(lme4::ranef(mod.randomslope)$id$`(Intercept)`)

hist(lme4::ranef(mod.randomslope)$id$age)

We can see here that the predicted random effects are not necessarily Normally distributed (which was an assumption we made in fitting the model).
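A Normal quantile-quantile plot is another quick way to check this assumption; a short sketch using the predicted random effects above:

re <- lme4::ranef(mod.randomslope)$id
qqnorm(re$`(Intercept)`); qqline(re$`(Intercept)`)  # predicted random intercepts
qqnorm(re$age); qqline(re$age)                      # predicted random age slopes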

6.8.10 Predicting Outcomes

Thus our best prediction of \(\mathbf{Y}_i\) is

\[\hat{\mathbf{Y}}_i = \mathbf{X}_i\hat{\boldsymbol\beta} +\mathbf{Z}_i\hat{\mathbf{b}}_i\]

which can be rewritten as a weighted average of the population marginal mean, \(\mathbf{X}_i\hat{\boldsymbol\beta}\), and the individual response \(\mathbf{Y}_i\),

\[\hat{\mathbf{Y}}_i = (\hat{\boldsymbol\Sigma}\hat{\mathbf{V}}_i^{-1})\mathbf{X}_i\hat{\boldsymbol\beta} +(I - \hat{\boldsymbol\Sigma}\hat{\mathbf{V}}_i^{-1})\mathbf{Y}_i\]

predict(mod.randomslope)
    1     2     3     4     5     6     7     8     9    10    11    12    13 
20.13 21.09 22.06 23.02 21.11 22.45 23.79 25.13 21.67 23.10 24.52 25.94 22.87 
   14    15    16    17    18    19    20    21    22    23    24    25    26 
24.13 25.39 26.65 21.22 22.20 23.18 24.17 19.93 20.88 21.82 22.77 21.30 22.49 
   27    28    29    30    31    32    33    34    35    36    37    38    39 
23.67 24.85 21.90 22.87 23.84 24.81 20.01 20.89 21.78 22.66 17.76 18.58 19.39 
   40    41    42    43    44    45    46    47    48    49    50    51    52 
20.21 23.93 25.41 26.90 28.39 24.82 26.57 28.32 30.07 21.43 22.78 24.13 25.48 
   53    54    55    56    57    58    59    60    61    62    63    64    65 
22.16 23.55 24.94 26.33 24.52 25.70 26.89 28.08 21.07 22.44 23.81 25.19 23.93 
   66    67    68    69    70    71    72    73    74    75    76    77    78 
25.41 26.90 28.39 21.71 23.10 24.49 25.88 22.15 23.28 24.41 25.54 22.69 24.28 
   79    80    81    82    83    84    85    86    87    88    89    90    91 
25.87 27.46 26.39 28.13 29.87 31.61 21.98 23.07 24.15 25.23 21.96 23.51 25.06 
   92    93    94    95    96    97    98    99   100   101   102   103   104 
26.61 21.22 23.37 25.52 27.67 22.83 24.13 25.42 26.71 23.17 24.91 26.65 28.38 
  105   106   107   108 
21.30 22.49 23.67 24.85 
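For comparison, predict() in lme4 can also return population-level predictions that set the random effects to zero (via the re.form argument); a short sketch:

head(predict(mod.randomslope))                # subject-specific: X beta + Z b
head(predict(mod.randomslope, re.form = NA))  # population-level (marginal): X beta only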

6.8.11 Generalized Linear Mixed Effects Models

We can generalize the linear mixed effects model by using a link function, \(g()\), to connect the conditional mean of the outcome (given the random effects) to a linear model that includes the random effects,

\[g(E(\mathbf{Y}_{i}\mid \mathbf{b}_i)) = \mathbf{X}_i\boldsymbol\beta + \mathbf{Z}_i\mathbf{b}_i\] where \[ \mathbf{b}_i \sim N(0,G)\]

Similar to GLMs, we assume an outcome distribution from the exponential family, which determines the relationship between the conditional mean and the variance,

\[Var(Y_{ij}|\mathbf{b}_{i}) = \phi v(E(Y_{ij}|\mathbf{b}_i))\]

summary(glmer(resp ~ age + smoke + (1|id), data = ohio, family = binomial))
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: resp ~ age + smoke + (1 | id)
   Data: ohio

      AIC       BIC    logLik -2*log(L)  df.resid 
   1597.9    1620.6    -794.9    1589.9      2144 

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-1.403 -0.180 -0.158 -0.132  2.518 

Random effects:
 Groups Name        Variance Std.Dev.
 id     (Intercept) 5.49     2.34    
Number of obs: 2148, groups:  id, 537

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -3.374      0.275  -12.27   <2e-16 ***
age           -0.177      0.068   -2.60   0.0093 ** 
smoke          0.415      0.287    1.45   0.1484    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
      (Intr) age   
age    0.227       
smoke -0.419 -0.010
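As a brief follow-up sketch (hypothetical object name), exponentiating the fixed effects gives odds ratios, but for this GLMM they are subject-specific (conditional on a person's random intercept), unlike the population-averaged odds ratios implied by the GEE fit earlier:

glmm.ohio <- glmer(resp ~ age + smoke + (1|id), data = ohio, family = binomial)
exp(lme4::fixef(glmm.ohio))  # subject-specific odds ratios for age and smoke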