As we introduced at the beginning of the course, most models we use for data are written as a sum of components: a deterministic or systematic component plus noise or error,
\[Y_t = f(x_t) + \epsilon_t\]
The deterministic component would generally give the average outcome as a function of \(x_t\), which could represent time/space and/or explanatory variables measured across time/space.
In Stat 155 (Introduction to Statistical Modeling), we model the deterministic component with linear combinations of explanatory variables, \(f(\mathbf{x}) = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p\), and we assume the noise is independent.
In Stat 253 (Statistical Machine Learning), we learn parametric and nonparametric tools to model the deterministic component (polynomials, splines, local regression, KNN, trees/forests, etc.) for prediction.
We’ll build on your existing knowledge and learn a few more tools for modeling the deterministic components that arise with correlated data. We’ll use the notation of time series to discuss model components, but these ideas apply to longitudinal and spatial data.
Typically, we write a time series model with three components (not just two) and model them separately: a trend component, a seasonality component, and a noise component.
Let’s define these model components and the types of questions that are driving the modeling.
Trend: This is the long-range average outcome pattern over time. We often ask ourselves, “Is the data generally increasing or decreasing over time? How is it changing? Linearly? Exponentially?”
Seasonality: This refers to cyclical patterns related to calendar time, such as seasons, quarters, months, days of the week, or time of day. We might wonder, “Are there daily cycles, weekly cycles, monthly cycles, annual cycles, and/or multi-year cycles?” (e.g., the amount of sunshine has a daily and an annual cycle, and the climate has multi-year El Niño and La Niña cycles on top of the annual seasonality of temperature)
Noise: This is the high-frequency variability in the outcome not attributable to trend or seasonality. In other words, noise is what is left over. We might break this noise into two components:
serial correlation: the size and direction of the noise today are likely to be similar tomorrow
independent measurement error: due to natural variation in the measurement device
We often ask, “Are there structures/patterns in the noise so that we could use probability to model the serial correlation? Is there a certain range of time or space in which we have dependence? Is the magnitude of the variability constant across time?”
4.1 Trend
Let’s start by focusing on the trend component. Our goal here is to estimate the overall outcome pattern over time.
We’ll plot the residuals (what is left over) over time to see if we’ve successfully estimated and removed the trend, checking whether the average residual is roughly constant at zero across time. If not, we must try a different approach to model/remove the trend.
4.1.1 Parametric Approaches
4.1.1.1 Global Transformations
If the trend, \(f(x_t)\), is linearly increasing or decreasing in time (with no seasonality), then we could use a linear regression model to estimate the trend with the following model,
\[Y_t = \beta_0 + \beta_1x_t + \epsilon_t\] where \(x_t\) is some measure of time (e.g., age, time from baseline, etc.).
If the overall mean trend is quadratic, we could include an \(x_t^2\) term in the regression model,
\[Y_t = \beta_0 + \beta_1x_t + \beta_2x_t^2 + \epsilon_t\]
Any standard regression technique for non-linear relationships (polynomials, splines, etc.) could also be used here to model the trend. As discussed in Chapter 1, we can use ordinary least squares (OLS), lm() in R, to get an unbiased estimate of these coefficients in linear regression. But remember, the standard errors and subsequent inference (p-values, confidence intervals) may not be valid due to correlated noise.
Let’s look at some example data and see how we can model the trend.
Data Example 1
Below are the quarterly earnings per share of the company Johnson and Johnson from 1960 to 1980, available in the astsa package (Stoffer 2025). The first observation is the 1st quarter of 1960 (\(t=1\)), and the second is the 2nd quarter of 1960 (\(t=2\)).
plot(jj) # plot() is smart enough to make the right x-axis because jj is ts object
We see that the earnings are overall increasing over time, but a bit faster than a linear relationship (there is some curvature in the trend). Let’s transform this dataset into a data frame to use ggplot2, an alternative graphing package in R.
library(ggplot2)
library(dplyr)

jjTS <- data.frame(
  Value = as.numeric(jj),
  Time = as.numeric(time(jj)), # time() works on ts objects
  Quarter = as.numeric(cycle(jj))) # cycle() works on ts objects

quadTrendMod <- lm(Value ~ poly(Time, degree = 2, raw = TRUE), data = jjTS) # poly() fits a polynomial trend model

jjTS <- jjTS %>%
  mutate(Trend = predict(quadTrendMod)) # Add the estimated trend to the dataset

jjTS %>%
  ggplot(aes(x = Time, y = Value)) +
  geom_point() +
  geom_line() +
  geom_line(aes(x = Time, y = Trend), color = 'blue', linewidth = 1.5) +
  theme_classic()
Data Example 2
Below are the monthly counts (in thousands) of live births in the United States from 1948 to 1979. The first measurement is from January 1948 (\(t=1\)), then February 1948 (\(t=2\)), etc.
plot(birth) # plot() is smart enough to make the right x-axis because birth is ts object
Looking at the plot (time-ordered points are connected with a line segment), we note a positive linear relationship from about 1948 to 1960 and a somewhat negative relationship between 1960 and 1968, with another slight increase and then a drop around 1972. This is not a polynomial trend over time. We’ll need another approach!
Let’s also transform this dataset into a data frame to use ggplot2.
birthTS <- data.frame(
  Year = floor(as.numeric(time(birth))), # time() works on ts objects
  Month = as.numeric(cycle(birth)), # cycle() works on ts objects
  Value = as.numeric(birth))

birthTS <- birthTS %>%
  mutate(DecimalTime = Year + (Month - 1)/12)

birthTS %>%
  ggplot(aes(x = DecimalTime, y = Value)) +
  geom_line() +
  theme_classic()
4.1.1.2 Splines
Let’s try a regression basis spline, which is a smooth piecewise polynomial. This means we break up the x-axis into intervals using knots or breakpoints and model a different polynomial in each interval. In particular, let’s fit a quadratic polynomial with knots/breaks at 1965 and 1972.
library(splines)

TrendSpline <- lm(Value ~ bs(DecimalTime, degree = 2, knots = c(1965, 1972)), data = birthTS)
summary(TrendSpline)
birthTS <- birthTS %>%
  mutate(SplineTrend = predict(TrendSpline))

birthTS %>%
  ggplot(aes(x = DecimalTime, y = Value)) +
  geom_point() +
  geom_line() +
  geom_line(aes(x = DecimalTime, y = SplineTrend), color = 'blue', linewidth = 1.5) +
  theme_classic()
Then let’s see what is left over by plotting the residuals after removing the estimated trend. Is the mean of the residuals constant at zero? If so, we can be reasonably confident that we’ve estimated and removed the long-range trend.
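The residual computation happens in a chunk that isn’t shown here; a minimal sketch of that check might look like the following, where SplineResid is a column name chosen for illustration.

birthTS <- birthTS %>%
  mutate(SplineResid = Value - SplineTrend) # residual = observed - estimated trend

birthTS %>%
  ggplot(aes(x = DecimalTime, y = SplineResid)) +
  geom_point() +
  geom_smooth(se = FALSE) + # smooth of the residuals; it should hover near 0
  theme_classic()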
These parametric approaches give us an estimated function into which we can plug a time and get a predicted trend value, but they might not quite pick up the full trend.
What are alternative methods for estimating non-linear trends?
4.1.2 Nonparametric Approaches
Nonparametric methods for estimating a trend use algorithms (steps of a process) rather than parametric functions (linear, polynomial, splines) with a finite number of parameters to estimate.
4.1.2.1 Local Regression
Have you ever wondered how geom_smooth() estimates that blue smooth curve?
If you look at the documentation for geom_smooth(), you’ll see that it uses the loess() function for smaller data sets. This trend estimation method is called locally estimated scatterplot smoothing (loess), also known as local regression.
The loess algorithm predicts the outcome (estimates the trend) at a point \(x_0\) on the x-axis using the steps below (a rough sketch in R follows the list):
Distance: calculate the distance between \(x_0\) and each observed point \(x\), \(d(x,x_0)=|x-x_0|\).
Find Local Area: keep the closest \(s\) proportion of observed data points (\(s\) determined by span parameter)
Fit Model: fit a linear or quadratic regression model to the local area, weighting points close to \(x_0\) higher (weighting determined by a chosen weighting kernel function)
Predict: use that local model to predict the outcome at \(x_0\)
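To make these steps concrete, here is a rough sketch of a single loess prediction in R under simplifying assumptions (a linear local fit and tricube weights; loess() itself uses a more refined implementation). The names x0, s, keep, w, and localFit are chosen here for illustration.

x0 <- 1965 # point at which to estimate the trend
s <- 0.75 # span: proportion of observed points kept in the local area

d <- abs(birthTS$DecimalTime - x0) # Step 1: distance from each observed point to x0
keep <- rank(d) <= ceiling(s * nrow(birthTS)) # Step 2: keep the closest s proportion
w <- (1 - (d[keep] / max(d[keep]))^3)^3 # Step 3: tricube weights, larger near x0
localFit <- lm(Value ~ DecimalTime, data = birthTS[keep, ], weights = w) # local linear fit
predict(localFit, newdata = data.frame(DecimalTime = x0)) # Step 4: predict at x0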
That process is repeated for a sequence of values along the x-axis to make a plot like the one below (blue: loess, purple: spline).
birthTS %>%
  ggplot(aes(x = DecimalTime, y = Value)) +
  geom_point() +
  geom_line() +
  geom_smooth(se = FALSE, color = 'blue') + # uses loess
  geom_line(aes(x = DecimalTime, y = SplineTrend), color = 'purple', linewidth = 1.5) +
  theme_classic()
We can also get the estimated trend with the loess() function.
# Running loess separately
LocalReg <- loess(Value ~ DecimalTime, data = birthTS, span = .75)

birthTS <- birthTS %>%
  mutate(LoessTrend = predict(LocalReg))

birthTS %>%
  ggplot(aes(x = DecimalTime, y = Value)) +
  geom_point() +
  geom_line() +
  geom_line(aes(x = DecimalTime, y = LoessTrend), color = 'blue', linewidth = 1.5) +
  geom_line(aes(x = DecimalTime, y = SplineTrend), color = 'purple', linewidth = 1.5) +
  theme_classic()
4.1.2.2 Moving Average Filter
Local regression is an extension of a simple idea called a moving average filter (see Chatfield and Xing 2019 for a brief introduction to moving averages and linear filtering in general).
Imagine a window that only shows you what is “local” or right in front of you, and we can adjust the window width to include more or less data. In the illustration below, we have drawn a 2-year window around January 1960. When considering information around Jan 1960, the window limits our view to only those one year before and one year after (from Jan 1959-Jan 1961).
For our circumstances, the “local” window will include an odd number of observations (number of points \(= 1+2a\) for a given value of \(a\)). Among that small local group of observations, we take the average to estimate the mean at the center of those points (rather than fit a linear regression model),
\[\hat{f}(x_t) = \frac{1}{2a+1}\sum_{j=-a}^{a} Y_{t+j}\]
Then, we move the window and calculate the mean within that window. Continue until you have a smooth curve estimate for each point in the time range of interest.
Box <- data.frame(x = 1965, y = 325, w = 2) # 2-year window around 1965

Mean <- birthTS %>%
  filter(DecimalTime < 1965 + 1 & DecimalTime > 1965 - 1) %>%
  summarize(M = mean(Value)) %>% # mean value of points in the window
  mutate(x = 1965)

birthTS %>%
  ggplot(aes(x = DecimalTime, y = Value)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  geom_tile(data = Box, aes(x = x, y = y, width = w), height = 100,
            fill = alpha("grey", 0), colour = 'yellow', linewidth = 1) +
  geom_point(data = Mean, aes(x = x, y = M), colour = 'yellow', size = 2) +
  theme_classic()
This method works well for windows of odd-numbered length (an equal number of points on each side of the center), but it can be adjusted for an even-numbered window length by giving the two extreme lags a weight of 1/2 each so that the middle of the window lines up with time \(t\).
For the above monthly data example, we might want a 12-point moving average, so we would have
\[\widehat{\text{Trend}}_t = \frac{\tfrac{1}{2}Y_{t-6} + Y_{t-5} + \cdots + Y_{t+5} + \tfrac{1}{2}Y_{t+6}}{12}\]
matching the weights in the code below.
fltr <- c(0.5, rep(1, 11), 0.5)/12 # weights for the weighted average above

birthTrend <- stats::filter(birth, filter = fltr, method = 'convo', sides = 2) # use the ts object rather than the data frame
plot(birthTrend)
Now, if we increase the window’s width (24-point moving window), you should see a smoother curve.
fltr <- c(0.5, rep(1, 23), 0.5)/24 # weights for the moving average

birthTrend <- stats::filter(birth, filter = fltr, method = 'convo', sides = 2) # use the ts object rather than the data frame
plot(birthTrend)
What if we did a 5-point moving average instead of a 12 or 24-point moving average?
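A sketch of that comparison, reusing the filtering approach from above (fltr5 and birthTrend5 are names chosen here):

fltr5 <- rep(1, 5)/5 # equal weights for a 5-point (odd-length) window
birthTrend5 <- stats::filter(birth, filter = fltr5, method = 'convo', sides = 2)
plot(birthTrend5)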
We see the seasonality (repeating cycles) because the 5-point window averages less than half of a year’s data for each estimate, so the seasonal highs and lows don’t balance each other out.
You must be mindful of the window width for the moving average, depending on the context of the data and the existing cycles in the data.
Let’s see what is left over. Is the mean of the residuals constant across time (indicating that we have fully removed the trend)?
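The residuals are computed in a chunk that isn’t shown; a sketch of that step might look like the following, using Trend and Residual as the column names referenced later in this chapter.

birthTS <- birthTS %>%
  mutate(Trend = as.numeric(birthTrend), # 24-point moving average from above (NA at the ends)
         Residual = Value - Trend)

birthTS %>%
  ggplot(aes(x = DecimalTime, y = Residual)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  theme_classic()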
If we connect the points, we might see a cyclic pattern…
We still have to deal with seasonality (we’ll get there), but the moving average does a fairly good job at removing the trend because we see that the trend of the residuals is 0 on average!
The moving average filter did a decent job of estimating the trend because of its flexibility, but we can’t write down a function statement for that trend. This is the downside of nonparametric methods.
4.1.2.3 Differencing to Remove Trend
If we don’t necessarily care about estimating the trend for prediction, we could use differencing to explicitly remove a trend.
We’ll talk about first and second-order differences, which are applicable if we have a linear or quadratic trend, respectively.
Linear Trend - First Difference
If we have a linear trend such that \(Y_t = b_0 + b_1 t + \epsilon_t\), the difference between neighboring outcomes (1 time unit apart) will essentially remove that trend.
Take the outcome at time \(t\) and an outcome that lags behind one time unit:
\[Y_t - Y_{t-1} = (b_0 + b_1 t + \epsilon_t) - (b_0 + b_1 (t-1) + \epsilon_{t-1})\]
\[= b_0 + b_1 t + \epsilon_t - b_0 - b_1 t + b_1 - \epsilon_{t-1}\]
\[= b_1 + \epsilon_t - \epsilon_{t-1}\]
which is constant with respect to time, \(t\).
See below for a simulated example.
t <- 1:36
y <- ts(2 + 5*t + rnorm(36, sd = 10), frequency = 12)
plot(y)
plot(diff(y, lag = 1, differences = 1)) # first difference removes the linear trend
Quadratic Trend - Second Order Differences
If we have a quadratic relationship such that \(Y_t = b_0 + b_1 t+ b_2 t^2 + \epsilon_t\), then taking the second order difference between neighboring outcomes will remove that trend.
Take the outcome at time \(t\) and an outcome that lags behind one time unit (first difference):
\[(Y_t - Y_{t-1}) = (b_0 + b_1 t + b_2 t^2 + \epsilon_t) - (b_0 + b_1 (t-1) + b_2 (t-1)^2 + \epsilon_{t-1})\]
\[= b_0 + b_1 t + b_2 t^2 + \epsilon_t - b_0 - b_1 t + b_1 - b_2 t^2 + 2b_2 t - b_2 - \epsilon_{t-1}\]
\[= b_1 - b_2 + 2b_2 t + \epsilon_t - \epsilon_{t-1}\]
and thus, the second-order difference is the difference in the neighboring first differences:
\[(Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = (b_1 - b_2 + 2b_2 t + \epsilon_t - \epsilon_{t-1}) - (b_1 - b_2 + 2b_2 (t-1) + \epsilon_{t-1} - \epsilon_{t-2})\]
\[= 2b_2 + \epsilon_t - 2\epsilon_{t-1} + \epsilon_{t-2}\]
which is constant with respect to time, \(t\).
See below for an example of simulated data.
t <- 1:36
y <- ts(2 + 1.5*t + .3*t^2 + rnorm(36, sd = 10), frequency = 12)
plot(y)
plot(diff(y, lag = 1, differences = 1)) # first difference
plot(diff(y, lag = 1, differences = 2)) # second order difference
Let’s see what we get when we use differencing on the real birth count data. It looks like the first differences resulted in data with no overall trend despite the complex trend over time!
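The differencing step is not shown in the original; a minimal sketch, assuming the birth ts object from above (birthDiff is a name chosen here):

birthDiff <- diff(birth, lag = 1, differences = 1) # first differences of neighboring months
plot(birthDiff)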
4.1.3 In Practice: Estimate vs. Remove
We have a few methods to estimate the trend,
Parametric Regression: Global Transformations (polynomials)
Parametric Regression: Spline basis
Nonparametric: Local Regression
Nonparametric: Moving Average Filter
Among the estimation methods, two are parametric in that we estimate slope parameters or coefficients (regression techniques), and two are nonparametric in that we do not have any parameters to estimate but rather focus on the whole function.
Things to consider when choosing an estimation method:
How well does it approximate the true underlying trend? Does it systematically under- or overestimate the trend for certain periods, or are the residuals on average zero across time?
Do you want to use past trends to predict the future? If so, you may want to consider a parametric approach.
We have two approaches to removing the trend (a small sketch contrasting them follows this list),
Estimate, then Residuals. Estimate the trend first, then subtract the trend from observations to get residuals, \(y_t - \hat{f}(x_t)\).
Difference. Skip the estimating and try first or second-order differencing of neighboring observations.
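Here is a small sketch contrasting the two approaches on the birth data (residEstimate and residDifference are names chosen for illustration):

residEstimate <- birthTS$Value - predict(TrendSpline) # estimate the trend, then subtract it
residDifference <- diff(birth, differences = 1) # skip estimation; difference instead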
Things to consider when choosing an approach for removing the trend:
How well does it remove the underlying trend? Are the residuals (what is left over) close to zero on average, or are there still long-range patterns in the data?
Do you want to report on the estimated trend? Then estimate first.
If your main goal is prediction, try differencing.
4.2 Seasonality
Once we’ve estimated and removed the trend, we may notice cyclic patterns left in the data. Depending on the cyclic pattern’s form, we could use a few different techniques to model the seasonal pattern.
First, let’s visualize the variability in the seasonality between different years for the birth data set to see if there is a consistent cycle. To do so, we’ll work with the residuals after removing the trend (for now, let’s use the moving average filter). Since we are interested in the cycles, we want to know how the number of births varies across the months within a year.
Using the ggplot2 package, we can visualize this with geom_line() by plotting the residuals against the month and coloring each line by year, so that there is one line per year. Interestingly, August and September have the highest residuals (approximately nine months after winter break). In this case, we see that every year follows the same pattern.
birthTS %>%
  ggplot(aes(x = Month, y = Residual, color = factor(Year))) +
  geom_line() +
  theme_classic()
4.2.1 Parametric Approaches
Similar to the trend, we can use some of our parametric tools to model the cyclic patterns in the data using our linear regression structures.
4.2.1.1 Indicator Variables
To estimate seasonality that does not have a recognizable cyclic functional form (such as a sine or cosine curve), we can let each month (or appropriate time unit within a cycle) have its own average by including indicator variables for each month in a linear regression model.
As when we estimated the trend, we want to make sure we have fully estimated the seasonality, so we check the residuals to ensure that no seasonality remains in what is left over.
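A minimal sketch of that model and its residual check, assuming the Residual column from the moving average step above (IndicatorSeason and SeasonResid are names chosen here):

IndicatorSeason <- lm(Residual ~ factor(Month), data = birthTS) # factor(Month) creates an indicator for each month

birthTS %>%
  filter(!is.na(Residual)) %>%
  mutate(SeasonResid = Residual - predict(IndicatorSeason)) %>%
  ggplot(aes(x = DecimalTime, y = SeasonResid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  theme_classic()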
4.2.1.2 Sine and Cosine Curves
While it isn’t the case for this birth data set, if the cyclic pattern is like a sine curve, we could use a sine or cosine curve to model the relationship.
time <- seq(0, 2*pi, length = 100)

tibble(t = time,
       sin = sin(time),
       cos = cos(time)) %>%
  ggplot(aes(x = t, y = sin)) +
  geom_line() +
  geom_line(aes(x = t, y = cos), color = 'blue') +
  labs(x = 'x', y = 'f(x)') +
  theme_classic()
The period (length of 1 full cycle) of a sine or cosine function is \(2\pi\).
We can write a sine curve model as \(A\sin(2\pi ft + s)\) where \(A\) is the amplitude (peak deviation from 0 instead of 1), \(f\) is the frequency (number of cycles that occur each unit of time), \(t\) is the time, and \(s\) is the phase shift (a shift along the x-axis).
Now, let’s try to apply it to the birth data.
SineSeason <- lm(Residual ~ sin(2*pi*(DecimalTime)), data = birthTS) # frequency is one because a unit in DecimalTime is one year
SineSeason
Call:
lm(formula = Residual ~ sin(2 * pi * (DecimalTime)), data = birthTS)
Coefficients:
(Intercept) sin(2 * pi * (DecimalTime))
-0.04286 -14.61751
# Estimate Trend + Seasonality
birthTS %>%
  mutate(SeasonSine = predict(SineSeason, newdata = data.frame(DecimalTime = birthTS$DecimalTime))) %>%
  ggplot(aes(x = DecimalTime, y = Value)) +
  geom_point() +
  geom_line() +
  geom_line(aes(x = DecimalTime, y = Trend + SeasonSine), color = 'blue') +
  theme_classic()
Since we still see a pattern in these residuals, using a sine/cosine curve is not an accurate way to estimate the cyclic pattern in this particular data set.
Data Example 3
Below is the daily high temperature data in San Francisco in 2011. We only have one year, but we can see that the cycle shape is closer to a cosine curve. The period (the length of one cycle) of a cosine or sine curve is \(2\pi\), so we need to adjust the period when we fit a cosine or sine curve to our data. In the case below, the period is 365 since our time variable is DayinYear, where 1 unit is one day in the year (Jan 1 is 1 and Dec 31 is 365).
CosSeason <- lm(High ~ cos(2*pi*(DayinYear - 60)/365), data = sfoWeather) # frequency is 1/365 because a unit in DayinYear is 1 day
CosSeason
Call:
lm(formula = High ~ cos(2 * pi * (DayinYear - 60)/365), data = sfoWeather)
Coefficients:
(Intercept) cos(2 * pi * (DayinYear - 60)/365)
64.249 -7.022
sfoWeather %>%
  mutate(Season = predict(CosSeason)) %>%
  ggplot(aes(x = DayinYear, y = High)) +
  geom_point() +
  geom_line(aes(x = DayinYear, y = Season), color = 'blue', linewidth = 1.5) +
  theme_classic()
To recap:
For a cosine function,
\[A\cos(Bx + C) + D\]
the period is \(\frac{2\pi}{|B|}\), the amplitude is \(|A|\) (the absolute value of the coefficient in our linear model), the phase shift is \(-C/B\) (the 60-day shift included in our model above), and the vertical shift is \(D\) (the intercept in our linear model).
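As a quick check against the San Francisco model above: there \(B = \frac{2\pi}{365}\) and \(C = -\frac{2\pi \cdot 60}{365}\), so the period is \(\frac{2\pi}{|B|} = 365\) days and the phase shift is \(-C/B = 60\) days, matching the DayinYear - 60 term in the fitted model.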
4.2.2 Nonparametric Approaches
As with the trend, we could use local regression or moving averages to estimate the trend and seasonality together, as long as we define the local area to be small enough to pick up the cycles.
The disadvantage of this approach is that we will not be able to use the estimate to make predictions in the future because we do not learn general patterns of trend and seasonality.
4.2.2.1 Differencing to Remove Seasonality
Like removing a trend, we can use differencing to remove seasonality if we don’t need to estimate it. Rather than taking differences between neighboring observations, we take differences between observations that are approximately one period apart. If the period (the length of one cycle) is \(k\) observations, so that \(y_t\) is similar to \(y_{t-k}\) and \(y_{t-2k}\), then we take differences between observations that are \(k\) lags apart, \(y_t - y_{t-k}\).
For example, if you have monthly data and an annual cycle, you’d want to take differences between observations that are 12 lags apart (12 months apart). See the simulated monthly data below.
t <- 1:48
y <- ts(2 + 20*sin(2*pi*t/12) + rnorm(48, sd = 5), frequency = 12)
plot(y)
plot(diff(y, lag = 12, differences = 1)) # first order difference of lag 12
Let’s try this approach with the birth data since we saw annual seasonality by month. First, we take first differences of neighboring values to remove the trend. Then we’ll take the first differences of lag 12 to remove the seasonality. No obvious trends or seasonality remain in the data, and it is centered around 0.
plot(diff(diff(birth, differences = 1), lag = 12, differences = 1))
What is left over, the random errors, is what we need to worry about now.
We’ll first talk about how we might model the errors from time series data and then talk about translating these ideas to other types of correlated data.
Chatfield, Chris, and Haipeng Xing. 2019. The Analysis of Time Series: An Introduction with R. Chapman & Hall/CRC.