# Yield Curve PCA

## Summary

Term or interest rate models are often simplified to the 10 year yield less a short-term (e.g., 3 month) yield. However, the dynamics of the yield curve aren’t that simple: the curve can shift up and down in a parallel fashion, it can steepen or flatten, and it can twist in multiple directions. The challenge with incorporating all the key tenors of the curve is correlation, if we simply use all the key tenors as explanatory variables we’d introduce the problem of multicolinarity to our model. Even if we use a nice LASSO technique we’ll likely only bring one tenor into our model because of the correlation between tenors. PCA is a good tool for this problem. We can use PCA to create orthogonal factors in our interest rate model. I’ll show how to extract PCA factors from the key tenors of the yield curve and an example of a factor model to examine a fixed income mutual fund.

## Required packages and utility functions

# required packages ----
library(tidyverse)
library(lubridate)
library(reshape2)
library(quantmod)
library(iilasso)

# utility functions ----

ticker, ".csv")
temp_file <- tempfile()
d <- read.csv(temp_file, na.strings = ".")
d <- d %>%
mutate(DATE = as.Date(DATE))
return(d)
}

# convert price to return
toDelta <- function(x) {
as.vector(x) / lag(as.vector(x)) - 1
}

# truncate outlier yield changes
truncateDelta <- function(x, ub, lb) {
x[x > ub] <-NA
x[x < lb] <- NA
return(x)
}

We’ll pull the 3 month, 6 month, and 1, 2, 3, 5, 7, 10, and 20 year yields from FRED, the St. Louis Federal Reserve’s great economic database. The yield time-series range from 1993 to present (cut off at year-end 2018 at the time of this post). The code below downloads the key tenors by ticker, arranges the time-series in a matrix-like data.frame organized by columns, converts the yield levels to month over month percent changes, and cleans the outliers (e.g., if the yield goes from 0% to 0.25% the change is not useful). After transforming the time-series to month over month change, the code runs the PCA through the R stats function princomp. The scores or loadings of the PCA are stored in the pc_yield_fact data.frame. Before we use the scores we’ll investigate each component to see what kind of ‘factors’ we have.

# yield tickers, 3 and 6 month, 1, 2, 3, 5, 7, 10, and 20 year
yield_tick <- c('DGS3MO', 'DGS6MO', 'DGS1', 'DGS2', 'DGS3', 'DGS5', 'DGS7',
'DGS10', 'DGS20')

# download and merge into data.frame organized by columns like a matrix
reduce(full_join, by = 'DATE')
colnames(yield_mat) <- c('date', yield_tick)

# convert to monthly, take percent change in yield, clean up, and run PCA
yield_mat <- yield_mat %>%
na.omit() %>% # remove missing values before changing to monthly freq
group_by(x = strftime(date, '%Y-%m')) %>% # change to monthly freq
filter(date == max(date)) %>% # by filtering to last day of each month
ungroup() %>% # ungroup so date vectors can be removed in the next pipe
select(-x) %>%
mutate_at(vars(-date), toDelta) %>% # level to percent change
mutate_at(vars(-date), truncateDelta, lb = -0.50, ub = 0.50) %>% # truncate outliers
na.omit() %>%
mutate(date = ceiling_date(date, 'months') - 1) %>%
filter(date < as.Date('2019-01-01'))

# PCA output
p <- yield_mat %>%
select(-date) %>%
princomp()

# PCA factors
pc_yield_fact <- data.frame(date = yield_mat$date, p$scores)

## Evaluating the Components

d <- data.frame(p$loadings[,], tenor = rownames(p$loadings[,])) %>%
melt(id = 'tenor') %>%
mutate(tenor = factor(tenor, levels = yield_tick))
ggplot(data = d, aes(x = tenor, y = value)) +
geom_bar(stat = 'identity') +
facet_wrap(variable ~.) +
theme(axis.text.x = element_text(angle = 90)) The first component is a negative loading on yield changes across tenors. Our yield curve data set ranges from 1993 to 2018. It makes sense the biggest driver of explaining the curve is a general downward movement in rates. The second component looks like a flattening of the curve exhibited by the short-end rising and the long end dropping. Although, there’s a slight twist here as well as the middle of the curve drops quicker than the long-end. The third component appears to be a more pronounced twist with a falling, rising, flat, falling pattern. The first three components explain nearly 80% of the total variance (graph below). This is intuitive, the classic yield curve shifts are parallel, steeping / flattening, and twisting. The later components aren’t as useful in explaining the yield curve dynamics but could be useful in explaining a bond portfolio. For example the 9th component looks like a long 7s and 20s while short 10s bet.

d <- data.frame(comp = 1:9, var_explained = cumsum(p$sdev) / sum(p$sdev))
ggplot(data = d, aes(x = comp, y = var_explained)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = 1:9) +
scale_y_continuous(labels = scales::percent) +
xlab('Component') + ylab('Portion of Variance Explained') ## Factor Analysis Example

The code below shows an example of how to use the PCA scores to build a factor model for a bond portfolio. The FRED data library has several credit spreads, we’ll grab the ICE BofAML U.S. Corporate Master Option-Adjusted Spread to give us a general sense of the credit risk the manager is (or isn’t) taking. We’ll also take the 10 year and 3 month yields from our yield data.frame to compare the PCA factor model to a simpler 10 year less 3 month representation of the term factor (i.e., interest rate risk). We’ll use the Dodge and Cox Income fund (DODIX) as our example bond fund. The DODIX time-series is downloaded from Yahoo finance, the getSymbols.yahoo function returns OHLC time-series. We’re interested in the 6th column that contains closing prices adjusted for corporate actions.

Using all nine components probably isn’t necessary. The spirit of Occam’s razor applies to factor models: all thing being equal, simpler explanations are better. We’ll use the nice iilasso package to select our explanatory variables. There’s much to be said of LASSO models and variable selection that I’m glossing over here. Check out the iilasso CRAN page for more reading and credit to Takada, Suzuki, & Fujisawa (2018).

# download and clean the credit spread time-series
rename(date = DATE, credit = VALUE) %>%
na.omit() %>% # remove missing values before changing to monthly freq
group_by(x = strftime(date, '%Y-%m')) %>% # change to monthly freq
filter(date == max(date)) %>% # by filtering to last day of each month
ungroup() %>% # ungroup so date vectors can be removed in the next pipe
select(-x) %>%
mutate_at(vars(-date), toDelta) %>% # level to percent change
mutate(date = ceiling_date(date) - 1) %>%
mutate(date = as.Date(date))

# 10 year yield for comparison
term_10 <- yield_mat %>% select('date', 'DGS10', 'DGS3MO')

fund_raw <- getSymbols.yahoo('DODIX', from = '1970-01-01',
to = '2018-12-31', periodicity = 'm',
auto.assign = FALSE)[, 6]
df <- data.frame(date = zoo::index(fund_raw), value = fund_raw, row.names = NULL)
colnames(df) <- c('date', 'value')
fund <- df %>%
mutate(value = toDelta(value)) %>%
mutate(date = ceiling_date(date, 'months') - 1)

# combine explanatory (factors) and indepedent (fund) variables
fact_data <- fund %>%
na.omit() %>%
merge(pc_yield_fact, by = 'date') %>%
merge(term_10, by = 'date') %>%
merge(credit, by = 'date')%>%
na.omit() %>%
mutate(value = value - DGS3MO, DGS10 = DGS10 - DGS3MO) %>%
select(-date, -DGS3MO)

# credit plus 10 year yield model
fact_data_term_10 <- fact_data %>%
select(value, DGS10, credit)

# credit plus term PCA model
fact_data_pc <- fact_data %>%
select(-DGS10)

# seperate X and y variables and convert to matrix for iilasso function
X <- as.matrix(fact_data_pc[, -1])
y <- as.matrix(fact_data_pc[, 1])
cv_fit <- cv_lasso(X, y)
fit <- cv_fit$fit is_fact <- abs(fit$beta[, cv_fit\$lambda.1se.index]) > 0.01

# run linear regressions for both models
lm_pc_fit <- lm(value ~., data = fact_data_pc[, c(TRUE, is_fact)])
lm_10_fit <- lm(value ~., data = fact_data_term_10)

# quick check to make sure the variables are uncorrelated
cor(fact_data_pc[, c(FALSE, is_fact)]) %>% round(digits = 2)
##        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 credit
## Comp.1   1.00  -0.02  -0.01   0.01   0.03   0.36
## Comp.2  -0.02   1.00  -0.01  -0.05   0.04   0.06
## Comp.3  -0.01  -0.01   1.00   0.01   0.01   0.18
## Comp.4   0.01  -0.05   0.01   1.00   0.04   0.07
## Comp.5   0.03   0.04   0.01   0.04   1.00   0.00
## credit   0.36   0.06   0.18   0.07   0.00   1.00
cor(fact_data_term_10[, -1]) %>% round(digits = 2)
##        DGS10 credit
## DGS10   1.00   0.01
## credit  0.01   1.00
summary(lm_10_fit)
##
## Call:
## lm(formula = value ~ ., data = fact_data_term_10)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.31951 -0.04041  0.00356  0.04344  0.27006
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001542   0.004598   0.335    0.738
## DGS10       0.864171   0.026828  32.211  < 2e-16 ***
## credit      0.241167   0.047031   5.128 6.22e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06987 on 230 degrees of freedom
## Multiple R-squared:  0.8227, Adjusted R-squared:  0.8212
## F-statistic: 533.7 on 2 and 230 DF,  p-value: < 2.2e-16
summary(lm_pc_fit)
##
## Call:
## lm(formula = value ~ ., data = fact_data_pc[, c(TRUE, is_fact)])
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.034691 -0.008589 -0.001379  0.007551  0.043979
##
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  0.0086715  0.0008399   10.325  < 2e-16 ***
## Comp.1       0.3195163  0.0033675   94.883  < 2e-16 ***
## Comp.2      -0.7780707  0.0049147 -158.315  < 2e-16 ***
## Comp.3       0.3789400  0.0095631   39.625  < 2e-16 ***
## Comp.4       0.2921111  0.0131281   22.251  < 2e-16 ***
## Comp.5      -0.1724189  0.0159186  -10.831  < 2e-16 ***
## credit      -0.0508252  0.0094417   -5.383 1.83e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01277 on 226 degrees of freedom
## Multiple R-squared:  0.9942, Adjusted R-squared:  0.994
## F-statistic:  6437 on 6 and 226 DF,  p-value: < 2.2e-16

The simple 10 year term and credit model results in a good fit. The adjusted r-squared is 0.82 and the residual or intercept is not significant. The positive loading on the month over month change in credit spread is surprising. The fund has some exposure to lower-tier investment grade bonds and had a slight drawdown during 2008 / 2009 when the credit spread was spiking. I expected a negative relationship: rising credit spreads detracting from performance and decreasing spreads boosting performance.

The PCA model boosts the adjusted r-squared to 0.99 and has a slightly negative credit beta. However, the residual is significant with a high t-value and annualizes to around 10%. The higher r-squared and intuitive credit loading are clear improvements over the 10 year yield approximation of interest rate risk. The high residual leaves me believe that model is incomplete, perhaps currency or an MBS-like factor is missing. Regardless, I think the PCA model is an improvement here and better captures the dynamic of the yield curve.