Clustering Around Latents

Uncovering asset classes

Asset class taxonomy is similar to tax theory in that it’s both profoundly boring and fascinating at the same time. I’ll stick with what I think are the fascinating aspects: asset class grouping has a critical effect on portfolio outcomes because it shapes allocation and diversification (or lack thereof). Traditionally, assets are grouped by manifest characteristics. This is a good thing: we want clear intuition behind our asset class distinctions. Corporate debt is different from common stock because of its place in the capital structure and its sensitivity to interest rates. Sub-asset class distinctions follow the same logic: stocks of smaller companies are different from those of larger companies because they have different sensitivities to the various stages of the economic cycle. Again, having clear economic intuition behind asset and sub-asset distinctions is a good thing, but it should be balanced against some quantitative rigor.

Clustering around latent variables (CLV) combines the insights gained from partitioning (similar to k-means) with the dimensionality reduction of principal component analysis (PCA). Evelyne Vigneau, Mingkun Chen and El Mostafa Qannari shared a very useful R package, ClustVarLV, to perform CLV. The package is thoughtfully and thoroughly introduced in the December 2015 R Journal. My goal here is to show how the CLV framework can be applied to asset allocation.

Let’s start by getting some ETF return time-series to approximate our sub-asset classes.

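# required packages --------------------------------------------------
library(quantmod)    # getSymbols.yahoo(); also attaches xts for lag.xts()
library(ClustVarLV)  # CLV(), plot_var()
# --------------------------------------------------------------------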

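# some risky assets via ETFs (order must match asset_name below:
# IWB tracks the Russell 1000 large caps, IWM the Russell 2000 small caps)
tix <- c('IWB', 'IWM', 'EFA', 'VWO', 'TLT',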
         'EDV', 'EMB', 'AGG', 'JNK', 'LTPZ', 
         'GSG', 'AMLP', 'GLD', 'VNQ', 'MNA', 
         'MCRO', 'QAI', 'WTMF')

asset_name <- c('US Large Cap', 'US Small Cap', 'EAFE', 
                'EM Mkts Equity', 'US Gov\'t Bonds', 'Int\'l Gov\'t Bonds', 
                'EM Mkts Bonds', 'US Agg Bonds', 'US HY Bonds', 'US TIPS', 
                'Commodities', 'MLP', 'Gold', 'REITs',
                'Merger Arb', 'Global Macro', 'Multi-Strat', 'CTA')

price_dat <- lapply(tix, "getSymbols.yahoo", from = "1970-01-01", 
                    to = "2018-04-28", periodicity = "weekly", 
                    auto.assign = FALSE)
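price_mat <- do.call("cbind", price_dat)
# each symbol contributes 6 columns; every 6th one is the adjusted close
price <- price_mat[, seq(from = 6, to = ncol(price_mat), by = 6)]

# simple weekly returns; na.omit() keeps only weeks where every ETF has data
ret <- price / lag.xts(price, 1) - 1
ret <- na.omit(ret)
colnames(ret) <- asset_name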

Let’s start clustering

Ok, now that we have our returns in a matrix organized by columns (I prefer working with xts, but you don’t necessarily need it here), we can start using the CLV functions. We’ll use the aptly titled function CLV to start our hierarchical partitioning. We call CLV with our time-series matrix ret and three additional arguments: method, sX, and nmax. Setting method to “local” makes the sign of the covariance matter, as opposed to “directional”, which works with squared covariance and ignores the sign (i.e., with “local”, negatively correlated returns should end up in different groups). Setting sX to FALSE stops the algorithm from standardizing our returns (i.e., we don’t want the z-score of the returns). Finally, nmax is set to 8 so that only partitions of up to 8 groups are considered.

# clustering algo
clv_res <- CLV(ret, method = "local", sX = FALSE, nmax = 8)
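To make the “local” versus “directional” distinction concrete, here is a minimal base R sketch on simulated data. It is my simplified reading of the two criteria (signed covariance with a unit-variance group centroid versus squared covariance with the group’s first principal component), not the package’s internal code, and the helper group_score is a made-up name for illustration.

```r
# Toy illustration (not the package internals): two mirror-image series.
set.seed(42)
n  <- 500
f  <- rnorm(n)
x1 <-  f + 0.3 * rnorm(n)
x2 <- -f + 0.3 * rnorm(n)   # strongly negatively correlated with x1

# Criterion contribution of one group X (columns = variables).
# group_score is a hypothetical helper, assumed for this sketch only.
group_score <- function(X, method = c("local", "directional")) {
  method <- match.arg(method)
  X <- as.matrix(X)
  if (method == "local") {
    latent <- rowMeans(X)            # group centroid
    latent <- latent / sd(latent)    # scaled to unit variance
    sum(cov(X, latent))              # signed covariances
  } else {
    latent <- prcomp(X)$x[, 1]       # first principal component scores
    latent <- latent / sd(latent)    # scaled to unit variance
    sum(cov(X, latent)^2)            # squared covariances
  }
}

# Score of putting x1 and x2 in one group, relative to keeping them apart.
together <- cbind(x1, x2)
ratio <- sapply(c(local = "local", directional = "directional"), function(m) {
  group_score(together, m) / (group_score(x1, m) + group_score(x2, m))
})
print(round(ratio, 2))
```

Under these assumptions the directional score barely drops when the two mirror-image series share a group, while the local score collapses. That is the behavior we want for diversification work: an asset that moves opposite to equities should not be filed with equities.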


One of the first things I like to check during a hierarchical clustering analysis is the dendrogram. CLV has a nice dendrogram plot built in for the list output of the CLV function, which I’ve named clv_res. I’m going to set the bottom margin of the plot slightly larger than the default to leave room for our sub-asset class names.

# check dendrogram
opar <- par(mar = c(15, 4.1, 4.1, 2.1))  # widen bottom margin; par() returns the old settings
plot(clv_res, type = 'dendrogram')

par(opar)  # restore the previous margins

We can learn much from dendrograms. If we were to rotate this chart 90 degrees clockwise it would look like a funky tournament bracket. As tempting as that might be, it’s probably best to interpret the graph in its current state. Starting from the top we can see two big groups of assets: reading left to right, group 1 is US Large Cap to MLP and group 2 is US Gov’t Bonds to Gold. Group 1 then breaks out into two sub-groups: US Large Cap to US HY Bonds, and Commodities to MLP. Group 2 has a sub-group of Gov’t Bonds and TIPS, and then Gold breaks out by itself.

The distances between assets and the legs (or lines) above them are meaningful. Notice US Large Cap and Small Cap are next to each other and have short legs. This means Small and Large Cap equities are highly correlated and don’t offer much diversification when combined in a portfolio. However, the leg above the group of US Large Cap to MLP is much longer, which implies that a mix of this group would gain substantial diversification benefits when combined with the US Gov’t Bonds to Gold group.

Please keep in mind some caveats. The key input into the clustering algorithm is covariance, and CLV is set up to calculate historical covariance, so in our example we’re looking at historical covariance with an inception of 2011 (na.omit trims the sample to the shortest ETF history). It’s worth some time and effort to bend (or possibly rebuild) our CLV function so it can handle robust covariance estimation. Or perhaps it’s even better to set up a multi-level model that first estimates covariance for financial time-series in our preferred way and then passes the covariance matrix into the clustering model.

Regardless of how we estimate diversification, another thing we need to be cautious of is poor returns appearing diversifying (especially in a bull market). In our dendrogram Gold appears very diversifying by itself: it has a long leg directly above its label at the bottom. When we investigate the performance of an investment in the Gold ETF over this time period:

plot(cumprod(1 + ret[, "Gold"]), main = "GLD Cumulative Wealth Index")

we see the big drawdown from the August 2011 debt ceiling crisis to January 2016. This is an important lesson. Diversification and volatility reduction are great, but don’t forget the power of the first moment: a line item in a prolonged drawdown can look very diversifying.
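If you’d rather put a number on the drawdown than eyeball the chart, a small base R helper does the job. This is a sketch with a made-up return vector; max_drawdown is a name I’m introducing here, and with the real data you would presumably pass something like as.numeric(ret[, "Gold"]).

```r
# Maximum drawdown of a simple-return series (hypothetical helper).
max_drawdown <- function(r) {
  wealth   <- c(1, cumprod(1 + r))        # wealth index starting at 1
  drawdown <- wealth / cummax(wealth) - 1 # distance below the running peak
  min(drawdown)                           # most negative value
}

# Made-up weekly returns, purely for illustration.
r <- c(0.10, -0.05, -0.10, 0.02, 0.20)
print(max_drawdown(r))
```

Here the wealth index peaks at 1.10 after the first week and bottoms at 0.9405 after the third, so the function reports a maximum drawdown of 14.5%.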

How many clusters?

One of the great things about the CLV package is its built-in functions for determining the optimal number of groups. This decision is the crux of the clustering model specification. Our goal is to get nice groups of assets that are reasonably orthogonal to the other groups and make economic sense. If we specify too many groups we’ll lose our orthogonality and have clusters that aren’t meaningfully different from their neighbors. If we specify too few groups we’ll miss out on opportunities to make key distinctions that help diversify our portfolio.

Let’s follow the delta plot the package authors recommend.

plot(clv_res, type = 'delta')

This graph shows the variation in the clustering criterion when going from K to K - 1 clusters. We can see this value is fairly flat until we go from 2 clusters to 1. Put another way, going from 1 to 2 clusters is meaningful, while the additional cluster separations (2 to 3, 3 to 4, and so forth) aren’t. According to the variation of criterion (delta plot) method, 2 clusters is our optimal partition.
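The same “look for the big jump” logic can be sketched with base R’s hclust on a correlation distance. To be clear, this swaps in ordinary average-linkage clustering for CLV and uses simulated data with two known blocks, so treat it as a rough analogue. It also shows one way to estimate the correlation matrix first and then hand it to a clustering step.

```r
# Simulated returns: 4 series driven by factor f1, 3 series driven by f2.
set.seed(1)
n  <- 500
f1 <- rnorm(n); f2 <- rnorm(n)          # two independent latent factors
X  <- cbind(sapply(1:4, function(i) f1 + 0.5 * rnorm(n)),
            sapply(1:3, function(i) f2 + 0.5 * rnorm(n)))
colnames(X) <- c(paste0("A", 1:4), paste0("B", 1:3))

R  <- cor(X)                            # any preferred estimator could go here
d  <- as.dist(sqrt(2 * pmax(1 - R, 0))) # correlation distance
hc <- hclust(d, method = "average")     # plain hierarchical clustering, not CLV

jumps  <- diff(hc$height)               # change in merge height at each step
groups <- cutree(hc, k = 2)             # cut the tree at two clusters
print(round(jumps, 2))
print(groups)
```

In this toy setup the last jump dwarfs the others, which is the hclust counterpart of the delta plot staying flat until the move from 2 clusters to 1.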

Let’s take a look at our 2 groups of assets.

summary(clv_res, 2)
## $number
## clusters
##  1  2 
## 12  6 
## $groups
## $groups[[1]]
##                cor in group  cor next group
## US Small Cap           0.90           -0.35
## EAFE                   0.88           -0.29
## EM Mkts Equity         0.87           -0.14
## US Large Cap           0.86           -0.37
## US HY Bonds            0.77           -0.07
## Multi-Strat            0.76            0.02
## REITs                  0.72            0.06
## Global Macro           0.71            0.10
## MLP                    0.65           -0.14
## Commodities            0.62           -0.15
## EM Mkts Bonds          0.60            0.28
## Merger Arb             0.50           -0.10
## $groups[[2]]
##                   cor in group  cor next group
## US Gov't Bonds            0.93           -0.33
## Int'l Gov't Bonds         0.91           -0.35
## US TIPS                   0.86           -0.03
## US Agg Bonds              0.84           -0.08
## Gold                      0.52            0.16
## CTA                       0.16           -0.05
## $set_aside
## $cormatrix
##       Comp1 Comp2
## Comp1  1.00 -0.19
## Comp2 -0.19  1.00

These groups look solid. We’ve got one group of assets that share equity risk factors and another that share term risk factors. In the equity group Merger Arb appears on the fringe of being a separate group, and in the fixed income group Gold and CTAs look like they could also form a separate group. However, taken as a whole, the two groups have a nice low correlation of -0.19 to each other.

There are many benefits to understanding these two groups. Two related advantages that jump out are diversification decisions and dimensionality reduction in covariance estimation. With respect to creating balanced portfolios, we can see that rebalancing within groups will not be as effective as balancing between groups. Furthermore, we could recalculate a patterned correlation matrix where each asset has a single correlation (e.g., 1 or the average in-group correlation) to all the members of its group and a correlation of zero to the other group(s).
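The patterned correlation matrix is easy to sketch in base R. The function name patterned_cor, the five assets, and their correlations below are all made up for illustration. With the real analysis the membership vector would come from the clustering result (I believe the package’s get_partition function returns it, but check the documentation).

```r
# Replace each within-group correlation by the group average and
# set every between-group correlation to zero (hypothetical helper).
patterned_cor <- function(R, groups) {
  P <- matrix(0, nrow(R), ncol(R), dimnames = dimnames(R))
  for (g in unique(groups)) {
    idx   <- which(groups == g)
    block <- R[idx, idx, drop = FALSE]
    avg   <- if (length(idx) > 1) mean(block[upper.tri(block)]) else 1
    P[idx, idx] <- avg
  }
  diag(P) <- 1
  P
}

# Made-up correlation matrix for five assets in two groups.
assets <- c("A", "B", "C", "D", "E")
R <- diag(5); dimnames(R) <- list(assets, assets)
R["A", "B"] <- 0.8; R["A", "C"] <- 0.6; R["B", "C"] <- 0.7
R["D", "E"] <- 0.5
R["A", "D"] <- -0.2; R["B", "E"] <- 0.1
R[lower.tri(R)] <- t(R)[lower.tri(R)]   # mirror the upper triangle

groups <- c(1, 1, 1, 2, 2)
P <- patterned_cor(R, groups)
print(P)
```

Instead of ten pairwise correlations we now carry two numbers, one per group, which is the dimensionality reduction mentioned above. Whether zero is the right between-group value is a modeling choice; the -0.19 between our two components suggests it is only an approximation.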

As a final check, I always enjoy a good scatter plot of each ETF’s loading on the first two components of the PCA. Fortunately, the CLV package has a built-in plot for this, plot_var. I’ll follow up with more posts expanding on some of the topics mentioned here: robust correlation / covariance estimation and how to think about weighting assets once we’ve formed our groups.

plot_var(clv_res, K = 2)
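For anyone who wants to see roughly what sits behind that picture, here is a base R approximation using prcomp on simulated data. It is not the package’s plotting code, and the group labels are assumed known rather than estimated.

```r
# Simulated returns: 4 series driven by factor f1, 3 series driven by f2.
set.seed(7)
n  <- 500
f1 <- rnorm(n); f2 <- rnorm(n)
X  <- cbind(sapply(1:4, function(i) f1 + 0.5 * rnorm(n)),
            sapply(1:3, function(i) f2 + 0.5 * rnorm(n)))
colnames(X) <- c(paste0("A", 1:4), paste0("B", 1:3))
groups <- rep(1:2, times = c(4, 3))     # assumed group membership

# Covariance-based PCA (no scaling, in the spirit of sX = FALSE).
pca      <- prcomp(X, center = TRUE, scale. = FALSE)
loadings <- pca$rotation[, 1:2]         # loadings on the first two components

# Scatter plot of the loadings, colored by group.
plot(loadings, col = groups, pch = 19,
     xlab = "PC1 loading", ylab = "PC2 loading")
text(loadings, labels = rownames(loadings), pos = 3, cex = 0.8)
abline(h = 0, v = 0, lty = 3)
```

Assets that belong together should land close to each other on this plot, and well-separated groups should point in clearly different directions.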