2 Forecasting Methods

2.1 Quantile Models

When forecasting quantiles of a random variable \(Y\), we must pay close attention to the assumptions behind the models being used. For instance, a standard Value at Risk (VaR) model used to estimate a quantile \(q_\alpha\) at level \(\alpha\) relies on the assumption of stationary and normally distributed asset returns, an assumption that tends to underestimate the probability of large losses. This is especially problematic for volatile cryptocurrency returns, which exhibit both volatility clustering and fat tails, as evidenced in Figures 2.1 and 2.2.⁶ In this chapter we present statistical models and estimation methods that can be used to estimate \(\text{VaR}_\alpha^Y\), which we will often denote as \(\text{VaR}_\alpha\) for simplicity. In the following section, we compare the results obtained using common VaR forecasting models, such as GARCH models, to those obtained using quantile regression models of the \(\text{CAViaR}\) family proposed by Engle and Manganelli (2004). Here \(\boldsymbol{y} = (y_1, y_2, \dots, y_T)\) denotes a numeric vector of log-returns and we adopt the following definitions from Ruppert and Matteson (2011):

\[\begin{align} \text{VaR}(\alpha) = \inf \{ x: \Pr\left( \mathcal{L} > x\right) \leq \alpha \} \tag{2.1} \end{align}\]

and,

\[\begin{align} \text{ES}(\alpha) = \frac{\int_0^\alpha \text{VaR}(u) du}{\alpha} \tag{2.2} \end{align}\]

where \(\mathcal{L}\) in Equation (2.1) is the loss over a period \(T\) and \(\text{VaR}(\alpha)\) denotes the Value at Risk quantile at level \(\alpha\), also written \(\text{VaR}_\alpha\). This value corresponds to a quantile of the return vector \(\boldsymbol{y}\). Equation (2.2) defines the expected shortfall, i.e., the average loss given that the loss exceeds \(\text{VaR}_\alpha\). We now discuss different methods used for forecasting \(q_\alpha\).

Figure 2.1: Historical returns (BTCUSD), prices from Yahoo! Finance.

Figure 2.2: Fat tailed returns (BTCUSD), prices from Yahoo! Finance.

Using Equation (2.1) and values for \(\alpha \in (0, 1)\), we can compare the different values obtained for \(\hat{q}_\alpha\) using a normal and a Student-t distribution vs. the empirical quantiles as explained next.

2.1.1 Nonparametric

The first way we will try to get an estimate of \(\text{VaR}_\alpha\) is by using the empirical quantile function \(\hat{q}(\alpha)\), which can easily be obtained with the R quantile function. Nonparametric methods do not require assumptions about the distribution of \(Y\). The following two expressions are used to compute the Value at Risk and expected shortfall in this case:

\[\begin{align} \widehat{\text{VaR}}(\alpha) &= - \hat{q}(\alpha) \\ \widehat{\text{ES}}(\alpha) &= - \frac{ \sum_{t=1}^T y_t \cdot I \{y_t < \hat{q}(\alpha) \} } { \sum_{t=1}^T I\{ y_t < \hat{q}(\alpha) \} } \end{align}\]
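As a minimal illustration in base R (a sketch only; we assume y holds the vector of log-returns defined above and the variable names are ours):

# Empirical (nonparametric) VaR and ES at level alpha
alpha  <- 0.05
q_hat  <- quantile(y, probs = alpha)   # empirical alpha-quantile of the returns
VaR_np <- -q_hat                       # VaR expressed as a positive loss
ES_np  <- -mean(y[y < q_hat])          # average return beyond the quantile, sign flipped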

2.1.2 Parametric

The standard parametric technique makes use of the normal distribution. Based on a sample vector of size \(T\), we model the returns as independent draws from a normal distribution with fixed mean \(\mu\) and variance \(\sigma^2\), i.e., \(Y \overset{\text{iid}}{\sim} N(\mu, \sigma^2)\). Under this assumption, the quantile is \(q_\alpha = \mu + \sigma \Phi^{-1}(\alpha)\) and we compute the Value at Risk using the well-known expression:

\[\begin{align} \widehat{\text{VaR}}(\alpha) = -\left( \hat{\mu} + \hat{\sigma} \times \Phi^{-1}(\alpha) \right) \end{align}\]

where \(\Pr(Y \leq q_\alpha) = \alpha\) and \(\Phi^{-1}(\alpha)\) is the inverse of the cumulative distribution function (c.d.f.) of a standard normal distribution. To obtain \(\Phi^{-1}(\alpha)\), we can use the qnorm R function, whose first parameter is the probability \(\alpha\). Table 2.1 and Figure 2.3 show the quantiles estimated for values of \(\alpha \in \{0.01, 0.02, \dots, 0.99 \}\) using two standard parametric methods vs. the empirical quantiles. In the case of the normal distribution, the unbiased estimators below are commonly used; the Maximum Likelihood (MLE) estimator of \(\sigma^2\) is biased in finite samples but converges to the unbiased estimator when \(T \to \infty\).

\[\begin{align} \hat{\mu} &= \frac{1}{T} \sum_{t=1}^T y_t \tag{2.3}\\ \hat{\sigma}^2 &= \frac{1}{T-1} \sum_{t=1}^T (y_t - \hat{\mu})^2 \end{align}\]

where the mean \(\mu\), estimated in Equation (2.3), is frequently assumed to be \(0\) for high-frequency financial applications. The expected shortfall \(\text{ES}_\alpha (Y)\) is defined as

\[\begin{align} \text{ES}_\alpha (Y) = \frac{1}{\alpha} \int_0^\alpha \text{VaR}_\gamma (Y) d \gamma, \end{align}\]

which translates into the following closed-form expression in the case of a normal distribution:

\[\begin{align} \widehat{\text{ES}}(\alpha) = -\hat{\mu} + \hat{\sigma} \frac{\phi(\Phi^{-1}(\alpha))}{\alpha} \tag{2.4} \end{align}\]
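These expressions translate directly into base R; the following is a sketch only, assuming y holds the log-returns (dnorm and qnorm are the standard normal p.d.f. and quantile functions):

# Parametric (normal) VaR and ES at level alpha
alpha   <- 0.05
mu_hat  <- mean(y)                                           # Equation (2.3)
sig_hat <- sd(y)                                             # unbiased, 1/(T - 1) denominator
VaR_par <- -(mu_hat + sig_hat * qnorm(alpha))                # VaR as a positive loss
ES_par  <- -mu_hat + sig_hat * dnorm(qnorm(alpha)) / alpha   # Equation (2.4)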

Figure 2.3: Parametric vs. empirical quantiles (BTC/USD).

Table 2.1: Parametric vs. empirical quantiles.

              1%      5%     10%
  Empirical  -10.17  -5.94  -4.0
  Student    -10.95  -5.65  -3.9
  Normal      -9.17  -6.45  -5.0

In Equation (2.4), \(\phi(x)\) is the probability density function (p.d.f.) of a standard normal distribution. As evidenced in Figure 2.1, periods of extreme returns are clustered and exhibit autocorrelation, e.g., extreme negative returns are more likely to happen during highly volatile periods. The generalized autoregressive conditional heteroskedasticity (GARCH) model introduced by Bollerslev (1986) tackles this problem; the GARCH(p,q) specification is shown below:

\[\begin{align} Y_t &\sim N(0, \sigma_t^2) \\ \sigma_t^2 &= \omega + \sum_{i=1}^p \alpha_i y_{t-i}^2 + \sum_{j=1}^q \beta_j \sigma_{t-j}^2 \tag{2.5} \end{align}\]

with the \(k\)-step-ahead forecast (for \(p = q = 1\)) given by

\[\begin{align} \sigma_t^2(k) &= \begin{cases} \hat{\omega} + \hat{\alpha}_1 y_t^2 + \hat{\beta}_1 \hat{\sigma}_t^2 & k = 1 \\ \hat{\omega} + (\hat{\alpha}_1 + \hat{\beta}_1) \hat{\sigma}_t^2(k - 1) & k > 1 \end{cases} \tag{2.6} \end{align}\]

The popular GARCH(1,1) model in Equation (2.7) below describes the variance \(\sigma_t^2\) using only three parameters since \(p = q = 1\).

\[\begin{align} \sigma_t^2 &= \omega + \alpha y_{t-1}^2 + \beta \sigma_{t-1}^2 \tag{2.7} \end{align}\]

In this case our parameter vector is \(\boldsymbol{\theta} = (\omega, \alpha, \beta) = (\theta_1,\theta_2,\theta_3)\). The well-known RiskMetrics model (Morgan and Reuters 1996) is a special case of Equation (2.7) where \(\omega = 0\), \(\alpha = (1 - \lambda)\) and \(\beta = \lambda\):

\[\begin{align} \sigma_t^2 &= (1 - \lambda) y_{t-1}^2 + \lambda \sigma_{t-1}^2. \tag{2.8} \end{align}\]
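As an illustration, Equation (2.8) can be filtered recursively in a few lines of R. This is a sketch only: \(\lambda = 0.94\) is the decay factor recommended for daily data in the RiskMetrics technical document, and the recursion is initialized at the sample variance:

# RiskMetrics (EWMA) variance filter, Equation (2.8)
lambda    <- 0.94
n         <- length(y)
sigma2    <- numeric(n)
sigma2[1] <- var(y)                    # initialize with the sample variance
for (t in 2:n) {
  sigma2[t] <- (1 - lambda) * y[t - 1]^2 + lambda * sigma2[t - 1]
}
VaR_rm <- -sqrt(sigma2[n]) * qnorm(0.05)   # one-step 5% VaR under a zero mean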

2.1.3 Semiparametric

The CAViaR model (Engle and Manganelli 2004) describes the behaviour of a quantile \(q_{\alpha}\) at level \(\alpha \in (0, 1)\) as an autoregressive process taking one of the following forms:

  • Adaptive: \[\begin{align} f_{t}(\boldsymbol{\theta}) &= f_{t-1}(\theta_1) + \theta_1 \left( \frac{1}{1+e^{k \cdot \left(y_{t-1} + f_{t-1}(\theta_1)\right)}} -\alpha \right) \tag{2.9} \end{align}\]

  • Symmetric absolute value: \[\begin{align} f_{t}(\boldsymbol{\theta}) &= \theta_1 + \theta_2 f_{t-1}(\boldsymbol{\theta}) + \theta_3 \lvert y_{t-1} \rvert \tag{2.10} \end{align}\]

  • Asymmetric slope: \[\begin{align} f_{t}(\boldsymbol{\theta}) &= \theta_1 + \theta_2 f_{t-1}(\boldsymbol{\theta}) + \theta_3 \lvert y_{t-1} \rvert I_{\left(y_{t-1} > 0\right)} + \theta_4 \lvert y_{t-1} \rvert I_{\left(y_{t-1} < 0\right)} \tag{2.11} \end{align}\]

  • Indirect GARCH(1,1): \[\begin{align} f_{t}(\boldsymbol{\theta}) &= \left[ \theta_1 + \theta_2 f_{t-1}^2(\boldsymbol{\theta}) + \theta_3 y_{t-1}^2 \right]^{1/2} \tag{2.12} \end{align}\]

  • Threshold CAViaR (Gerlach, Chen, and Chan 2011):

\[\begin{align} f_{t}(\boldsymbol{\theta}) = \begin{cases} \theta_1 + \theta_2 f_{t-1}(\boldsymbol{\theta}) + \theta_3 \lvert y_{t-1} \rvert ,& y_{t-1} \leq 0 \\ \theta_4 + \theta_5 f_{t-1}(\boldsymbol{\theta}) + \theta_6 \lvert y_{t-1} \rvert,& y_{t-1} > 0 \tag{2.13} \end{cases} \end{align}\]

where Gerlach, Chen, and Chan (2011) proposed a generalized form of the \(\text{CAViaR}\) model through Equation (2.13). The threshold variable \(y_{t-1}\) in the Threshold CAViaR (T-CAViaR) model above could also be an exogenous variable, and the threshold value could differ from \(0\). It is important to emphasize that \(f_{t}(\boldsymbol{\theta})\) returns the quantile of interest at level \(\alpha\), in contrast to the standard GARCH model in Equation (2.5), which models \(\sigma_t^2\). This direct modelling of \(\text{VaR}_\alpha\) at time \(t\) can be expressed as follows:

\[\begin{align} \text{VaR}_t = f_t(\boldsymbol{\theta} \vert \mathcal{I}_{t-1}) + \epsilon_t \tag{2.14} \end{align}\]

Regardless of the specifics of each model, we also need a way to obtain an estimate \(\boldsymbol{\hat{\theta}}\). In the following sections, we present the general optimization and Monte Carlo methods we tested for estimating the \(\text{CAViaR}\) model based on a vector of observed returns \(\boldsymbol{y} = (y_1, y_2, \dots, y_T)\).

2.2 Maximum Likelihood Estimation

The likelihood function \(L(\boldsymbol{\theta})\) of a model is defined as \[\begin{align} L(\boldsymbol{\theta}) = \prod_{t=1}^T f_{\boldsymbol{\theta}}(y_t), \tag{2.15} \end{align}\]

where \(\boldsymbol{y} = (y_1, y_2, \dots, y_T)\) represents the observed data vector. To find the maximum likelihood estimate (MLE) of \(\boldsymbol{\theta}\), we need to find a vector \(\boldsymbol{\theta} \in \Theta\) which maximizes \(L(\boldsymbol{\theta})\)⁷ given \(\boldsymbol{y}\). For this, we use the following objective function at quantile level \(\alpha\):

\[\begin{align} f(\boldsymbol{\theta}, \alpha, \boldsymbol{y}) = \frac{1}{T} \sum_{t=1}^{T}\left(\alpha - I_{\left(y_t < f_t(\boldsymbol{\theta})\right)} \right) (y_t - f_t(\boldsymbol{\theta})) \tag{2.16} \end{align}\]

Function (2.16) is the Regression Quantile Criterion proposed by Engle and Manganelli (2004). When using adaptive MCMC methods, the likelihood function is slightly modified, as shown by Gerlach, Chen, and Chan (2011), under the assumption that the error term in Equation (2.14) follows a Skewed-Laplace (SL) distribution (Yu and Moyeed 2001). To estimate the parameters of the \(\text{CAViaR}\) model, the natural logarithm of Equation (2.15) is given as the objective function to an optimizer, and the value of \(\boldsymbol{\theta}\) which maximizes the log-likelihood is our estimate \(\boldsymbol{\hat{\theta}}\).

Figure 2.4: Maximum likelihood estimation of the parameter vector.

The performance of this method depends on the complexity of the target distribution. Most numerical optimization methods require multiple mathematical assumptions about the shape of the likelihood function, and the validity of the resulting estimates depends on the asymptotic behaviour of the estimators; see Dorsey and Mayer (1995) for a summary of such difficulties.
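To make the estimation problem concrete, the following R sketch implements the Symmetric Absolute Value recursion (2.10) and the Regression Quantile criterion (2.16). The function names are ours, initializing \(f_1\) at the empirical quantile is one common convention, and the starting values passed to stats::optim() are illustrative only; our caviarma implementation differs in its details:

# Symmetric Absolute Value CAViaR, Equation (2.10)
caviar_sav <- function(theta, y, alpha) {
  f <- numeric(length(y))
  f[1] <- quantile(y, probs = alpha)   # initialize at the empirical quantile
  for (t in 2:length(y)) {
    f[t] <- theta[1] + theta[2] * f[t - 1] + theta[3] * abs(y[t - 1])
  }
  f
}

# Regression Quantile criterion, Equation (2.16), to be minimized
rq_criterion <- function(theta, y, alpha) {
  f <- caviar_sav(theta, y, alpha)
  mean((alpha - (y < f)) * (y - f))
}

# Minimize the criterion from a single (illustrative) starting point
fit <- optim(par = c(0, 0.9, -0.2), fn = rq_criterion, y = y, alpha = 0.05)

Minimizing this criterion plays the role of the likelihood maximization described above, given the correspondence between the check loss and the SL likelihood.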

2.2.1 Starting values

When looking for the global maximum of a likelihood function, we must pick a starting point \(\boldsymbol{\theta}^{(0)} \in \Theta\). The optimizer then applies some algorithm (Nelder-Mead, genetic, simulated annealing, etc.) and stops when one of the following happens: the maximum number of iterations has been reached, an error caused by the shape of the function has occurred, or the algorithm has converged.

Given the non-linearity of the \(\text{CAViaR}\) model, there is a good chance that any optimizer will get stuck at a local maximum. To handle this, it is common practice to create a grid in \(\Theta\) and run the numerical procedure several times, starting from each point in the grid. After doing this, we end up with as many local maxima as points in the grid. The global maximum is then identified from this set, as shown in Figure 2.5.

Figure 2.5: Local and global minima of a sample three-dimensional function.
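As an illustration, the multi-start strategy can be sketched in a few lines of R, reusing the rq_criterion function from the sketch above (the grid values are arbitrary placeholders):

# Multi-start optimization over a grid of starting values
starts <- expand.grid(theta1 = c(-0.5, 0, 0.5),
                      theta2 = c(0.5, 0.8, 0.95),
                      theta3 = c(-0.5, -0.2, 0))
fits <- apply(starts, 1, function(s)
  optim(par = as.numeric(s), fn = rq_criterion, y = y, alpha = 0.05))
best <- fits[[which.min(vapply(fits, function(f) f$value, numeric(1)))]]  # global candidate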

2.2.2 Optimization algorithms

To find the maximum likelihood estimator of the \(\text{CAViaR}\) family of models, we tested the following algorithms:⁸

  • Quasi-Newton (variable metric algorithm) as implemented by the stats::optim() function.
  • One-dimensional optimization by Brent (1973) for Equation (2.9) as implemented by stats::optim().
  • Genetic algorithm as implemented by the GA package (Scrucca 2021) through the GA::ga() function.
  • Particle swarm optimization by Clerc (2010) as implemented in the pso package (Bendtsen 2012) through the pso::psoptim() function.

2.3 Markov Chain Monte Carlo

In contrast to the frequentist (or classical) MLE method, MCMC algorithms are Bayesian methods which allow us to obtain a sample of \(\boldsymbol{\theta}\). It is this sample that is used to draw probabilistic conclusions about \(\boldsymbol{\theta}\), instead of relying on asymptotic results as in the MLE method. The Bayesian estimation method requires the use of the following expression (Ruppert and Matteson 2011):

\[\begin{align} \pi (\boldsymbol{\theta} \vert \boldsymbol{y}) &= \frac{\pi (\boldsymbol{\theta}) f(\boldsymbol{y} \vert \boldsymbol{\theta})}{f(\boldsymbol{y})} = \frac{\pi (\boldsymbol{\theta}) f(\boldsymbol{y} \vert \boldsymbol{\theta})}{\int_{\Theta} \pi (\boldsymbol{\theta}) f(\boldsymbol{y} \vert \boldsymbol{\theta}) d \boldsymbol{\theta}} \propto \pi (\boldsymbol{\theta}) f(\boldsymbol{y} \vert \boldsymbol{\theta}) \tag{2.17} \end{align}\]

where \(\pi \left(\boldsymbol{\theta} \vert \boldsymbol{y}\right)\) is called the posterior density, a function returning the distribution of \(\boldsymbol{\theta}\) conditional on the observed data \(\boldsymbol{y}\). Here \(\pi \left(\boldsymbol{\theta}\right)\) is the prior density, which represents our prior beliefs about the parameter vector \(\boldsymbol{\theta}\). In the MCMC implementations below, uninformative priors following a uniform distribution were used. The functions \(f(\boldsymbol{y})\) and \(f(\boldsymbol{y} \vert \boldsymbol{\theta})\) are the marginal and conditional densities, respectively.

Equation (2.17) is an important tool in Bayesian statistics as it tells us how to update the prior probabilities after observing the data; this relationship is sometimes summarized as:

\[\begin{align} \text{Posterior} \propto \text{Prior} \times \text{Likelihood} \end{align}\]

where \(\propto\) means proportionality up to a constant, i.e., \(1 / \int \pi (\boldsymbol{\theta}) f(\boldsymbol{y} \vert \boldsymbol{\theta}) d\boldsymbol{\theta}\) in Equation (2.17). Using Equation (2.17) and a vector of log-returns \(\boldsymbol{y}\), we obtain a Markov chain \(\boldsymbol{\theta}_{N \times d}\) with each column representing a variable \(\boldsymbol{\theta}_{i}, \forall i \in \{1, \dots, d\}\) and each row an iteration. This chain is the output sample of interest used for creating posterior intervals.

Before we go any deeper into the MCMC algorithm, we might wonder why we should use MCMC methods instead of classical frequentist methods. An important difference between frequentist and Bayesian methods is that the former consider the parameter vector \(\boldsymbol{\theta}\) as fixed (within a random interval) while the latter treat it as a random variable (Efron 1986). An important reason for choosing to work with Markov chains comes from the minimal requirements they place on the target distribution \(f\) (C. P. Robert and Casella 2009). The properties of the sequence \(\{\boldsymbol{\theta}^{(t)}\}\) make it usable in Bayesian analysis: such chains enjoy a strong stability property, and a stationary distribution \(f\) exists by construction such that:

\[\begin{align} \boldsymbol{\theta}^{(t)} \sim f, \text{ then } \boldsymbol{\theta}^{(t + 1)} \sim f \tag{2.18} \end{align}\]

The stationarity property due to the existence of \(f\) imposes a constraint of irreducibility on the transition kernel \(K(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t + 1)})\): the function \(K\) must allow the sequence to visit every region of the state space \(\Theta\). An important computational property of such Markov chains is ergodicity, meaning that the limiting distribution of \(\boldsymbol{\theta}^{(t)}\) is \(f\) regardless of the algorithm's initial value \(\boldsymbol{\theta}^{(0)}\). Simulating a long enough chain \(\boldsymbol{\theta}^{(t)}\) from \(K\) with stationary distribution \(f\) will therefore produce simulations from \(f\), which can be used to draw probabilistic conclusions about \(\boldsymbol{\theta}\).

Markov Chain Monte Carlo (MCMC) methods allow us to obtain a sample of \(\boldsymbol{\theta}\) from a target distribution using a Markov chain. When using this method, we obtain a sample of \(\boldsymbol{\theta}\) based on \(NN\) batches of chains of length \(N\). Instead of finding a vector which maximizes Equation (2.15), this method outputs multiple sample chains which are then used to obtain an estimate \(\boldsymbol{\hat{\theta}}\). The general algorithm is as follows:

Figure 2.6: MCMC algorithm

Algorithm:

Step 1: Pick the length of the chain \(N\) and the number of batches \(NN\)
Step 2: Obtain a sample matrix \(\boldsymbol{\theta_{1:N}}\)
Step 3: Set \(\boldsymbol{\theta_1} = \boldsymbol{\bar{\theta}_{1:N}}\)
Step 4: Repeat Step 1 to Step 3 \(NN\) times
Step 5: Set \(\boldsymbol{\hat{\theta}} = \boldsymbol{\bar{\theta}_{1:NN}}\)

where \(\boldsymbol{\bar{\theta}_{1:N}}\) denotes the vector of column means of \(\boldsymbol{\theta_{1:N}}\). The MCMC algorithms we used to estimate the \(\text{CAViaR}\) family of models fall under a general class of MCMC methods: the Metropolis–Hastings algorithm (Metropolis and Ulam 1949). Under this category, the following implementations are tested (a short R sketch of the batching scheme above follows this list):

  1. Adaptive random walk Metropolis–Hastings with Gaussian proposal, following Atchadé and Rosenthal (2005) and Chen and So (2006).
  2. Adaptive random walk Metropolis–Hastings with Student-t proposal during burn-in and independent kernel for sampling (Gerlach, Chen, and Chan 2011).
  3. Robust adaptive Metropolis (Vihola 2012).

For the first two implementations, a Metropolis within Gibbs strategy is used during adaptation, following Roberts and Rosenthal (2009). In the algorithms below, we use \(x\) and \(y\) to denote the current point in the chain \(\theta^{(t)}\) and the proposal \(Y_t\), following C. P. Robert and Casella (2009).
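Returning to the general batch algorithm above, Steps 2 to 5 amount to averaging batch means. A minimal R sketch, where run_chain() is a hypothetical stand-in for any of the samplers described below:

# Average the column means of NN chains of length N
N  <- 5000   # chain length (Step 1)
NN <- 10     # number of batches (Step 1)
theta_bars <- replicate(NN, colMeans(run_chain(N)))  # Steps 2-4: one batch mean per column
theta_hat  <- rowMeans(theta_bars)                   # Step 5: final estimate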

2.3.1 Standard Metropolis-Hastings

Algorithm 4 (C. P. Robert and Casella 2009)

Step 1: Generate \(Y_t \sim q\left(y \vert \theta^{(t)}\right)\).
Step 2: Set

\[\begin{align} \theta^{(t+1)} = \begin{cases} Y_t &\text{with probability } \rho\left(\theta^{(t)}, Y_t\right) \\ \theta^{(t)} &\text{with probability } 1 - \rho\left(\theta^{(t)}, Y_t\right) \tag{2.19} \end{cases} \end{align}\] where \[\begin{align} \rho(x,y) = \min \bigg\{ \frac{f(y)}{f(x)}\frac{q(x \vert y)}{q(y \vert x)}, 1 \bigg\} \end{align}\]

Below we show the equivalent lines of C++ code from our implementation in the caviarma package:

q_x = 0.0;  // log q(x | y): the proposal terms cancel for a symmetric kernel
q_y = 0.0;  // log q(y | x)
for (int i = burn_in; i < nsim; i++) {  // Sampling
  x.row(i) = x.row(i-1);                           // keep the current point by default
  Y = arma::mvnrnd(burn_in_mu, burn_in_sigma, 1);  // proposal, independent of x
  f_x = BetaPosterior(x.row(i).t());               // log posterior at the current point
  f_y = BetaPosterior(Y);                          // log posterior at the proposal

  prob = exp(f_y + q_x - f_x - q_y);               // acceptance ratio of Equation (2.19)
  rho = std::min(1.0, prob);

  if (arma::randu() < rho) {
    x.row(i) = Y.t();                              // accept the proposal
    sampling_counter++;                            // used to track the acceptance rate
  }
}  // End of Sampling

Where x.row(i) is the i-th row of an arma::mat object (Sanderson and Curtin 2016) which stores the value of \(\theta^{(t)}\). The output obtained by running the generic MCMC algorithm above is a \(\boldsymbol{\theta}_{N \times d}\) matrix. This implementation happens to be a special case of the more generic Algorithm 4 since the proposal \(q\) is a multivariate, centrally symmetric (around zero) distribution, i.e., \(q(\boldsymbol{\theta - 0}) = q(\boldsymbol{0 - \theta})\). We call this symmetric function \(g\); examples are the standard multivariate normal and standard Student-t distributions (Serfling 2014). This property implies that \({q(x \vert y)}/{q(y \vert x)} = 1\) in Equation (2.19). Since the proposed values for \(\theta^{(t+1)}\) in the above code are independent of \(\theta^{(t)}\), this is an independent random walk Metropolis–Hastings algorithm:

2.3.2 Independent Random Walk Metropolis

Algorithm 6 (C. P. Robert and Casella 2009)

Step 1: Generate \(Y_t \sim g\left(y\right)\).
Step 2: Set

\[ \theta^{(t+1)} = \begin{cases} Y_t &\text{with probability } \min \bigg\{ \frac{f(y)}{f(x)}, 1 \bigg\} \\ \theta^{(t)} &\text{with probability } 1 - \min \bigg\{ \frac{f(y)}{f(x)}, 1 \bigg\} \end{cases} \]

2.3.3 Posterior Intervals

We now turn our attention to the output matrix \(\boldsymbol{\theta}_{N \times d}\) and, for a scalar-valued function \(\psi = \psi(\boldsymbol{\theta})\), define the following statistics (Ruppert and Matteson 2011):

\[\begin{align} & \bar{\psi} = \frac{1}{N}\sum \psi_i && s_{\psi} = \left[\frac{1}{N-1}\sum \left(\psi_i - \bar{\psi}\right)^2\right]^{1/2} \tag{2.20} \end{align}\]

where \(\psi_i = \psi(\boldsymbol{\theta}_i), \forall i \in \{ 1, \dots, N\}\). The statistics defined in Equation (2.20) are the MCMC sample mean \(\bar{\psi}\), also called the posterior expectation \(E(\psi \vert \boldsymbol{y})\), and the Bayesian standard error \(s_{\psi}\). The ergodic theorem (C. P. Robert and Casella 2009) allows us to create the following interval for \(\boldsymbol{\theta}\):

\[\begin{align} \bar{\psi} \pm z_{\alpha/2} s_{\psi} \end{align}\]
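In R, assuming the chain is stored in an \(N \times d\) matrix theta and taking \(\psi\) to be the identity on each coordinate, these quantities are a few lines each (a sketch; the 95% level is illustrative):

# Posterior means, Bayesian standard errors and 95% posterior intervals
psi_bar <- colMeans(theta)        # posterior expectations, Equation (2.20)
s_psi   <- apply(theta, 2, sd)    # Bayesian standard errors
ci_95   <- rbind(lower = psi_bar - qnorm(0.975) * s_psi,
                 upper = psi_bar + qnorm(0.975) * s_psi)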

2.3.4 Convergence

The convergence of the Markov chain is guaranteed when following the standard versions of the Metropolis-Hastings algorithm. However, serious mathematical issues arise when the posterior distribution (2.17) is not properly defined, i.e., when the following condition is not respected (C. Robert 2007):

\[\begin{align} \int_{\Theta} \pi (\boldsymbol{\theta}) f(\boldsymbol{y} \vert \boldsymbol{\theta}) d \boldsymbol{\theta} < \infty \tag{2.21} \end{align}\]

As a consequence, the posterior distribution is not integrable and the chains cannot converge. To avoid this issue, and since improper priors are the usual cause of improper posteriors, we choose to work with uninformative priors following an integrable uniform distribution. Autocorrelation plots are a popular tool for visually verifying the convergence of a particular sample; for this, we use the coda::acfplot function from the coda package. However, Equation (2.21) must always hold for the algorithm to be of any use in inference.
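For example, assuming the output chain is stored in a matrix theta:

# Autocorrelation plots for each coordinate of the chain
library(coda)
acfplot(as.mcmc(theta))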

2.4 Adaptive MCMC Algorithms

Adaptive MCMC algorithms are used to speed up convergence and improve the mixing of the Markov chains. At each step of the simulation, a decision is made concerning the parameters of the proposal function \(q(x, y)\). For instance, increasing the variance \(\sigma^2\) of a normal proposal results in larger steps being taken when moving from \(\theta^{(t)}\) to \(\theta^{(t+1)}\). The adaptation step uses the chain's acceptance rate to decide whether a larger or a smaller step must be taken, i.e., the value of \(\sigma_{t+1}\) in Equation (2.23).

2.4.1 Adaptive RW-Metropolis

A standard case of the adaptive MCMC family is the adaptive random walk Metropolis–Hastings algorithm with proposal density \(q(x, y) = N(x, \sigma^2 I_d)\).

Algorithm (Atchadé and Rosenthal 2005)

Step 1: Generate \(Y_t \sim N\left(\theta^{(t)}, \sigma_t^2 I \right)\).
Step 2: Set

\[ \theta^{(t+1)} = \begin{cases} Y_t &\text{with probability } \min \bigg\{ \frac{f(y)}{f(x)}, 1 \bigg\}\\ \theta^{(t)} &\text{with probability } 1 - \min \bigg\{ \frac{f(y)}{f(x)}, 1 \bigg\} \tag{2.22} \end{cases} \]

Step 3: Compute

\[\begin{align} \sigma_{t+1}: \epsilon_1 \leq \sigma_t + \gamma_t(\rho(\theta^{(t)} , Y_{t})-\bar{\tau}) \leq A_1 \tag{2.23} \end{align}\]

Where \(0 < \epsilon_1 < A_1\), \(\rho(x,y) = \min \{ {f(y)}/{f(x)}, 1 \}\), \((\gamma_t)\) is a positive sequence of real numbers and \(\bar{\tau}\) is the acceptance rate in stationarity. The source code implementing this algorithm can be found in the caviarma::AMCMC() function, where the lower_bound and upper_bound variables correspond to \(\epsilon_1\) and \(A_1\) in Equation (2.23), i.e., the limits of \(\sigma_{t}\). Note that either a Normal or a Student-t distribution can be used for the proposal since both are centrally symmetric functions.
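The adaptation rule in Equation (2.23) reduces to a clamped one-line update; a minimal R sketch, in which the decay sequence \(\gamma_t = t^{-1/2}\), the target rate and the bounds are illustrative choices:

# Step 3 of the adaptive RW-Metropolis: update and clamp the proposal scale
update_sigma <- function(sigma, rho, t,
                         tau_bar = 0.234,     # target acceptance rate
                         lower_bound = 1e-6,  # epsilon_1
                         upper_bound = 10) {  # A_1
  gamma_t   <- t^(-0.5)                            # positive sequence decaying to zero
  sigma_new <- sigma + gamma_t * (rho - tau_bar)   # Equation (2.23)
  min(max(sigma_new, lower_bound), upper_bound)    # keep epsilon_1 <= sigma <= A_1
}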

2.4.2 Robust Adaptive Metropolis

This was the best-performing algorithm for estimating the \(\text{CAViaR}\) model, both in terms of speed and precision. The Robust Adaptive Metropolis (RAM) algorithm avoids using the empirical covariance matrix of \(\boldsymbol{\theta}_{N \times d}\), thereby sidestepping problems caused by target distributions without a finite second moment.

RAM Algorithm (Vihola 2012)

Step 1: Compute \(Y_t = \theta^{(t-1)} + S_{t-1}U_{t}\), where \(U_t \overset{\text{iid}}{\sim} q\).
Step 2: Set

\[ \theta^{(t)} = \begin{cases} Y_t &\text{with probability } \min \bigg\{ \frac{f(Y_t)}{f(\theta^{(t-1)})}, 1 \bigg\}\\ \theta^{(t-1)} &\text{with probability } 1 - \min \bigg\{ \frac{f(Y_t)}{f(\theta^{(t-1)})}, 1 \bigg\} \tag{2.24} \end{cases} \]

Step 3: Compute the \(S_t\) matrix satisfying:

\[\begin{align} S_{t}S_{t}^T &=S_{t-1}\left( I + \eta_t(\rho(\theta^{(t-1)}, Y_t) - \alpha_{\star}) \frac{U_tU_t^T}{\lVert U_t \rVert ^ 2} \right) S_{t-1}^T \tag{2.25} \end{align}\]

Where \(I \in \mathbb{R}^{d \times d}\) is the identity matrix, \(q\) is a spherically symmetric proposal density, \(S_t \in \mathbb{R}^{d \times d}\) is a lower-triangular matrix, \(\{\eta_t\}_{t \geq 1} \subset (0,1]\) is a sequence decaying to zero and \(\alpha_{\star}\) is the target acceptance rate of the algorithm, e.g., \(0.234\) for \(d = 6\) (Gelman, Gilks, and Roberts 1997).

The following C++ implementation is essentially a translation of the R implementation available through the adaptMCMC package (Scheidegger 2021). Here we chose to drop any outlier parameter vector which might arise due to numerical overflow in Step 3 of the RAM algorithm. In the implementation below, it is important to note that M always admits a Cholesky factor, since the right-hand side of Equation (2.25) is verified to be symmetric and positive definite in Proposition 1 of Vihola (2012).

for (int i = 1; i < nsim; i++) {   // Sampling
  U = rt(theta_d, nu);             // spherically symmetric Student-t increment
  Y = x.row(i - 1).t() + (S * U);  // Step 1: Y_t = theta^(t-1) + S_{t-1} U_t
  p_val_prop = test_mode ? arma_p_log(Y) : armaBetaPosterior(Y);  // log posterior at Y
  double prob = exp(p_val_prop - p_val(i - 1));  // ratio f(Y_t) / f(theta^(t-1))
  double alpha = std::min(1.0, prob);
  if (!std::isfinite(alpha)) alpha = 0;  // reject non-finite ratios
  if (arma::randu() < alpha) {
    x.row(i) = Y.t();  // Step 2: accept
    p_val(i) = p_val_prop;
    k++;               // acceptance counter
  } else {
    x.row(i) = x.row(i - 1);  // Step 2: reject, keep the current point
    p_val(i) = p_val(i - 1);
  }
  ii = i + n_start;
  if (ii < n_adapt) {  // Step 3: adapt the scale matrix S_t
    adapt_rate = std::min(1.0, theta_d * std::pow(ii, -gamma));  // eta_t
    M = S * (I_mat + adapt_rate * (alpha - acc_rate) * U * U.t() /
             arma::accu(arma::pow(U, 2))) * S.t();  // right-hand side of Equation (2.25)
    eig_val = arma::eig_gen(M);
    tol = M.n_cols * arma::max(arma::abs(eig_val)) * double_eps;
    if (!M.is_sympd() || !arma::imag(eig_val).is_zero() ||
        !arma::all(arma::real(eig_val) > tol)) {
      // Shouldn't happen as per Proposition 1 in Vihola (2012);
      // if it does (numerical overflow), the vector is dropped in R
    }
    S = arma::chol(M).t();  // lower-triangular Cholesky factor S_t
  }
}  // End of Sampling

2.4.3 Automatically Tuned Adaptive Metropolis

The last adaptive MCMC algorithm that will be tested was introduced by Yang and Rosenthal (2017) and is available through the atmcmc package (Yang 2014). It works by first using a Metropolis within Gibbs strategy (Roberts and Rosenthal 2009), as in the C++ implementation of Equation (2.22) above; this first step is called the first adaptive phase. Once an optimal one-dimensional acceptance rate of \(0.44\) (Roberts and Rosenthal 2009) has been reached for each \(\theta_{j}, \forall j \in \{1,2, \dots, d\}\), a transient phase starts, during which a non-adaptive Metropolis within Gibbs algorithm is used until the chain reaches the mode of the target distribution; a linear regression of the chain values is used to diagnose this. Once every coordinate \(\theta_{j}\) has flattened out, the second adaptive phase of the algorithm begins, during which the full proposal matrix \(\Sigma_p\) is updated. Finally, a last sampling phase takes place using a conventional non-adaptive Metropolis algorithm.


  6. We could also use histograms and quantile-quantile plots to better visualize this discrepancy.↩︎

  7. In practice, the natural logarithm of the likelihood function is used for easier computation and better floating point precision.↩︎

  8. The pso::psoptim() and GA::ga() algorithms require minimal input, but may take longer to converge.↩︎