Much of this vignette was originally published as Braun (2014). Please cite that article when using this package in your own research. This version of the vignette is abridged with respect to the underlying theory, and the comparison with other methods. It has more of a focus on how to use the package.

Why use trustOptim?

The need to optimize continuous nonlinear functions occurs frequently in statistics, most notably in maximum likelihood and maximum a posteriori (MAP) estimation. Users of R have a choice of dozens of optimization algorithms. The most readily available algorithms are those that are accessed from the optim function in the base R distribution, and from the many contributed packages that are described in the CRAN Task View for Optimization and Mathematical Programming (Theussel 2013). Any particular algorithm may be more appropriate for some problems than for others, and having such a large number of alternatives allows the informed R user to choose the best tool for the task at hand.

One limitation of most of these algorithms is that they can be difficult to use when there is a large number of decision variables. Search methods can be inefficient with a massive number of parameters because the search space is large, and they do not exploit information about slope and curvature to speed up the time to convergence. Conjugate gradient and quasi-Newton methods trace the curvature of the function by using successive gradients to approximate the inverse Hessian. However, if the algorithm stores the entire dense inverse Hessian, its use is resource-intensive when the number of parameters is large. For example, the Hessian for a 50,000 parameter model requires 20GB of RAM to store it as a standard, dense base R matrix.

The trustOptim package provides a trust region algorithm that is optimized for problems for which the Hessian is sparse. Sparse Hessians occur when a large number of the cross-partial derivatives of the objective function are zero. For example, suppose we want to find the mode of a log posterior density for a Bayesian hierarchical model. If we assume that individual-level parameter vectors βi\beta_i and βj\beta_j are conditionally independent, the cross-partial derivatives between all elements of βi\beta_i and βj\beta_j are zero. If the model includes a large number of heterogeneous units, and a relatively small number of population-level parameters, the proportion of non-zero entries in the Hessian will be small. Since we know up front which elements of the Hessian are non-zero, we need to compute, store, and operate on only those non-zero elements. By storing sparse Hessians in a compressed format, and using a library of numerical algorithms that are efficient for sparse matrices, we can run the optimization algorithms faster, with a smaller memory footprint, than algorithms that operate on dense Hessians.

The details of the trust region algorithm are included at the end of this vignette. The vignette for the sparseHessianFD package includes a more detailed discussion of Hessian sparsity patterns.

Example function

Before going into the details of how to use the package, let’s consider the following example of an objective function with a sparse Hessian. Suppose we have a dataset of NN households, each with TT opportunities to purchase a particular product. Let yiy_i be the number of times household ii purchases the product, out of the TT purchase opportunities. Furthermore, let pip_i be the probability of purchase; pip_i is the same for all TT opportunities, so we can treat yiy_i as a binomial random variable. The purchase probability pip_i is heterogeneous, and depends on both kk continuous covariates xix_i, and a heterogeneous coefficient vector βi\beta_i, such that pi=exp(xiβi)1+exp(xiβi),i=1...N p_i=\frac{\exp(x_i'\beta_i)}{1+\exp(x_i'\beta_i)},~i=1 ... N

The coefficients can be thought of as sensitivities to the covariates, and they are distributed across the population of households following a multivariate normal distribution with mean μ\mu and covariance Σ\Sigma. We assume that we know Σ\Sigma, but we do not know μ\mu. Instead, we place a multivariate normal prior on μ\mu, with mean 00 and covariance Ω0\Omega_0. Thus, each βi\beta_i, and μ\mu are kk-dimensional vectors, and the total number of unknown variables in the model is (N+1)k(N+1)k.

The log posterior density, ignoring any normalization constants, is logπ(β1:N,μ|Y,X,Σ0,Ω0)=i=1Npiyi(1pi)Tyi12(βiμ)Σ1(βiμ)12μΩ01μ \log \pi(\beta_{1:N},\mu|Y, X, \Sigma_0,\Omega_0)=\sum_{i=1}^Np_i^{y_i}(1-p_i)^{T-y_i} -\frac{1}{2}\left(\beta_i-\mu\right)'\Sigma^{-1}\left(\beta_i-\mu\right) -\frac{1}{2}\mu'\Omega_0^{-1}\mu

Since the βi\beta_i are drawn iid from a multivariate normal, 2logπβiβj=0\dfrac{\partial^2\log\pi }{\partial\beta_i\beta_j}=0 for all iji\neq j. We also know that all of the βi\beta_i are correlated with μ\mu. The structure of the Hessian depends on how the variables are ordered within the vector. One such ordering is to group all of the coefficients for each unit together.

β11,...,β1k,β21,...,β2k,...,...,βN1,...,βNk,μ1,...,μk \beta_{11},...,\beta_{1k},\beta_{21},...,\beta_{2k},...~,~...~,~\beta_{N1}~,~...~,~\beta_{Nk},\mu_1,...,\mu_k

In this case, the Hessian has a “block-arrow” structure. For example, if N=6N=6 and k=2k=2, then there are 14 total variables, and the Hessian will have the following pattern.

# 14 x 14 sparse Matrix of class "lgCMatrix"
#                                  
#  [1,] | | . . . . . . . . . . | |
#  [2,] | | . . . . . . . . . . | |
#  [3,] . . | | . . . . . . . . | |
#  [4,] . . | | . . . . . . . . | |
#  [5,] . . . . | | . . . . . . | |
#  [6,] . . . . | | . . . . . . | |
#  [7,] . . . . . . | | . . . . | |
#  [8,] . . . . . . | | . . . . | |
#  [9,] . . . . . . . . | | . . | |
# [10,] . . . . . . . . | | . . | |
# [11,] . . . . . . . . . . | | | |
# [12,] . . . . . . . . . . | | | |
# [13,] | | | | | | | | | | | | | |
# [14,] | | | | | | | | | | | | | |

There are 196 elements in this symmetric matrix, but only 76 are non-zero, and only 45 values are unique. Although the reduction in RAM from using a sparse matrix structure for the Hessian may be modest, consider what would happen if N=1000N=1000 instead. In that case, there are 2002 variables in the problem, and more than 44 million elements in the Hessian. However, only 1200412004 of those elements are non-zero. If we work with only the lower triangle of the Hessian we only need to work with only 7003 values.

Using the package

The functions for computing the objective function, gradient and Hessian for this example are in the R/binary.R file. The package also includes a sample dataset with simulated data from the binary choice example. This dataset can be access with the data(binary) call.

To start, we load the data, set some dimension parameters, set prior values for Σ1\Sigma^{-1} and Ω1\Omega^{-1}, and simulate a vector of variables at which to evaluate the function.

set.seed(123)
data(binary)
str(binary)
# List of 3
#  $ Y: int [1:200] 16 1 30 70 51 52 0 27 59 15 ...
#  $ X: num [1:2, 1:200] 1.5587 0.0705 0.1293 1.7151 0.4609 ...
#  $ T: num 100
N <- length(binary$Y)
k <- NROW(binary$X)
nvars <- as.integer(N*k + k)
start <- rnorm(nvars) ## random starting values
priors <- list(inv.Sigma = rWishart(1,k+5,diag(k))[,,1],
               inv.Omega = diag(k))

This dataset represents the simulated choices for N=200N= 200 customers over T=100T= 100 purchase opportunties, where the probability of purchase is influenced by k=2k= 2 covariates.

The objective function for the binary choice example is binary.f, the gradient function is binary.grad, and the Hessian function is binary.hess. The first argument to both is the variable vector, and the argument lists must be the same for both. For this example, we need to provide the data list “binary” (XX, YY and TT) and the prior parameter list (Σ1\Sigma^{-1} and Ω1\Omega^{-1}). The binary.hess function returns the Hessian as a dgCMatrix object, which is a compressed sparse matrix class defined in the Matrix package.

opt <- trust.optim(start, fn=binary.f,
                   gr = binary.grad,
                   hs = binary.hess,
                   method = "Sparse",
                   control = list(
                       start.trust.radius=5,
                       stop.trust.radius = 1e-7,
                       prec=1e-7,
                       report.precision=1L,
                       maxit=500L,
                       preconditioner=1L,
                       function.scale.factor=-1
                   ),
                   data=binary, priors=priors
                   )
# Beginning optimization
# 
# iter       f     nrm_gr                     status
#   1   15977.1    769.5     Continuing - TR expand
#   2   12359.7    278.2                 Continuing
#   3   12359.7    278.2   Continuing - TR contract
#   4   12059.7    261.9                 Continuing
#   5   12059.7    261.9   Continuing - TR contract
#   6   11670.0    310.8                 Continuing
#   7   11670.0    310.8   Continuing - TR contract
#   8   11440.0     67.5     Continuing - TR expand
#   9   11304.5      7.2     Continuing - TR expand
#  10   11303.8      1.0                 Continuing
#  11   11303.8      0.0                 Continuing
#  12   11303.8      0.0                 Continuing
#  13   11303.8      0.0                 Continuing
# 
# Iteration has terminated
#  13   11303.8      0.0                    Success

Control parameters

The control argument takes a list of options, all of which are described in the package manual. Most of these arguments are related to the internal workings of the trust region algorithm, such as how close a step needs to be to the border of the trust region before the region expands. However, there are a few arguments that deserve some special attention.

Scaling the objective function

The algorithms in the package minimize the objective function by default. When the function.scale.factor option is specified, the objective function, gradient and Hessian are all multiplied by that value throughout the optimization procedure. If function.scale.factor=-1, then then trust.optim will maximize the objective function.

Stopping criteria

The trust.optim function will stop when the Euclidean norm of the gradient is less that sqrt(nvars) * prec, where nvars is the length of the parameter vector, and prec is specified in the control list (the default is sqrt(.Machine\$double.eps), which is the square root of the computer’s floating point precision. However, sometimes the algorithm cannot get the gradient to be that flat. When that occurs, the trust region will shrink, until its radius is less than the value of the stop.trust.radius parameter. The algorithm will then stop with the message “Radius of trust region is less than stop.trust.radius.” This event is not necessarily a problem if the norm of the gradient is still small enough that the gradient is flat for all practical purposes. For example, suppose we set prec to be 10710^{-7} and that, for numerical reasons, the norm of the gradient simply cannot get below 10610^{-6}. If the norm of the gradient were the only stopping criterion, the algorithm would continue to run, even though it is probably close enough to the local optimum. With this alternative stopping criterion, the algorithm will also stop when it is clear that the algorithm can no longer take a step that leads to an improvement in the objective function.

There is, of course, a third stopping criterion. The maxit argument is the maximum number of iterations the algorithm should run before stopping. However, keep in mind that if the algorithm stops at maxit, it is almost certainly not at a local optimum.

Note that many other nonlinear optimizers, including optim, do not use the norm of the gradient as a stopping criterion. Instead, they stop when the absolute or relative changes in the objective function are less than some tolerance value. This often causes those optimizers to stop prematurely, when the estimates of the gradient and/or Hessian are not precise, or if there are some regions of the domain where the objective function is nearly flat. In theory, this should never happen, but in reality, it happens all the time. For an unconstrained optimization problem, there is no reason why the norm of the gradient should not be zero (within numerical precision) before the algorithm stops.

Preconditioners

Currently, the package offers two preconditioners: an identity preconditioner (no preconditioning), and an inexact modified Cholesky preconditioner, as in Algorithm 7.3 of Nocedal and Wright (2006). The identity and diagonal preconditioners are available for all of the methods. For the Sparse method, the modified Cholesky preconditioner will use a positive definite matrix that is close to the potentially indefinite Hessian (trust.optim does not require that the objective function be positive definite). For BFGS, the modified Cholesky preconditioner is available because BFGS updates are always positive definite. If the user selects a modified Cholesky preconditioner for SR1, the algorithm will use the identity preconditioner instead.

There is no general rule for selecting preconditioners. There will be a tradeoff between the number of iterations needs to solve the problem and the time it takes to compute any particular preconditioner. In some cases, the identity preconditioner may even solve the problem in fewer iterations than a modified Cholesky preconditioner.

Result object

The call ot trust.optim returns a list of values.

  • fval: the value of the objective function at the optimum
  • solution: the optimum
  • gradient: the gradient of the objective function at the optimum (all elements should be very close to zero)
  • hessian: the Hessian of the objective function at the optimum, as an object of class dsCMatrix.
  • iterations: number of iterations
  • status: A status message (should be “Success”), or possibly a note that the trust region radius is less than stop.trust.region.
  • trust.radius: trust region radius when the algorithm stopped.
  • nnz: number of nonzero elements in the lower triangle of the Hessian
  • method: the optimization method that was used (Sparse, SR1 or BFGS).
  • nnz: for the Sparse method only, the number of nonzero elements in the Hessian.

See the package manual for more details.

Algorithmic details

Consider f(x)f(x), an objective function over a PP-dimensional vector that we want to minimize. Let gg be the gradient, and let BB be the Hessian. The goal is to find a local minimum of f(x)f(x), with no constraints on xx, within some window of numerical precision (i.e., where g2/p<ϵ\|g\|_2 / \sqrt{p}<\epsilon for small ϵ>0\epsilon>0). We will assume that BB is positive definite at the local optimum, but not necessarily at other values of xx. Iterations are indexed by tt.

Trust region methods for nonlinear optimization

The details of trust region methods are described in both Nocedal and Wright (2006) and Conn et al. (2000), and the following exposition borrows heavily from both sources. At each iteration of a trust region algorithm, we construct a quadratic approximation to the objective function at xtx_t, and minimize that approximation, subject to a constraint that the solution falls within a trust region with radius dtd_t. More formally, each iteration of the trust region algorithm involves solving the “trust region subproblem,” or TRS.

minsRpf*(s)=f(xt)+gts+12sBtss.t. sMdtst=argminsRpf*(s)s.t. sMdt \begin{align} \tag{TRS}\label{eq:TRS} \min_{s\in R^p} f^*(s)& = f(x_t) + g_t^\top s + \frac{1}{2}s^\top B_ts\qquad\text{s.t. }\|s\|_M\leq d_t\\ s_t&=\arg\min_{s\in R^p} f^*(s) \qquad\text{s.t. }\|s\|_M\leq d_t \end{align}

The norm M\|\cdot\|_M is a Mahalanobis norm with respect to some positive definite matrix MM.

Let sts_t be the solution to the for iteration tt, and consider the ratio ρt=f(xt)f(xt+st)f*(xt)f*(xt+st)\begin{align} \label{eq:2} \rho_t=\frac{f(x_t)-f(x_t+s_t)}{f^*(x_t)-f^*(x_t+s_t)} \end{align} This ratio is the improvement in the objective function that we would get from a move from xtx_t to xt+1x_{t+1}, where xt+1=xt+stx_{t+1}=x_t+s_t, relative to the improvement that is predicted by the quadratic approximation. Let η1\eta_1 be the minimum value of ρt\rho_t for which we deem it “worthwhile” to move from xtx_t to xt+1x_{t+1}, and let η2\eta_2 be the maximum ρt\rho_t that would trigger a shrinkage in the trust region. If ρt<η2\rho_t < \eta_2, or if f(xt+st)f(x_t+s_t) is not finite, we shrink the trust region by reducing dtd_t by some predetermined factor, and compute a new sts_t by solving the again. If ρt>η1\rho_t>\eta_1, we move to xt+1=xt+stx_{t+1}=x_t+s_t. Also, if we do accept the move, and sts_t is on the border of the trust region, we expand the trust region by increasing dtd_t, again by some predetermined factor. The idea is to not move to a new xx if f(xt+1)f(x_{t+1}) would be worse than f(xt)f(x_t). By expanding the trust region, we can propose larger jumps, and potentially reach the optimum more quickly. We want to propose only moves that are among those that we “trust” to give reasonable values of f(x)f(x). If it turns out that a move leads to a large improvement in the objective function, and that the proposed move was constrained by the radius of the trust region, we want to expand the trust region so we can take larger steps. If the proposed move is bad, we should then reduce the size of the region we trust, and try to find another step that is closer to the current iterate. Of course, there is no reason that the trust region needs to change after a particular iteration, especially if the solution to the TRS is at an internal point.

There are a number of different ways to solve the TRS; Conn et al. (2000) is authoritative and encyclopedic in this area. The trustOptim package uses the method described in Steihaug (1983). The Steihaug algorithm is, essentially, a conjugate gradient solver for a constrained quadratic program. If BtB_t is positive definite, the Steihaug solution to the will be exact, up to some level of numerical precision. However, if BtB_t is indefinite, the algorithm could try to move in a direction of negative curvature. If the algorithm happens to stumble on such a direction, it goes back to the last direction that it moved, runs in that direction to the border of the trust region, and returns that point of intersection with the trust region border as the “solution” to the . This solution is not necessarily the true minimizer of the , but it still might provide sufficient improvement in the objective function such that ρt>η1\rho_t>\eta_1. If not, we shrink the trust region and try again. As an alternative to the Steihaug algorithm for solving the , Conn et al. (2000) suggest using the Lanczos algorithm. The Lanczos approach may be more likely to find a better solution to the TRS when BtB_t is indefinite, but at some additional computational cost. We include only the Steihaug algorithm for now, because it still seems to work well, especially for sparse problems.

As with other conjugate gradient methods, one way to speed up the Steihaug algorithm is to rescale the trust region subproblem with a preconditioner MM. Note that the constraint in is expressed as an MM-norm, rather than an Euclidean norm. The positive definite matrix MM should be close enough to the Hessian that M1BtIM^{-1}B_t\approx I, but still cheap enough to compute that the cost of using the preconditioner does not exceed the benefits. Of course, the ideal preconditioner would be BtB_t itself, but BtB_t is not necessarily positive definite, and we may not be able to estimate it fast enough for preconditioning to be worthwhile. In this case, one could use a modified Cholesky decomposition, as described in Nocedal and Wright (2006).

References

Braun, Michael. 2014. trustOptim: An R Package for Trust Region Optimization with Sparse Hessians.” Journal of Statistical Software 60 (4): 1–16. https://www.jstatsoft.org/v60/i04/.
Conn, Andrew R, Nicholas I M Gould, and Philippe L Toint. 2000. Trust-Region Methods. SIAM-MPS.
Nocedal, Jorge, and Stephen J Wright. 2006. Numerical Optimization. Second edition. Springer-Verlag.
Steihaug, Trond. 1983. “The Conjugate Gradient Method and Trust Regions in Large Scale Optimization.” SIAM Journal on Numerical Analysis 20 (3): 626–37. https://doi.org/10.1137/0720042.
Theussel, Stefan. 2013. CRAN Task View: Optimization and Mathematical Programming.” https://cran.r-project.org/view=Optimization.