 # Posit AI Weblog: torch for optimization

To this point, all `torch` use circumstances we’ve mentioned right here have been in deep studying. Nonetheless, its computerized differentiation characteristic is helpful in different areas. One distinguished instance is numerical optimization: We will use `torch` to search out the minimal of a operate.

In reality, operate minimization is precisely what occurs in coaching a neural community. However there, the operate in query usually is way too advanced to even think about discovering its minima analytically. Numerical optimization goals at increase the instruments to deal with simply this complexity. To that finish, nonetheless, it begins from capabilities which might be far much less deeply composed. As an alternative, they’re hand-crafted to pose particular challenges.

This submit is a primary introduction to numerical optimization with `torch`. Central takeaways are the existence and usefulness of its L-BFGS optimizer, in addition to the impression of working L-BFGS with line search. As a enjoyable add-on, we present an instance of constrained optimization, the place a constraint is enforced through a quadratic penalty operate.

To heat up, we take a detour, minimizing a operate “ourselves” utilizing nothing however tensors. It will grow to be related later, although, as the general course of will nonetheless be the identical. All modifications can be associated to integration of `optimizer`s and their capabilities.

## Perform minimization, DYI strategy

To see how we will decrease a operate “by hand”, let’s attempt the enduring Rosenbrock function. This can be a operate with two variables:

[
f(x_1, x_2) = (a – x_1)^2 + b * (x_2 – x_1^2)^2
]

, with (a) and (b) configurable parameters typically set to 1 and 5, respectively.

In R:

``````library(torch)

a <- 1
b <- 5

rosenbrock <- operate(x) {
x1 <- x
x2 <- x
(a - x1)^2 + b * (x2 - x1^2)^2
}``````

Its minimal is positioned at (1,1), inside a slender valley surrounded by breakneck-steep cliffs: Determine 1: Rosenbrock operate.

Our aim and technique are as follows.

We need to discover the values (x_1) and (x_2) for which the operate attains its minimal. We’ve got to begin someplace; and from wherever that will get us on the graph we observe the detrimental of the gradient “downwards”, descending into areas of consecutively smaller operate worth.

Concretely, in each iteration, we take the present ((x1,x2)) level, compute the operate worth in addition to the gradient, and subtract some fraction of the latter to reach at a brand new ((x1,x2)) candidate. This course of goes on till we both attain the minimal – the gradient is zero – or enchancment is beneath a selected threshold.

Right here is the corresponding code. For no particular causes, we begin at `(-1,1)` . The educational charge (the fraction of the gradient to subtract) wants some experimentation. (Attempt 0.1 and 0.001 to see its impression.)

``````num_iterations <- 1000

# fraction of the gradient to subtract
lr <- 0.01

# operate enter (x1,x2)
# that is the tensor w.r.t. which we'll have torch compute the gradient
x_star <- torch_tensor(c(-1, 1), requires_grad = TRUE)

for (i in 1:num_iterations) {

if (i %% 100 == 0) cat("Iteration: ", i, "n")

# name operate
worth <- rosenbrock(x_star)
if (i %% 100 == 0) cat("Worth is: ", as.numeric(worth), "n")

# compute gradient of worth w.r.t. params
worth\$backward()

# guide replace
})
}``````
``````Iteration:  100
Worth is:  0.3502924

Iteration:  200
Worth is:  0.07398106

...
...

Iteration:  900
Worth is:  0.0001532408

Iteration:  1000
Worth is:  6.962555e-05

Whereas this works, it actually serves as an instance the precept. With `torch` offering a bunch of confirmed optimization algorithms, there isn’t a want for us to manually compute the candidate (mathbf{x}) values.

## Perform minimization with `torch` optimizers

As an alternative, we let a `torch` optimizer replace the candidate (mathbf{x}) for us. Habitually, our first attempt is Adam.

With Adam, optimization proceeds rather a lot quicker. Reality be instructed, although, selecting studying charge nonetheless takes non-negligeable experimentation. (Attempt the default studying charge, 0.001, for comparability.)

``````num_iterations <- 100

x_star <- torch_tensor(c(-1, 1), requires_grad = TRUE)

lr <- 1

for (i in 1:num_iterations) {

if (i %% 10 == 0) cat("Iteration: ", i, "n")

worth <- rosenbrock(x_star)
if (i %% 10 == 0) cat("Worth is: ", as.numeric(worth), "n")

worth\$backward()
optimizer\$step()

}``````
``````Iteration:  10
Worth is:  0.8559565

Iteration:  20
Worth is:  0.1282992

...
...

Iteration:  90
Worth is:  4.003079e-05

Iteration:  100
Worth is:  6.937736e-05

It took us a few hundred iterations to reach at a good worth. This can be a lot quicker than the guide strategy above, however nonetheless rather a lot. Fortunately, additional enhancements are doable.

### L-BFGS

Among the many many `torch` optimizers generally utilized in deep studying (Adam, AdamW, RMSprop …), there’s one “outsider”, significantly better identified in basic numerical optimization than in neural-networks area: L-BFGS, a.okay.a. Limited-memory BFGS, a memory-optimized implementation of the Broyden–Fletcher–Goldfarb–Shanno optimization algorithm (BFGS).

BFGS is probably essentially the most broadly used among the many so-called Quasi-Newton, second-order optimization algorithms. Versus the household of first-order algorithms that, in deciding on a descent route, make use of gradient data solely, second-order algorithms moreover take curvature data into consideration. To that finish, precise Newton strategies really compute the Hessian (a pricey operation), whereas Quasi-Newton strategies keep away from that value and, as a substitute, resort to iterative approximation.

Wanting on the contours of the Rosenbrock operate, with its extended, slender valley, it isn’t troublesome to think about that curvature data would possibly make a distinction. And, as you’ll see in a second, it actually does. Earlier than although, one be aware on the code. When utilizing L-BFGS, it’s essential to wrap each operate name and gradient analysis in a closure (`calc_loss()`, within the beneath snippet), for them to be callable a number of occasions per iteration. You possibly can persuade your self that the closure is, in truth, entered repeatedly, by inspecting this code snippet’s chatty output:

``````num_iterations <- 3

x_star <- torch_tensor(c(-1, 1), requires_grad = TRUE)

optimizer <- optim_lbfgs(x_star)

calc_loss <- operate() {

worth <- rosenbrock(x_star)
cat("Worth is: ", as.numeric(worth), "n")

worth\$backward()
worth

}

for (i in 1:num_iterations) {
cat("Iteration: ", i, "n")
optimizer\$step(calc_loss)
}``````
``````Iteration:  1
Worth is:  4

Worth is:  6

...
...

Worth is:  0.04880721

Worth is:  0.0302862

Iteration:  2
Worth is:  0.01697086

Worth is:  0.01124081

...
...

Worth is:  1.111701e-09

Worth is:  4.547474e-12

Iteration:  3
Worth is:  4.547474e-12

Though we ran the algorithm for 3 iterations, the optimum worth actually is reached after two. Seeing how nicely this labored, we attempt L-BFGS on a tougher operate, named flower, for fairly self-evident causes.

## (But) extra enjoyable with L-BFGS

Right here is the flower operate. Mathematically, its minimal is close to `(0,0)`, however technically the operate itself is undefined at `(0,0)`, for the reason that `atan2` used within the operate is just not outlined there.

``````a <- 1
b <- 1
c <- 4

flower <- operate(x) {
a * torch_norm(x) + b * torch_sin(c * torch_atan2(x, x))
}`````` Determine 2: Flower operate.

We run the identical code as above, ranging from `(20,20)` this time.

``````num_iterations <- 3

x_star <- torch_tensor(c(20, 0), requires_grad = TRUE)

optimizer <- optim_lbfgs(x_star)

calc_loss <- operate() {

worth <- flower(x_star)
cat("Worth is: ", as.numeric(worth), "n")

worth\$backward()

cat("X is: ", as.matrix(x_star), "nn")

worth

}

for (i in 1:num_iterations) {
cat("Iteration: ", i, "n")
optimizer\$step(calc_loss)
}``````
``````Iteration:  1
Worth is:  28.28427
X is:  20 20

...
...

Worth is:  19.33546
X is:  12.957 14.68274

...
...

Worth is:  18.29546
X is:  12.14691 14.06392

...
...

Worth is:  9.853705
X is:  5.763702 8.895616

Worth is:  2635.866
X is:  -1949.697 -1773.551

Iteration:  2
Worth is:  1333.113
X is:  -985.4553 -897.5367

Worth is:  30.16862
X is:  -21.02814 -21.72296

Worth is:  1281.39
X is:  964.0121 843.7817

Worth is:  628.1306
X is:  475.7051 409.7372

Worth is:  4965690
X is:  -3721262 -3287901

Worth is:  2482306
X is:  -1862675 -1640817

Worth is:  8.61863e+11
X is:  645200412672 571423064064

Worth is:  430929412096
X is:  322643460096 285659529216

Worth is:  Inf
X is:  -2.826342e+19 -2.503904e+19

Iteration:  3
Worth is:  Inf
X is:  -2.826342e+19 -2.503904e+19 ``````

This has been much less of a hit. At first, loss decreases properly, however all of the sudden, the estimate dramatically overshoots, and retains bouncing between detrimental and optimistic outer area ever after.

Fortunately, there’s something we will do.

Taken in isolation, what a Quasi-Newton methodology like L-BFGS does is decide one of the best descent route. Nonetheless, as we simply noticed, route is just not sufficient. With the flower operate, wherever we’re, the optimum path results in catastrophe if we keep on it lengthy sufficient. Thus, we want an algorithm that fastidiously evaluates not solely the place to go, but additionally, how far.

For that reason, L-BFGS implementations generally incorporate line search, that’s, a algorithm indicating whether or not a proposed step size is an effective one, or ought to be improved upon.

Particularly, `torch`’s L-BFGS optimizer implements the Strong Wolfe conditions. We re-run the above code, altering simply two strains. Most significantly, the one the place the optimizer is instantiated:

``optimizer <- optim_lbfgs(x_star, line_search_fn = "strong_wolfe")``

And secondly, this time I discovered that after the third iteration, loss continued to lower for some time, so I let it run for 5 iterations. Right here is the output:

``````Iteration:  1
...
...

Worth is:  -0.8838741
X is:  0.09035123 -0.03220009

Worth is:  -0.928809
X is:  0.06564617 -0.026706

Iteration:  2
...
...

Worth is:  -0.9991404
X is:  0.0006493925 -0.0002656128

Worth is:  -0.9992246
X is:  0.0007130796 -0.0002947929

Iteration:  3
...
...

Worth is:  -0.9997789
X is:  0.0002042478 -8.457939e-05

Worth is:  -0.9998025
X is:  0.0001822711 -7.553725e-05

Iteration:  4
...
...

Worth is:  -0.9999917
X is:  -6.320081e-06 2.614706e-06

Worth is:  -0.9999923
X is:  -6.921942e-06 2.865841e-06

Iteration:  5
...
...

Worth is:  -0.9999999
X is:  -7.267168e-08 3.009783e-08

Worth is:  -0.9999999
X is:  -7.404627e-08 3.066708e-08 ``````

It’s nonetheless not excellent, however rather a lot higher.

Lastly, let’s go one step additional. Can we use `torch` for constrained optimization?

### Quadratic penalty for constrained optimization

In constrained optimization, we nonetheless seek for a minimal, however that minimal can’t reside simply wherever: Its location has to satisfy some variety of extra circumstances. In optimization lingo, it must be possible.

For instance, we stick with the flower operate, however add on a constraint: (mathbf{x}) has to lie exterior a circle of radius (sqrt(2)), centered on the origin. Formally, this yields the inequality constraint

[
2 – {x_1}^2 – {x_2}^2 <= 0
]

A technique to decrease flower and but, on the identical time, honor the constraint is to make use of a penalty operate. With penalty strategies, the worth to be minimized is a sum of two issues: the goal operate’s output and a penalty reflecting potential constraint violation. Use of a quadratic penalty, for instance, ends in including a a number of of the sq. of the constraint operate’s output:

``````# x^2 + y^2 >= 2
# 2 - x^2 - y^2 <= 0
constraint <- operate(x) 2 - torch_square(torch_norm(x))

penalty <- operate(x) torch_square(torch_max(constraint(x), different = 0))``````

A priori, we will’t know the way large that a number of must be to implement the constraint. Due to this fact, optimization proceeds iteratively. We begin with a small multiplier, (1), say, and improve it for so long as the constraint remains to be violated:

``````penalty_method <- operate(f, p, x, k_max, rho = 1, gamma = 2, num_iterations = 1) {

for (okay in 1:k_max) {
cat("Beginning step: ", okay, ", rho = ", rho, "n")

decrease(f, p, x, rho, num_iterations)

cat("Worth: ",  as.numeric(f(x)), "n")
cat("X: ",  as.matrix(x), "n")

current_penalty <- as.numeric(p(x))
cat("Penalty: ", current_penalty, "n")
if (current_penalty == 0) break

rho <- rho * gamma
}

}``````

`decrease()`, referred to as from `penalty_method()`, follows the same old proceedings, however now it minimizes the sum of the goal and up-weighted penalty operate outputs:

``````decrease <- operate(f, p, x, rho, num_iterations) {

calc_loss <- operate() {
worth <- f(x) + rho * p(x)
worth\$backward()
worth
}

for (i in 1:num_iterations) {
cat("Iteration: ", i, "n")
optimizer\$step(calc_loss)
}

}``````

This time, we begin from a low-target-loss, however unfeasible worth. With one more change to default L-BFGS (specifically, a lower in tolerance), we see the algorithm exiting efficiently after twenty-two iterations, on the level `(0.5411692,1.306563)`.

``````x_star <- torch_tensor(c(0.5, 0.5), requires_grad = TRUE)

optimizer <- optim_lbfgs(x_star, line_search_fn = "strong_wolfe", tolerance_change = 1e-20)

penalty_method(flower, penalty, x_star, k_max = 30)``````
``````Beginning step:  1 , rho =  1
Iteration:  1
Worth:  0.3469974
X:  0.5154735 1.244463
Penalty:  0.03444662

Beginning step:  2 , rho =  2
Iteration:  1
Worth:  0.3818618
X:  0.5288152 1.276674
Penalty:  0.008182613

Beginning step:  3 , rho =  4
Iteration:  1
Worth:  0.3983252
X:  0.5351116 1.291886
Penalty:  0.001996888

...
...

Beginning step:  20 , rho =  524288
Iteration:  1
Worth:  0.4142133
X:  0.5411959 1.306563
Penalty:  3.552714e-13

Beginning step:  21 , rho =  1048576
Iteration:  1
Worth:  0.4142134
X:  0.5411956 1.306563
Penalty:  1.278977e-13

Beginning step:  22 , rho =  2097152
Iteration:  1
Worth:  0.4142135
X:  0.5411962 1.306563
Penalty:  0 ``````

## Conclusion

Summing up, we’ve gotten a primary impression of the effectiveness of `torch`’s L-BFGS optimizer, particularly when used with Sturdy-Wolfe line search. In reality, in numerical optimization – versus deep studying, the place computational pace is way more of a difficulty – there’s infrequently a purpose to not use L-BFGS with line search.

We’ve then caught a glimpse of how you can do constrained optimization, a activity that arises in lots of real-world functions. In that regard, this submit feels much more like a starting than a stock-taking. There’s a lot to discover, from common methodology match – when is L-BFGS nicely suited to an issue? – through computational efficacy to applicability to completely different species of neural networks. Evidently, if this evokes you to run your personal experiments, and/or should you use L-BFGS in your personal tasks, we’d love to listen to your suggestions!

Thanks for studying!

## Appendix

### Rosenbrock operate plotting code

``````library(tidyverse)

a <- 1
b <- 5

rosenbrock <- operate(x) {
x1 <- x
x2 <- x
(a - x1)^2 + b * (x2 - x1^2)^2
}

df <- expand_grid(x1 = seq(-2, 2, by = 0.01), x2 = seq(-2, 2, by = 0.01)) %>%
rowwise() %>%
mutate(x3 = rosenbrock(c(x1, x2))) %>%
ungroup()

ggplot(knowledge = df,
aes(x = x1,
y = x2,
z = x3)) +
geom_contour_filled(breaks = as.numeric(torch_logspace(-3, 3, steps = 50)),
present.legend = FALSE) +
theme_minimal() +
scale_fill_viridis_d(route = -1) +
theme(facet.ratio = 1)``````

### Flower operate plotting code

``````a <- 1
b <- 1
c <- 4

flower <- operate(x) {
a * torch_norm(x) + b * torch_sin(c * torch_atan2(x, x))
}

df <- expand_grid(x = seq(-3, 3, by = 0.05), y = seq(-3, 3, by = 0.05)) %>%
rowwise() %>%
mutate(z = flower(torch_tensor(c(x, y))) %>% as.numeric()) %>%
ungroup()

ggplot(knowledge = df,
aes(x = x,
y = y,
z = z)) +
geom_contour_filled(present.legend = FALSE) +
theme_minimal() +
scale_fill_viridis_d(route = -1) +
theme(facet.ratio = 1)``````

Picture by Michael Trimble on Unsplash

Share