gradientn()

We will numerically verify the gradientn() function for two hyper2 objects.
The chess dataset is as follows:
chess
## log(Anand^36 * (Anand + Karpov)^-35 * (Anand + Topalov)^-35 * Karpov^22
## * (Karpov + Topalov)^-18 * Topalov^30)
We will define two gradient functions: chess_gradient_theoretical(), which calculates the partial derivatives using straightforward algebra (specifically elementary calculus, \(\partial(x+y)^n/\partial x=n(x+y)^{n-1}\), whence \(\partial\log\left((x+y)^n\right)/\partial x=n/(x+y)\)), and chess_gradient_numerical(), which uses C code.
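As a quick sanity check on the calculus (base R, not part of the package), D() recovers both forms symbolically:

D(quote((x + y)^n), "x")        # n*(x + y)^(n-1), the rule quoted above
D(quote(log((x + y)^n)), "x")   # simplifies to n/(x + y), the form appearing below

First, the theoretical version: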
chess_gradient_theoretical <- function(x){
    Topalov <- x[1]
    Anand   <- x[2]
    Karpov  <- x[3]
    out <- c(
        Topalov = 30/Topalov -35/(Topalov+Anand) -18/(Topalov+Karpov),
        Anand   = -35/(Topalov+Anand) +36/Anand -35/(Anand+Karpov),
        Karpov  = -18/(Topalov+Karpov) -35/(Anand+Karpov) +22/Karpov
    )
    names(out) <- pnames(chess)
    out
}
chess_gradient_numerical <- function(x){gradientn(chess,x)}
So in the above we have two ways of calculating the gradient: one theoretical and one numerical. Do they agree?
p_chess <- c(
    Topalov = 0.5,
    Anand   = 0.3,
    Karpov  = 0.2
)
chess_gradient_theoretical(p_chess) - chess_gradient_numerical(p_chess)
## Topalov Anand Karpov
## 7.1054e-15 0.0000e+00 -7.1054e-15
Yes, they agree to machine precision.
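As an independent check, we can evaluate the log-likelihood directly from its bracket-power form and difference it numerically. This is a sketch only: it assumes that brackets() returns each bracket as a character vector of player names, and the step size is chosen ad hoc:

f_chess <- function(p){   # direct log-likelihood: sum of power * log(bracket sum)
    sum(powers(chess) * sapply(brackets(chess), function(b) log(sum(p[b]))))
}
fd <- sapply(1:3, function(i){   # central finite differences, step 1e-6
    h <- rep(0,3); h[i] <- 1e-6
    (f_chess(p_chess + h) - f_chess(p_chess - h))/2e-6
})
fd - chess_gradient_theoretical(p_chess)   # should again vanish to rounding error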
The icons dataset is:
icons
## log(L^24 * (L + NB + OA + THC)^-20 * (L + NB + OA + WAIS)^-9 * (L + NB
## + THC + WAIS)^-15 * (L + OA + PB + THC)^-11 * (L + OA + PB + WAIS)^-18
## * (L + PB + THC + WAIS)^-16 * NB^32 * (NB + OA + PB + THC)^-18 * (NB +
## OA + PB + WAIS)^-8 * (NB + PB + THC + WAIS)^-18 * OA^14 * PB^30 *
## THC^24 * WAIS^9)
We can do the same thing but the algebra is more involved:
icons_gradient_theoretical <- function(x){
    NB   <- x[1]
    L    <- x[2]
    PB   <- x[3]
    THC  <- x[4]
    OA   <- x[5]
    WAIS <- x[6]
    out <- c(
        NB = (
            +32/NB
            -20/(NB + L + THC + OA)
            -15/(NB + L + THC + WAIS)
            -09/(NB + L + OA + WAIS)
            -18/(NB + PB + THC + OA)
            -18/(NB + PB + THC + WAIS)
            -08/(NB + PB + OA + WAIS)
        ),
        L = (
            -20/(NB + L + THC + OA)
            -15/(NB + L + THC + WAIS)
            -09/(NB + L + OA + WAIS)
            +24/L
            -11/(L + PB + THC + OA)
            -16/(L + PB + THC + WAIS)
            -18/(L + PB + OA + WAIS)
        ),
        PB = (
            -18/(NB + PB + THC + OA)
            -18/(NB + PB + THC + WAIS)
            -08/(NB + PB + OA + WAIS)
            -11/(L + PB + THC + OA)
            -16/(L + PB + THC + WAIS)
            -18/(L + PB + OA + WAIS)
            +30/PB
        ),
        THC = (
            -20/(NB + L + THC + OA)
            -15/(NB + L + THC + WAIS)
            -18/(NB + PB + THC + OA)
            -18/(NB + PB + THC + WAIS)
            -11/(L + PB + THC + OA)
            -16/(L + PB + THC + WAIS)
            +24/THC
        ),
        OA = (
            -20/(NB + L + THC + OA)
            -09/(NB + L + OA + WAIS)
            -18/(NB + PB + THC + OA)
            -08/(NB + PB + OA + WAIS)
            -11/(L + PB + THC + OA)
            -18/(L + PB + OA + WAIS)
            +14/OA
        ),
        WAIS = (
            -15/(NB + L + THC + WAIS)
            -09/(NB + L + OA + WAIS)
            -18/(NB + PB + THC + WAIS)
            -08/(NB + PB + OA + WAIS)
            -16/(L + PB + THC + WAIS)
            -18/(L + PB + OA + WAIS)
            +9/WAIS
        )
    )
    names(out) <- pnames(icons)
    return(out)
}
icons_gradient_numerical <- function(x){gradientn(icons,x)}
Then
p_icons <- c(NB=0.3, L=0.1, PB=0.2, THC=0.15, OA=0.15, WAIS=0.1)
icons_gradient_theoretical(p_icons)
## NB L PB THC OA WAIS
## -15.995 94.354 12.682 14.427 -33.312 -43.408
icons_gradient_theoretical(p_icons) - icons_gradient_numerical(p_icons)
## NB L PB THC OA WAIS
## -1.7764e-15 -2.8422e-14 0.0000e+00 2.8422e-14 1.4211e-14 0.0000e+00
which also agree to machine precision.
gradientn() and gradient()

Taking the chess dataset we have:
rbind(indep(p_chess), gradient(chess,indep(p_chess)))
## Topalov Anand
## [1,] 0.50 0.3000
## [2,] -23.75 -8.0357
rbind(p_chess,gradientn(chess,p_chess))
## Topalov Anand Karpov
## p_chess 0.5000 0.30 0.200
## -9.4643 6.25 14.286
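The two functions are related: gradient() differentiates with respect to the independent coordinates, the final player's probability being filled up to one minus the sum of the others, so by the chain rule each of its entries equals the corresponding entry of gradientn() minus the final entry. We can check this against the values above:

gradientn(chess,p_chess)[1:2] - gradientn(chess,p_chess)[3]   # should match gradient(chess,indep(p_chess))

[compare \(-9.4643-14.286=-23.75\) and \(6.25-14.286=-8.0357\) with the first display].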
Indeed, because the powers of chess sum to zero, the derivative in the direction parallel to p_chess should vanish, and we can test this:
sum(powers(chess))
## [1] 0
sum(p_chess*gradientn(chess,p_chess))
## [1] -3.9968e-15
which is correct to numerical precision. Further,
sum(powers(icons))
## [1] 0
sum(p_icons*gradientn(icons,p_icons))
## [1] 3.9968e-15
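These are both instances of Euler's theorem for homogeneous functions: scaling p by \(\lambda>0\) changes the log-likelihood by \(\log\lambda\cdot\sum\mathrm{powers}\), which vanishes when the powers sum to zero; differentiating \({\mathcal L}(\lambda p)={\mathcal L}(p)\) with respect to \(\lambda\) at \(\lambda=1\) gives \(\sum_i p_i\,\partial{\mathcal L}/\partial p_i=0\). Reusing the direct evaluation f_chess() defined above:

f_chess(2*p_chess) - f_chess(p_chess)   # zero (up to rounding): scaling leaves the log-likelihood unchanged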
Consider the icons likelihood function: we wonder whether its log-likelihood has a well-defined maximum.
(M <- hessian(icons))
## NB L PB THC OA WAIS
## NB -319.065887 9.8551e+01 8.5074e+01 1.4049e+02 114.534926 -0.0024850
## L 98.551121 -5.8016e+02 1.1724e+02 1.3727e+02 142.418155 -0.0029326
## PB 85.074225 1.1724e+02 -3.9248e+02 1.2979e+02 127.744304 -0.0007062
## THC 140.485292 1.3727e+02 1.2979e+02 -6.2558e+02 95.252161 -0.0086116
## OA 114.534926 1.4242e+02 1.2774e+02 9.5252e+01 -950.377863 0.0030350
## WAIS -0.002485 -2.9326e-03 -7.0620e-04 -8.6116e-03 0.003035 0.0000000
is_ok_hessian(M)
## [1] TRUE
This suggests that a well-defined maximum exists; we may ask how sharp it is:
is_ok_hessian(M,give=TRUE)
## [1] 7.1264e-02 -2.7282e+01 1.8247e+04 -7.3179e+06
suggesting a reasonably sharp constrained maximum point. We now demonstrate that the Hessian matrix of second derivatives is numerically correct. First, the unconstrained problem, working with the first five components. The first step is to evaluate the likelihood, the gradient, and the Hessian at a particular point:
(p <- indep(equalp(icons))) # NB: equal probabilities, not the evaluate (the MLE)
## NB L PB THC OA
## 0.16667 0.16667 0.16667 0.16667 0.16667
(G <- gradient(icons,p))
## [1] 132.0 82.5 118.5 69.0 30.0
(M <- hessian(icons,p,border=FALSE))
## [,1] [,2] [,3] [,4] [,5]
## [1,] -954.00 99.00 99.00 159.75 123.75
## [2,] 99.00 -663.75 101.25 139.50 130.50
## [3,] 99.00 101.25 -879.75 141.75 123.75
## [4,] 159.75 139.50 141.75 -643.50 110.25
## [5,] 123.75 130.50 123.75 110.25 -315.00
eigen(M,TRUE,TRUE)$values
## [1] -104.36 -570.37 -787.75 -954.26 -1039.26
The eigenvalues are all negative, so M is negative-definite: the log-likelihood is locally concave at p (though p, having a nonzero gradient, is not itself a stationary point). Now perturb p slightly and compare the exact change in log-likelihood with its first- and second-order Taylor approximations:
dp <- rnorm(5)*1e-4 # small perturbation
dL <- loglik(p+dp,icons) - loglik(p,icons) # delta loglikelihood (exact)
dL1 <- sum(gradient(icons,p)*dp) # first order approximation
dL2 <- dL1 + t(dp) %*% M %*% dp/2 # second order; the quadratic form is quad.form(M,dp) in the emulator package
c(dL,dL1,dL2)
## [1] 0.039719 0.039761 0.039747
c(first_order_error=dL-dL1,second_order_error=dL-dL2)
## first_order_error second_order_error
## -4.2731e-05 -2.8448e-05
In the above, see how the first-order approximation is pretty good: its error is of \({\mathcal O}(f''(x)\delta x^2)\), which here would be about \(1000\times 10^{-8}=10^{-5}\), pretty much what we observe [the 1000 being the typical size of the elements of the Hessian]. The second-order approximation is better, having a smaller error … but not that much better. I would expect its error to be \({\mathcal O}(f'''(x)\delta x^3)\), but maybe the third-order derivatives are larger than one might expect.
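One way to probe these error estimates is to shrink the perturbation: halving dp should divide the first-order error by about four [\({\mathcal O}(\delta x^2)\)] and the second-order error by about eight [\({\mathcal O}(\delta x^3)\)]. A sketch, reusing p, dp, G, and M from above (results will vary with the random dp):

err <- function(fac){   # errors of the two Taylor approximations at scale fac
    d <- dp*fac
    exact <- loglik(p+d,icons) - loglik(p,icons)
    c(first  = exact - sum(G*d),
      second = exact - sum(G*d) - c(t(d) %*% M %*% d)/2)
}
err(1)/err(0.5)   # expect roughly c(4,8)

We can perform a similar analysis using bordered Hessians: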
dpc <- fillup(dp,icons) # constrained delta p (but see below: fillup() tops up to a total of one)
MC <- hessian(icons,p,border=TRUE)
dLC1 <- sum(gradientn(icons,fillup(p))*dpc) # first order approximation
dLC2 <- dLC1 + t(dpc) %*% MC %*% dpc/2 # second order
c(dL,dLC1,dLC2)
## [1] 0.039719 -71.960239 -71.920507
c(first_order_error=dL-dLC1,second_order_error=dL-dLC2)
## first_order_error second_order_error
## 72.00 71.96
These errors are large, not small: fillup() tops its argument up so that the result sums to one (its default total), so the final element of dpc is close to 1 and dpc is not a small perturbation at all. The first-order term is then dominated by the WAIS component of gradientn(icons,fillup(p)), which is \(-72\) at equal probabilities [\(9\times 6-(15+9+18+8+16+18)\times 6/4=54-126=-72\)], accounting for the error of about 72 seen above. What is needed here is a perturbation summing to zero.
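A zero-sum perturbation needs no extra machinery; here is a sketch of the repaired comparison (no output is claimed, dp being random, and I am assuming that MC may be used as the six-coordinate quadratic form, as in the code above):

dpc0 <- c(dp, -sum(dp))                       # all six coordinates, summing to zero
dLC1 <- sum(gradientn(icons,fillup(p))*dpc0)  # first order approximation
dLC2 <- dLC1 + t(dpc0) %*% MC %*% dpc0/2      # second order
c(dL, dLC1, dLC2)                             # should now agree to within the Taylor error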