gradientn()

We will numerically verify the gradientn() function for two hyper2 objects.
The chess dataset is as follows:
chess
## log(Anand^36 * (Anand + Karpov)^-35 * (Anand + Topalov)^-35 * Karpov^22
## * (Karpov + Topalov)^-18 * Topalov^30)
We will define two gradient functions: chess_gradient_theoretical(), which calculates the partial derivatives using straightforward algebra (specifically elementary calculus, \(\partial(x+y)^n/\partial x=n(x+y)^{n-1}\), whence \(\partial\log\left((x+y)^n\right)/\partial x=n/(x+y)\)), and chess_gradient_numerical(), which uses C code.
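As a quick sanity check on the calculus (base R, not part of the package), D() recovers both forms symbolically:

D(quote((x + y)^n), "x")        # n*(x + y)^(n-1), the rule quoted above
D(quote(log((x + y)^n)), "x")   # simplifies to n/(x + y), the form appearing below

First, the theoretical version: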
chess_gradient_theoretical <- function(x){
    Topalov <- x[1]
    Anand   <- x[2]
    Karpov  <- x[3]
    out <- c(
        Topalov = 30/Topalov -35/(Topalov+Anand) -18/(Topalov+Karpov),
        Anand   = -35/(Topalov+Anand) +36/Anand -35/(Anand+Karpov),
        Karpov  = -18/(Topalov+Karpov) -35/(Anand+Karpov) +22/Karpov
    )
    names(out) <- pnames(chess)
    out
}
chess_gradient_numerical <- function(x){gradientn(chess,x)}
So in the above we have two ways of calculating the gradient: one theoretical and one numerical. Do they agree?
p_chess <- c(
    Topalov = 0.5,
    Anand   = 0.3,
    Karpov  = 0.2
)
chess_gradient_theoretical(p_chess) - chess_gradient_numerical(p_chess)
## Topalov Anand Karpov
## 7.1054e-15 0.0000e+00 -7.1054e-15
Yes, they agree to machine precision.
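As an independent check, we can evaluate the log-likelihood directly from its bracket-power form and difference it numerically. This is a sketch only: it assumes that brackets() returns each bracket as a character vector of player names, and the step size is chosen ad hoc:

f_chess <- function(p){   # direct log-likelihood: sum of power * log(bracket sum)
    sum(powers(chess) * sapply(brackets(chess), function(b) log(sum(p[b]))))
}
fd <- sapply(1:3, function(i){   # central finite differences, step 1e-6
    h <- rep(0,3); h[i] <- 1e-6
    (f_chess(p_chess + h) - f_chess(p_chess - h))/2e-6
})
fd - chess_gradient_theoretical(p_chess)   # should again vanish to rounding error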
The icons dataset is:
icons
## log(L^24 * (L + NB + OA + THC)^-20 * (L + NB + OA + WAIS)^-9 * (L + NB
## + THC + WAIS)^-15 * (L + OA + PB + THC)^-11 * (L + OA + PB + WAIS)^-18
## * (L + PB + THC + WAIS)^-16 * NB^32 * (NB + OA + PB + THC)^-18 * (NB +
## OA + PB + WAIS)^-8 * (NB + PB + THC + WAIS)^-18 * OA^14 * PB^30 *
## THC^24 * WAIS^9)
We can do the same thing but the algebra is more involved:
icons_gradient_theoretical <- function(x){
    NB   <- x[1]
    L    <- x[2]
    PB   <- x[3]
    THC  <- x[4]
    OA   <- x[5]
    WAIS <- x[6]
    out <- c(
        NB = (
            +32/NB
            -20/(NB + L + THC + OA)
            -15/(NB + L + THC + WAIS)
            -09/(NB + L + OA + WAIS)
            -18/(NB + PB + THC + OA)
            -18/(NB + PB + THC + WAIS)
            -08/(NB + PB + OA + WAIS)
        ),
        L = (
            -20/(NB + L + THC + OA)
            -15/(NB + L + THC + WAIS)
            -09/(NB + L + OA + WAIS)
            +24/L
            -11/(L + PB + THC + OA)
            -16/(L + PB + THC + WAIS)
            -18/(L + PB + OA + WAIS)
        ),
        PB = (
            -18/(NB + PB + THC + OA)
            -18/(NB + PB + THC + WAIS)
            -08/(NB + PB + OA + WAIS)
            -11/(L + PB + THC + OA)
            -16/(L + PB + THC + WAIS)
            -18/(L + PB + OA + WAIS)
            +30/PB
        ),
        THC = (
            -20/(NB + L + THC + OA)
            -15/(NB + L + THC + WAIS)
            -18/(NB + PB + THC + OA)
            -18/(NB + PB + THC + WAIS)
            -11/(L + PB + THC + OA)
            -16/(L + PB + THC + WAIS)
            +24/THC
        ),
        OA = (
            -20/(NB + L + THC + OA)
            -09/(NB + L + OA + WAIS)
            -18/(NB + PB + THC + OA)
            -08/(NB + PB + OA + WAIS)
            -11/(L + PB + THC + OA)
            -18/(L + PB + OA + WAIS)
            +14/OA
        ),
        WAIS = (
            -15/(NB + L + THC + WAIS)
            -09/(NB + L + OA + WAIS)
            -18/(NB + PB + THC + WAIS)
            -08/(NB + PB + OA + WAIS)
            -16/(L + PB + THC + WAIS)
            -18/(L + PB + OA + WAIS)
            +9/WAIS
        )
    )
    names(out) <- pnames(icons)
    return(out)
}
icons_gradient_numerical <- function(x){gradientn(icons,x)}
Then
p_icons <- c(NB=0.3, L=0.1, PB=0.2, THC=0.15, OA=0.15, WAIS=0.1)
icons_gradient_theoretical(p_icons)
## NB L PB THC OA WAIS
## -15.995 94.354 12.682 14.427 -33.312 -43.408
icons_gradient_theoretical(p_icons) - icons_gradient_numerical(p_icons)
## NB L PB THC OA WAIS
## -1.7764e-15 -2.8422e-14 0.0000e+00 2.8422e-14 1.4211e-14 0.0000e+00
which also agree to machine precision.
gradientn() and gradient()

Taking the chess dataset we have:
rbind(indep(p_chess), gradient(chess,indep(p_chess)))
## Topalov Anand
## [1,] 0.50 0.3000
## [2,] -23.75 -8.0357
rbind(p_chess,gradientn(chess,p_chess))
## Topalov Anand Karpov
## p_chess 0.5000 0.30 0.200
## -9.4643 6.25 14.286
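The two functions are related: gradient() differentiates with respect to the independent coordinates, the final player's probability being filled up to one minus the sum of the others, so by the chain rule each of its entries equals the corresponding entry of gradientn() minus the final entry. We can check this against the values above:

gradientn(chess,p_chess)[1:2] - gradientn(chess,p_chess)[3]   # should match gradient(chess,indep(p_chess))

[compare \(-9.4643-14.286=-23.75\) and \(6.25-14.286=-8.0357\) with the first display].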
Indeed, because the powers of chess sum to zero, the derivative in the direction parallel to p_chess should vanish, and we can test this:
sum(powers(chess))
## [1] 0
sum(p_chess*gradientn(chess,p_chess))
## [1] -3.9968e-15
which is correct to numerical precision. Further,
sum(powers(icons))
## [1] 0
sum(p_icons*gradientn(icons,p_icons))
## [1] 3.9968e-15
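These are both instances of Euler's theorem for homogeneous functions: scaling p by \(\lambda>0\) changes the log-likelihood by \(\log\lambda\cdot\sum\mathrm{powers}\), which vanishes when the powers sum to zero; differentiating \({\mathcal L}(\lambda p)={\mathcal L}(p)\) with respect to \(\lambda\) at \(\lambda=1\) gives \(\sum_i p_i\,\partial{\mathcal L}/\partial p_i=0\). Reusing the direct evaluation f_chess() defined above:

f_chess(2*p_chess) - f_chess(p_chess)   # zero (up to rounding): scaling leaves the log-likelihood unchanged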
Consider the icons likelihood function: we wonder whether its log-likelihood has a well-defined maximum.
(M <- hessian(icons))
## NB L PB THC OA WAIS
## NB -319.065887 9.8551e+01 8.5074e+01 1.4049e+02 114.534926 -0.0024850
## L 98.551121 -5.8016e+02 1.1724e+02 1.3727e+02 142.418155 -0.0029326
## PB 85.074225 1.1724e+02 -3.9248e+02 1.2979e+02 127.744304 -0.0007062
## THC 140.485292 1.3727e+02 1.2979e+02 -6.2558e+02 95.252161 -0.0086116
## OA 114.534926 1.4242e+02 1.2774e+02 9.5252e+01 -950.377863 0.0030350
## WAIS -0.002485 -2.9326e-03 -7.0620e-04 -8.6116e-03 0.003035 0.0000000
is_ok_hessian(M)
## [1] TRUE
This suggests that a well-defined maximum exists; we may ask how sharp it is:
is_ok_hessian(M,give=TRUE)
## [1] 7.1264e-02 -2.7282e+01 1.8247e+04 -7.3179e+06
suggesting a reasonably sharp constrained maximum point. We now demonstrate that the Hessian matrix of second derivatives is numerically correct. First, the unconstrained problem, working with the first five components. The first step is to evaluate the likelihood, the gradient, and the Hessian at a particular point:
(p <- indep(equalp(icons))) # NB: equal probabilities, not the evaluate (the MLE)
## NB L PB THC OA
## 0.16667 0.16667 0.16667 0.16667 0.16667
(G <- gradient(icons,p))
## [1] 132.0 82.5 118.5 69.0 30.0
(M <- hessian(icons,p,border=FALSE))
## [,1] [,2] [,3] [,4] [,5]
## [1,] -954.00 99.00 99.00 159.75 123.75
## [2,] 99.00 -663.75 101.25 139.50 130.50
## [3,] 99.00 101.25 -879.75 141.75 123.75
## [4,] 159.75 139.50 141.75 -643.50 110.25
## [5,] 123.75 130.50 123.75 110.25 -315.00
eigen(M,TRUE,TRUE)$values
## [1] -104.36 -570.37 -787.75 -954.26 -1039.26
The eigenvalues are all negative, so M is negative-definite: the log-likelihood is locally concave at p (though p, having a nonzero gradient, is not itself a stationary point). Now perturb p slightly and compare the exact change in log-likelihood with its first- and second-order Taylor approximations:
dp <- rnorm(5)*1e-4 # small perturbation
dL <- loglik(p+dp,icons) - loglik(p,icons) # delta loglikelihood (exact)
dL1 <- sum(gradient(icons,p)*dp) # first order approximation
dL2 <- dL1 + t(dp) %*% M %*% dp/2 # second order; the quadratic form is quad.form(M,dp) in the emulator package
c(dL,dL1,dL2)
## [1] 0.039719 0.039761 0.039747
c(first_order_error=dL-dL1,second_order_error=dL-dL2)
## first_order_error second_order_error
## -4.2731e-05 -2.8448e-05
In the above, see how the first-order approximation is pretty good: its error is of \({\mathcal O}(f''(x)\delta x^2)\), which here would be about \(1000\times 10^{-8}=10^{-5}\), pretty much what we observe [the 1000 being the typical size of the elements of the Hessian]. The second-order approximation is better, having a smaller error … but not that much better. I would expect its error to be \({\mathcal O}(f'''(x)\delta x^3)\), but maybe the third-order derivatives are larger than one might expect.
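One way to probe these error estimates is to shrink the perturbation: halving dp should divide the first-order error by about four [\({\mathcal O}(\delta x^2)\)] and the second-order error by about eight [\({\mathcal O}(\delta x^3)\)]. A sketch, reusing p, dp, G, and M from above (results will vary with the random dp):

err <- function(fac){   # errors of the two Taylor approximations at scale fac
    d <- dp*fac
    exact <- loglik(p+d,icons) - loglik(p,icons)
    c(first  = exact - sum(G*d),
      second = exact - sum(G*d) - c(t(d) %*% M %*% d)/2)
}
err(1)/err(0.5)   # expect roughly c(4,8)

We can perform a similar analysis using bordered Hessians: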
dpc <- fillup(dp,icons) # constrained delta p (but see below: fillup() tops up to a total of one)
MC <- hessian(icons,p,border=TRUE)
dLC1 <- sum(gradientn(icons,fillup(p))*dpc) # first order approximation
dLC2 <- dLC1 + t(dpc) %*% MC %*% dpc/2 # second order
c(dL,dLC1,dLC2)
## [1] 0.039719 -71.960239 -71.920507
c(first_order_error=dL-dLC1,second_order_error=dL-dLC2)
## first_order_error second_order_error
## 72.00 71.96
These errors are large, not small: fillup() tops its argument up so that the result sums to one (its default total), so the final element of dpc is close to 1 and dpc is not a small perturbation at all. The first-order term is then dominated by the WAIS component of gradientn(icons,fillup(p)), which is \(-72\) at equal probabilities [\(9\times 6-(15+9+18+8+16+18)\times 6/4=54-126=-72\)], accounting for the error of about 72 seen above. What is needed here is a perturbation summing to zero.
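A zero-sum perturbation needs no extra machinery; here is a sketch of the repaired comparison (no output is claimed, dp being random, and I am assuming that MC may be used as the six-coordinate quadratic form, as in the code above):

dpc0 <- c(dp, -sum(dp))                       # all six coordinates, summing to zero
dLC1 <- sum(gradientn(icons,fillup(p))*dpc0)  # first order approximation
dLC2 <- dLC1 + t(dpc0) %*% MC %*% dpc0/2      # second order
c(dL, dLC1, dLC2)                             # should now agree to within the Taylor error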