To cite the hyper2 package in publications, please use Hankin (2017).

Here I show how to analyse two datasets from the Preflib project: a Netflix dataset and a Debian leader dataset. Note that the preftable class does not really work for data of this type, as it cannot cope with the first entry of each data row being the number of times that order statistic was observed.

The first dataset is from Netflix, due to Bennett et al. (2007), and was previously considered by Turner et al. (2020). It is part of the Preflib project and is available at https://preflib.simonrey.fr/dataset/00004. The original dataset looks like this:

# FILE NAME: 00004-00000101.soc
# TITLE: Netflix Prize Data
# DESCRIPTION: 
# DATA TYPE: soc
# MODIFICATION TYPE: induced
# RELATES TO: 
# RELATED FILES: 
# PUBLICATION DATE: 2013-08-17
# MODIFICATION DATE: 2022-09-16
# NUMBER ALTERNATIVES: 4
# NUMBER VOTERS: 1256
# NUMBER UNIQUE ORDERS: 24
# ALTERNATIVE NAME 1: The Wedding Planner
# ALTERNATIVE NAME 2: Entrapment
# ALTERNATIVE NAME 3: Lost in Translation
# ALTERNATIVE NAME 4: The Exorcist
228: 4,3,2,1
220: 3,4,2,1
169: 4,2,1,3
98: 4,2,3,1
78: 4,1,2,3
64: 4,3,1,2
63: 2,4,1,3
62: 3,4,1,2
47: 2,1,4,3
41: 1,2,4,3
28: 3,2,4,1
26: 1,4,2,3
23: 2,1,3,4
20: 3,2,1,4
16: 4,1,3,2
15: 1,2,3,4
14: 2,4,3,1
14: 3,1,2,4
10: 3,1,4,2
9: 2,3,1,4
4: 2,3,4,1
4: 1,4,3,2
2: 1,3,2,4
1: 1,3,4,2

The file is not in a form amenable to the high-level built-in functions of the hyper2 package (such as wikitable_to_ranktable()); here I show how to create a Plackett-Luce likelihood function for the data from scratch.

films <- c("TWP","En","LiT","TE")
M <- matrix(c(
4,3,2,1,3,4,2,1,4,2,1,3,4,2,3,1,4,1,2,3,4,3,1,2,2,4,1,3,
3,4,1,2,2,1,4,3,1,2,4,3,3,2,4,1,1,4,2,3,2,1,3,4,3,2,1,4,
4,1,3,2,1,2,3,4,2,4,3,1,3,1,2,4,3,1,4,2,2,3,1,4,2,3,4,1,
1,4,3,2,1,3,2,4,1,3,4,2),byrow=TRUE,ncol=4)
n <- c(228,220,169,98,78,64,63,62,47,41,28,26,23,20,16,15,14,14,10,9,4,4,2,1)
head(M)

##      [,1] [,2] [,3] [,4]
## [1,]    4    3    2    1
## [2,]    3    4    2    1
## [3,]    4    2    1    3
## [4,]    4    2    3    1
## [5,]    4    1    2    3
## [6,]    4    3    1    2

Now use race() to convert to a likelihood function:

H <- hyper2()
for(i in seq_along(n)){
    H <- H + race(films[M[i,]])*n[i]
}
H

## log(En^1099 * (En + LiT)^-124 * (En + LiT + TE)^-89 * (En + LiT + TE + TWP)^-1256 * (En + LiT + TWP)^-653 *
## (En + TE)^-27 * (En + TE + TWP)^-354 * (En + TWP)^-574 * LiT^832 * (LiT + TE)^-126 * (LiT + TE + TWP)^-160
## * (LiT + TWP)^-344 * TE^1173 * (TE + TWP)^-61 * TWP^664)

Then we can perform the standard analyses:

maxp(H)

##      En     LiT      TE     TWP 
## 0.20665 0.18121 0.50786 0.10428
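As a cross-check on H, the same Plackett-Luce log-likelihood can be evaluated directly in base R, without hyper2. The sketch below is illustrative only: the helpers order_loglik() and pl_loglik() are hypothetical names, not part of the package, and strengths are indexed in the file's order (TWP, En, LiT, TE). At equal strengths every complete order of four films has probability 1/24, so the log-likelihood should be -1256 log(24), about -3991.6.

```r
# Observed orders (one row per distinct order) and their multiplicities,
# copied from the data file shown earlier.
M <- matrix(c(
4,3,2,1,3,4,2,1,4,2,1,3,4,2,3,1,4,1,2,3,4,3,1,2,2,4,1,3,
3,4,1,2,2,1,4,3,1,2,4,3,3,2,4,1,1,4,2,3,2,1,3,4,3,2,1,4,
4,1,3,2,1,2,3,4,2,4,3,1,3,1,2,4,3,1,4,2,2,3,1,4,2,3,4,1,
1,4,3,2,1,3,2,4,1,3,4,2),byrow=TRUE,ncol=4)
n <- c(228,220,169,98,78,64,63,62,47,41,28,26,23,20,16,15,14,14,10,9,4,4,2,1)

# Log-probability of one complete order: at each stage the chosen item's
# strength is divided by the total strength of the items not yet placed.
order_loglik <- function(ord, p){
    s <- p[ord]
    remaining <- rev(cumsum(rev(s)))   # remaining[j] = s[j] + ... + s[length(s)]
    sum(log(s) - log(remaining))
}

# Total log-likelihood: each distinct order weighted by its count.
pl_loglik <- function(p, M, n){
    sum(n * apply(M, 1, order_loglik, p=p))
}

ll_equal <- pl_loglik(rep(1/4, 4), M, n)

# Rounded estimate from maxp(H), reordered to match the file's numbering.
p_hat <- c(TWP=0.10428, En=0.20665, LiT=0.18121, TE=0.50786)
ll_hat <- pl_loglik(p_hat, M, n)

ll_equal
ll_hat - ll_equal
```

The difference between the two values is roughly 427.5 log-likelihood units in favour of the fitted strengths.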

equalp.test(H)

## 
##  Constrained support maximization
## 
## data:  H
## null hypothesis: En = LiT = TE = TWP
## null estimate:
##   En  LiT   TE  TWP 
## 0.25 0.25 0.25 0.25 
## (argmax, constrained optimization)
## Support for null:  -3991.6 + K
## 
## alternative hypothesis:  sum p_i=1 
## alternative estimate:
##      En     LiT      TE     TWP 
## 0.20665 0.18121 0.50786 0.10428 
## (argmax, free optimization)
## Support for alternative:  -3564.1 + K
## 
## degrees of freedom: 3
## support difference = 427.55
## p-value: 4.8748e-185
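
The printed p-value is consistent with the usual asymptotic argument in which twice the support difference is referred to a chi-square distribution with 3 degrees of freedom. The base R lines below are a sketch under that assumption; they are not taken from the equalp.test() source, and the variable names are illustrative.

```r
# Assumption: twice the support difference is asymptotically chi-square,
# with degrees of freedom equal to the number of constraints tested (3).
support_difference <- 427.55   # rounded value from the output above
p_value <- pchisq(2 * support_difference, df = 3, lower.tail = FALSE)
p_value
```

This gives a value of order 1e-185; any small discrepancy from the printed figure comes from using the rounded support difference.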

0.1 Debian leader dataset

This dataset is in a slightly different format:

# FILE NAME: 00018-00000002.soi
[snip]
# ALTERNATIVE NAME 1: "Nancy Bernard"
# ALTERNATIVE NAME 2: "John Butler"
# ALTERNATIVE NAME 3: "John Erwin"
# ALTERNATIVE NAME 4: "Bob Fine"
# ALTERNATIVE NAME 5: "Mary Merrill Anderson"
# ALTERNATIVE NAME 6: "Tom Nordyke"
# ALTERNATIVE NAME 7: "David Wahstedt"
# ALTERNATIVE NAME 8: "Annie Young"
# ALTERNATIVE NAME 9: "Write In"
3761: 4
2065: 8
1570: 7
1484: 5
1104: 3
1095: 3,8,6
947: 6
735: 1
615: 3,5,6
[snip]
1: 9,3,5
1: 9,7,2
1: 2,9
1: 9,3
1: 9,7
1: 9,5

The inferences from this dataset are different, as voters were allowed to list a maximum of three candidates. Thus the first row of data, viz. 3761: 4, indicates that 3761 voters registered candidate number 4 (Bob Fine) as their first and only choice. A few lines down we see 1095: 3,8,6, which means that 1095 voters registered candidate number 3 as their favourite, followed by number 8, then number 6. A few voters included four candidates on their return. Dealing with this in hyper2 is possible but messy. In a hidden chunk we convert the data to a list L with corresponding counts n, and show the first three elements of each:

n[1:3]

## [1] 3761 2065 1570

L[1:3]

## [[1]]
## [1] "Fine"
## 
## [[2]]
## [1] "Young"
## 
## [[3]]
## [1] "Wahstedt"
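
The hidden chunk is not shown, so the following is only a guess at one way such a conversion could be done in base R. It assumes the header and vote lines have already been read into character vectors (here a few lines are typed in by hand), and it labels each candidate by the last word of their name, which matches the labels printed above. The variable names header, votes and candidates are hypothetical.

```r
# A few lines from the file, typed in by hand for illustration.
header <- c(
  '# ALTERNATIVE NAME 1: "Nancy Bernard"',
  '# ALTERNATIVE NAME 2: "John Butler"',
  '# ALTERNATIVE NAME 3: "John Erwin"',
  '# ALTERNATIVE NAME 4: "Bob Fine"',
  '# ALTERNATIVE NAME 5: "Mary Merrill Anderson"',
  '# ALTERNATIVE NAME 6: "Tom Nordyke"',
  '# ALTERNATIVE NAME 7: "David Wahstedt"',
  '# ALTERNATIVE NAME 8: "Annie Young"',
  '# ALTERNATIVE NAME 9: "Write In"')
votes <- c("3761: 4", "2065: 8", "1570: 7", "1095: 3,8,6")

# Strip the header prefix and the quotes, then keep the last word as a label.
full_names <- gsub('"', "", sub("^# ALTERNATIVE NAME [0-9]+: ", "", header))
candidates <- sub("^.* ", "", full_names)

# The count is everything before the colon; the ballot is everything after it.
n <- as.integer(sub(":.*$", "", votes))
ballots <- strsplit(sub("^[^:]*: *", "", votes), ",")
L <- lapply(ballots, function(v) candidates[as.integer(v)])

n
L
```

Here candidates plays the role that the vector called names appears to play in the loop further down.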

Now, to convert it to a likelihood function, we need to treat each vector as the finishers of a race; elements not present are considered to be “non-finishers”, as in Formula 1. Thus c(2,6,3) corresponds to

\[ \frac{p_2}{p_1+p_2+p_3+p_4+p_5+p_6+p_7+p_8+p_9}\cdot \frac{p_6}{p_1+ p_3+p_4+p_5+p_6+p_7+p_8+p_9}\cdot \frac{p_3}{p_1+ p_3+p_4+p_5+ p_7+p_8+p_9} \]
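
The expression above can be evaluated numerically with a few lines of base R. This is a minimal sketch, independent of hyper2; partial_order_prob() is a hypothetical helper, and the equal strengths are chosen only to make the arithmetic easy to check by hand.

```r
# Probability of a partial order under Plackett-Luce: the listed items are
# placed in order, and unlisted items stay in every denominator.
partial_order_prob <- function(ord, p){
    prob <- 1
    remaining <- seq_along(p)        # everyone is still available at the start
    for(k in ord){
        prob <- prob * p[k] / sum(p[remaining])
        remaining <- setdiff(remaining, k)
    }
    prob
}

p <- rep(1/9, 9)                     # nine equally strong candidates
partial_order_prob(c(2, 6, 3), p)    # (1/9) * (1/8) * (1/7) = 1/504
```

With equal strengths the three factors are 1/9, 1/8 and 1/7, because one candidate leaves the denominator at each stage.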

Building the likelihood function is a straightforward loop:

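H <- hyper2()
for(i in seq_along(n)){
    # nonfinishers: every candidate not listed on ballot type i
    # (names is assumed to hold all nine candidate labels, set up in the hidden chunk)
    H <- H + n[i]*race(L[[i]],nonfinishers=setdiff(names,L[[i]]))
}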

We then apply the usual tests to H:

maxp(H)

## Anderson  Bernard   Butler    Erwin     Fine       In  Nordyke Wahstedt    Young 
## 0.116476 0.089129 0.094808 0.161712 0.163811 0.033764 0.090131 0.120139 0.130030

equalp.test(H)

## 
##  Constrained support maximization
## 
## data:  H
## null hypothesis: Anderson = Bernard = Butler = Erwin = Fine = In = Nordyke = Wahstedt = Young
## null estimate:
## Anderson  Bernard   Butler    Erwin     Fine       In  Nordyke Wahstedt    Young 
##  0.11111  0.11111  0.11111  0.11111  0.11111  0.11111  0.11111  0.11111  0.11111 
## (argmax, constrained optimization)
## Support for null:  -38459 + K
## 
## alternative hypothesis:  sum p_i=1 
## alternative estimate:
## Anderson  Bernard   Butler    Erwin     Fine       In  Nordyke Wahstedt    Young 
## 0.116476 0.089129 0.094808 0.161712 0.163811 0.033764 0.090131 0.120139 0.130030 
## (argmax, free optimization)
## Support for alternative:  -37650 + K
## 
## degrees of freedom: 8
## support difference = 809.91
## p-value: 0
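
The p-value most likely prints as exactly 0 because the tail probability is smaller than the smallest positive double, not because it is truly zero. Assuming again a chi-square reference distribution for twice the support difference (an assumption, not something read from the equalp.test() source), base R can report the tail probability on the log scale:

```r
# Assumption: twice the support difference is asymptotically chi-square
# with 8 degrees of freedom (nine strengths, one sum-to-one constraint).
support_difference <- 809.91   # rounded value from the output above
p_value <- pchisq(2 * support_difference, df = 8, lower.tail = FALSE)
log_p   <- pchisq(2 * support_difference, df = 8, lower.tail = FALSE, log.p = TRUE)

p_value            # underflows in double precision
log_p / log(10)    # base-10 logarithm of the tail probability
```

On the log scale the result is finite, corresponding to a p-value of roughly 10^-344.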

References

Bennett, James, Stan Lanning, et al. 2007. “The Netflix Prize.” Proceedings of KDD Cup and Workshop, 35.

Hankin, R. K. S. 2017. “Partial Rank Data with the hyper2 Package: Likelihood Functions for Generalized Bradley-Terry Models.” The R Journal 9 (2): 429–39.

Turner, H. L., Jacob van Etten, David Firth, and Ioannis Kosmidis. 2020. “Modelling Rankings in R: The PlackettLuce Package.” Computational Statistics 35: 1027–57. https://doi.org/10.1007/s00180-020-00959-3.

Analysing `preflib` datasets with the `hyper2` package: netflix preferences and the Debian 2002 leader dataset

R. K. S. Hankin
