Preflib datasets with the hyper2 package: Netflix preferences and the Debian 2002 leader dataset

To cite the hyper2 package in publications, please use
Hankin (2017).
Here I show how to analyse two datasets from the Preflib project: a
Netflix dataset and a Debian leader dataset. Note that the
preftable class does not really work for data of this type, as it
cannot cope with the first entry of each data row being the
number of times that order statistic was observed.
The first dataset is from Netflix, due to Bennett, Lanning, et al. (2007),
previously considered by Turner et al. (2020). It is part of the Preflib
project, available at https://preflib.simonrey.fr/dataset/00004. The
original dataset looks like this:
# FILE NAME: 00004-00000101.soc
# TITLE: Netflix Prize Data
# DESCRIPTION:
# DATA TYPE: soc
# MODIFICATION TYPE: induced
# RELATES TO:
# RELATED FILES:
# PUBLICATION DATE: 2013-08-17
# MODIFICATION DATE: 2022-09-16
# NUMBER ALTERNATIVES: 4
# NUMBER VOTERS: 1256
# NUMBER UNIQUE ORDERS: 24
# ALTERNATIVE NAME 1: The Wedding Planner
# ALTERNATIVE NAME 2: Entrapment
# ALTERNATIVE NAME 3: Lost in Translation
# ALTERNATIVE NAME 4: The Exorcist
228: 4,3,2,1
220: 3,4,2,1
169: 4,2,1,3
98: 4,2,3,1
78: 4,1,2,3
64: 4,3,1,2
63: 2,4,1,3
62: 3,4,1,2
47: 2,1,4,3
41: 1,2,4,3
28: 3,2,4,1
26: 1,4,2,3
23: 2,1,3,4
20: 3,2,1,4
16: 4,1,3,2
15: 1,2,3,4
14: 2,4,3,1
14: 3,1,2,4
10: 3,1,4,2
9: 2,3,1,4
4: 2,3,4,1
4: 1,4,3,2
2: 1,3,2,4
1: 1,3,4,2
It is not in a form amenable to the high-level built-in functions of
the hyper2 package (such as wikitable_to_ranktable()); here I show
how to create a Plackett-Luce likelihood function for the data from
scratch.
films <- c("TWP","En","LiT","TE")
M <- matrix(c(
4,3,2,1,3,4,2,1,4,2,1,3,4,2,3,1,4,1,2,3,4,3,1,2,2,4,1,3,
3,4,1,2,2,1,4,3,1,2,4,3,3,2,4,1,1,4,2,3,2,1,3,4,3,2,1,4,
4,1,3,2,1,2,3,4,2,4,3,1,3,1,2,4,3,1,4,2,2,3,1,4,2,3,4,1,
1,4,3,2,1,3,2,4,1,3,4,2),byrow=TRUE,ncol=4)
n <- c(228,220,169,98,78,64,63,62,47,41,28,26,23,20,16,15,14,14,10,9,4,4,2,1)
head(M)
## [,1] [,2] [,3] [,4]
## [1,] 4 3 2 1
## [2,] 3 4 2 1
## [3,] 4 2 1 3
## [4,] 4 2 3 1
## [5,] 4 1 2 3
## [6,] 4 3 1 2
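Typing the matrix by hand is error-prone. As a minimal sketch, the counts and orderings could instead be parsed from the data rows of the file. The names soc_lines, n_parsed and M_parsed are hypothetical and only the first three rows are used here; for the full file one would presumably read it with readLines() and drop the lines starting with #.

```r
# First three data rows of 00004-00000101.soc, copied from the listing above
soc_lines <- c("228: 4,3,2,1", "220: 3,4,2,1", "169: 4,2,1,3")

# Split each row at the colon: count on the left, ordering on the right
parts <- strsplit(soc_lines, ":", fixed = TRUE)
n_parsed <- as.integer(sapply(parts, `[`, 1))

# Turn "4,3,2,1" into an integer vector, then stack the vectors as rows
orders <- lapply(parts, function(x) {
  as.integer(strsplit(trimws(x[2]), ",", fixed = TRUE)[[1]])
})
M_parsed <- do.call(rbind, orders)

n_parsed
M_parsed
```

The resulting n_parsed and M_parsed play the same role as n and M above.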
Now use race() to convert to a likelihood function:
H <- hyper2()
for(i in seq_along(n)){
H <- H + race(films[M[i,]])*n[i]
}
H
## log(En^1099 * (En + LiT)^-124 * (En + LiT + TE)^-89 * (En + LiT + TE + TWP)^-1256 * (En + LiT + TWP)^-653 *
## (En + TE)^-27 * (En + TE + TWP)^-354 * (En + TWP)^-574 * LiT^832 * (LiT + TE)^-126 * (LiT + TE + TWP)^-160
## * (LiT + TWP)^-344 * TE^1173 * (TE + TWP)^-61 * TWP^664)
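The powers of the single-film terms can be checked by hand, which gives some insight into what race() accumulates. In a full ranking each film appears exactly once as a numerator, and the final factor p/p cancels for whichever film is ranked last, so the power of a film should be the number of voters minus the number who ranked it last. The following base R sketch (the names last and singleton_power are mine, not part of hyper2) checks this against the printed H:

```r
films <- c("TWP", "En", "LiT", "TE")
M <- matrix(c(
  4,3,2,1,3,4,2,1,4,2,1,3,4,2,3,1,4,1,2,3,4,3,1,2,2,4,1,3,
  3,4,1,2,2,1,4,3,1,2,4,3,3,2,4,1,1,4,2,3,2,1,3,4,3,2,1,4,
  4,1,3,2,1,2,3,4,2,4,3,1,3,1,2,4,3,1,4,2,2,3,1,4,2,3,4,1,
  1,4,3,2,1,3,2,4,1,3,4,2), byrow = TRUE, ncol = 4)
n <- c(228,220,169,98,78,64,63,62,47,41,28,26,23,20,16,15,14,14,10,9,4,4,2,1)

# Number of voters who ranked each film last (column 4 of M)
last <- tapply(n, factor(films[M[, 4]], levels = films), sum)

# One numerator per voter, minus the cancelled last-place factors
singleton_power <- sum(n) - last
singleton_power
```

These values agree with the powers of TWP, En, LiT and TE in the expression for H.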
Then we can perform the standard analyses:
maxp(H)
## En LiT TE TWP
## 0.20665 0.18121 0.50786 0.10428
equalp.test(H)
##
## Constrained support maximization
##
## data: H
## null hypothesis: En = LiT = TE = TWP
## null estimate:
## En LiT TE TWP
## 0.25 0.25 0.25 0.25
## (argmax, constrained optimization)
## Support for null: -3991.6 + K
##
## alternative hypothesis: sum p_i=1
## alternative estimate:
## En LiT TE TWP
## 0.20665 0.18121 0.50786 0.10428
## (argmax, free optimization)
## Support for alternative: -3564.1 + K
##
## degrees of freedom: 3
## support difference = 427.55
## p-value: 4.8748e-185
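Two of the reported figures can be verified without hyper2. Under the null, all 24 orderings of four films are equally likely, so each of the 1256 voters contributes log(1/24) to the support. The p-value is consistent with Wilks's theorem, that is, twice the support difference referred to a chi-square distribution with 3 degrees of freedom; that equalp.test() computes it this way is my assumption, based on the numbers agreeing. The variable names below are hypothetical.

```r
n_voters <- 1256

# Support under the null: every ranking of 4 films has probability 1/4!
support_null <- n_voters * log(1 / factorial(4))
support_null

# Support difference as printed above
support_diff <- 427.55

# Asymptotic p-value, assuming twice the support difference is chi-square
p_value <- pchisq(2 * support_diff, df = 3, lower.tail = FALSE)
p_value
```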
The second dataset is in a slightly different format:
# FILE NAME: 00018-00000002.soi
[snip]
# ALTERNATIVE NAME 1: "Nancy Bernard"
# ALTERNATIVE NAME 2: "John Butler"
# ALTERNATIVE NAME 3: "John Erwin"
# ALTERNATIVE NAME 4: "Bob Fine"
# ALTERNATIVE NAME 5: "Mary Merrill Anderson"
# ALTERNATIVE NAME 6: "Tom Nordyke"
# ALTERNATIVE NAME 7: "David Wahstedt"
# ALTERNATIVE NAME 8: "Annie Young"
# ALTERNATIVE NAME 9: "Write In"
3761: 4
2065: 8
1570: 7
1484: 5
1104: 3
1095: 3,8,6
947: 6
735: 1
615: 3,5,6
[snip]
1: 9,3,5
1: 9,7,2
1: 2,9
1: 9,3
1: 9,7
1: 9,5
The inferences from this dataset are different, as the voters were
allowed a maximum of three candidates. Thus the first row of data,
viz 3761: 4, indicates that 3761 voters registered candidate number
4 [who was Bob Fine] as their first (and only) choice. A few lines
down we see 1095: 3,8,6 which means that 1095 voters registered
candidate number 3 as their favourite, followed by number 8, then
number 6. A few voters included four candidates on their return.
Dealing with this in hyper2 is possible but messy. In a hidden
chunk we convert the data to a list and show the first three elements:
n[1:3]
## [1] 3761 2065 1570
L[1:3]
## [[1]]
## [1] "Fine"
##
## [[2]]
## [1] "Young"
##
## [[3]]
## [1] "Wahstedt"
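The hidden chunk is not reproduced here. As a rough sketch of how such a conversion might be written in base R, the following parses a few data rows into a count vector and a list of names. The objects full_names, candidates, soi_lines, n_soi and L_soi are hypothetical, and shortening each name to its last word is an assumption inferred from the printed output.

```r
# Candidate names from the header of 00018-00000002.soi
full_names <- c("Nancy Bernard", "John Butler", "John Erwin", "Bob Fine",
                "Mary Merrill Anderson", "Tom Nordyke", "David Wahstedt",
                "Annie Young", "Write In")

# Keep only the last word of each name (an assumption; see lead-in)
candidates <- sub("^.* ", "", full_names)

# First six data rows, copied from the listing above
soi_lines <- c("3761: 4", "2065: 8", "1570: 7", "1484: 5", "1104: 3",
               "1095: 3,8,6")

# Count on the left of the colon, ranked candidate numbers on the right
parts <- strsplit(soi_lines, ":", fixed = TRUE)
n_soi <- as.integer(sapply(parts, `[`, 1))
L_soi <- lapply(parts, function(x) {
  candidates[as.integer(strsplit(trimws(x[2]), ",", fixed = TRUE)[[1]])]
})

n_soi[1:3]
L_soi[[6]]
```

Because the rows have different lengths, a list is a more natural container here than the matrix used for the Netflix data.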
Now to convert it to a likelihood function we need to treat each
vector as the finishers of a race; elements not present are considered
to be “non finishers”, as in Formula 1. Thus c(2,6,3) corresponds
to
\[ \frac{p_2}{p_1+p_2+p_3+p_4+p_5+p_6+p_7+p_8+p_9}\cdot \frac{p_6}{p_1+ p_3+p_4+p_5+p_6+p_7+p_8+p_9}\cdot \frac{p_3}{p_1+ p_3+p_4+p_5+ p_7+p_8+p_9} \]
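To make the formula concrete, here is a small base R sketch that evaluates the probability of a truncated ranking for given strengths: each chosen candidate is divided by the total strength of everyone not yet chosen, with unranked candidates staying in every denominator. The function truncated_prob() is a hypothetical helper for illustration, not part of hyper2.

```r
# Plackett-Luce probability of a truncated ranking.
# ranked: indices of the ranked candidates, best first
# p:      strengths of all candidates (ranked and unranked)
truncated_prob <- function(ranked, p) {
  remaining <- seq_along(p)      # candidates still in the running
  prob <- 1
  for (k in ranked) {
    prob <- prob * p[k] / sum(p[remaining])
    remaining <- setdiff(remaining, k)   # k has now been placed
  }
  prob
}

# With nine equal strengths this is (1/9) * (1/8) * (1/7)
truncated_prob(c(2, 6, 3), rep(1/9, 9))
```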
This is a straightforward loop:
H <- hyper2()
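for(i in seq_along(n)){
  # nonfinishers are the candidates NOT on the ballot; this assumes names
  # (from the hidden chunk) holds the names of all nine candidates
  H <- H + n[i]*race(L[[i]],nonfinishers=setdiff(names,L[[i]]))
}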
Then apply the usual tests to it:
maxp(H)
## Anderson Bernard Butler Erwin Fine In Nordyke Wahstedt Young
## 0.116473 0.089127 0.094806 0.161706 0.163815 0.033773 0.090129 0.120139 0.130032
equalp.test(H)
##
## Constrained support maximization
##
## data: H
## null hypothesis: Anderson = Bernard = Butler = Erwin = Fine = In = Nordyke = Wahstedt = Young
## null estimate:
## Anderson Bernard Butler Erwin Fine In Nordyke Wahstedt Young
## 0.11111 0.11111 0.11111 0.11111 0.11111 0.11111 0.11111 0.11111 0.11111
## (argmax, constrained optimization)
## Support for null: -38459 + K
##
## alternative hypothesis: sum p_i=1
## alternative estimate:
## Anderson Bernard Butler Erwin Fine In Nordyke Wahstedt Young
## 0.116473 0.089127 0.094806 0.161706 0.163815 0.033773 0.090129 0.120139 0.130032
## (argmax, free optimization)
## Support for alternative: -37650 + K
##
## degrees of freedom: 8
## support difference = 809.91
## p-value: 0
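Hankin, R. K. S. 2017. “Partial Rank Data with the hyper2 Package: Likelihood Functions for Generalized Bradley-Terry Models.” The R Journal 9 (2): 429–39.
Turner, H. L., J. van Etten, D. Firth, and I. Kosmidis. 2020. “Modelling Rankings in R: The PlackettLuce Package.” Computational Statistics 35: 1027–57. https://doi.org/10.1007/s00180-020-00959-3.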