(takes a “long time” to run without cache)
To cite the hyper2 package in publications, please use Hankin (2017).
This file creates objects jester and maxjest, which are datasets
available in the hyper2 package, and documented at jester.Rd.
This file takes quite a long time to run. Goldberg et al present a
dataset in which respondents rated a number of jokes. Here, I analyse
a small portion of this dataset using the hyper2 package. This
document is intended to illustrate an extremely challenging
application of the hyper2 package and (without cache) takes a long
time to process. Goldberg’s dataset has 24938 lines, one per
respondent, and 101 columns, one per joke (the first column shows the
number of jokes rated by each respondent); here I use 150 lines and 99
jokes (the 100th joke was not funny).
library("hyper2",quietly=TRUE)
a <- read.csv("jester-data-3.csv",head=FALSE) # File is 150 lines only
a <- as.matrix(a[,-c(1,100)])
a[a==99] <- NA
colnames(a) <- paste("joke",sprintf("%02d",seq_len(ncol(a))),sep="")
a[1,]
## joke01 joke02 joke03 joke04 joke05 joke06 joke07 joke08 joke09 joke10 joke11
## NA NA NA NA -1.65 NA -0.78 6.89 NA NA NA
## joke12 joke13 joke14 joke15 joke16 joke17 joke18 joke19 joke20 joke21 joke22
## NA -2.57 NA -1.31 -0.19 -5.97 2.96 -0.29 1.17 NA NA
## joke23 joke24 joke25 joke26 joke27 joke28 joke29 joke30 joke31 joke32 joke33
## NA NA NA NA 1.55 NA -2.23 NA 0.15 6.26 NA
## joke34 joke35 joke36 joke37 joke38 joke39 joke40 joke41 joke42 joke43 joke44
## NA 1.26 1.26 NA NA NA NA NA NA -7.52 -5.87
## joke45 joke46 joke47 joke48 joke49 joke50 joke51 joke52 joke53 joke54 joke55
## NA NA NA -8.20 NA 4.42 NA NA -3.98 NA NA
## joke56 joke57 joke58 joke59 joke60 joke61 joke62 joke63 joke64 joke65 joke66
## NA NA NA NA NA NA 3.50 NA NA NA -2.14
## joke67 joke68 joke69 joke70 joke71 joke72 joke73 joke74 joke75 joke76 joke77
## 2.23 -2.91 NA NA NA NA NA 1.36 NA NA NA
## joke78 joke79 joke80 joke81 joke82 joke83 joke84 joke85 joke86 joke87 joke88
## NA NA NA NA NA NA NA NA NA NA NA
## joke89 joke90 joke91 joke92 joke93 joke94 joke95 joke96 joke97 joke98 joke99
## NA NA NA NA NA NA NA NA NA NA NA
Row 1 of a is displayed (most entries are NA, signifying that
respondent 1 did not rank that particular joke). It shows that the
first respondent rated joke 5 at -1.65, joke 7 at -0.78, and so on.
We can perform some summary statistics of a:
plot(rowSums(!is.na(a)),xlab="respondent")
The above plot shows how many jokes each of the 100 respondents rated.
plot(colSums(!is.na(a)),xlab="joke index")
The above shows how many respondents rated each joke. It would make sense to remove the jokes that were not rated:
dim(a)
## [1] 150 99
a <- a[,colSums(!is.na(a))>1]
dim(a)
## [1] 150 91
showing that 91 jokes were rated by at least one respondent. We need to transform the dataset:
f <- function(x){
x <- x[!is.na(x)]
x[order(x,decreasing=TRUE)] <- seq_along(x)
return(x)
}
f(a[1,])
## joke05 joke07 joke08 joke13 joke15 joke16 joke17 joke18 joke19 joke20 joke27
## 17 15 1 20 16 13 24 5 14 11 7
## joke29 joke31 joke32 joke35 joke36 joke43 joke44 joke48 joke50 joke53 joke62
## 19 12 2 9 10 25 23 26 3 22 4
## joke66 joke67 joke68 joke74
## 18 6 21 8
Thus we see that this respondent rated joke08 to be the funniest, having rank 1.
jester <- hyper2()
system.time(for(i in seq_len(nrow(a))){jester <- jester + suppfun(f(a[i,]))})
## user system elapsed
## 93.456 0.004 93.469
now
system.time(jester_maxp <- maxp(jester,n=1))
## user system elapsed
## 319.647 0.009 319.687
and
jester_maxp
## joke01 joke02 joke05 joke07 joke08 joke10 joke11 joke13
## 0.0142787 0.0103028 0.0083114 0.0066428 0.0046864 0.0029662 0.0108968 0.0046080
## joke14 joke15 joke16 joke17 joke18 joke19 joke20 joke21
## 0.0060809 0.0052923 0.0039105 0.0054111 0.0072861 0.0075893 0.0061186 0.0164679
## joke22 joke23 joke24 joke25 joke26 joke27 joke28 joke29
## 0.0448125 0.0066555 0.0144195 0.0084493 0.0145402 0.0145014 0.0106450 0.0189412
## joke31 joke32 joke34 joke35 joke36 joke37 joke38 joke39
## 0.0127088 0.0148353 0.0129720 0.0201269 0.0206969 0.0050762 0.0158294 0.0117543
## joke40 joke41 joke42 joke43 joke44 joke45 joke46 joke47
## 0.0095550 0.0057632 0.0112977 0.0076500 0.0049391 0.0070565 0.0133322 0.0080489
## joke48 joke49 joke50 joke51 joke52 joke53 joke54 joke55
## 0.0103990 0.0163925 0.0226517 0.0114101 0.0088278 0.0183375 0.0105215 0.0154453
## joke56 joke57 joke58 joke59 joke60 joke61 joke62 joke63
## 0.0131216 0.0049063 0.0054297 0.0107177 0.0071531 0.0159573 0.0159352 0.0132457
## joke64 joke65 joke66 joke67 joke68 joke69 joke70 joke71
## 0.0056264 0.0136657 0.0151815 0.0072871 0.0123982 0.0135279 0.0065229 0.0061855
## joke72 joke73 joke74 joke75 joke76 joke77 joke78 joke79
## 0.0180365 0.0103434 0.0059028 0.0082545 0.0070830 0.0109528 0.0129611 0.0035139
## joke80 joke81 joke82 joke83 joke84 joke85 joke86 joke87
## 0.0170436 0.0098551 0.0140656 0.0060766 0.0050318 0.0121068 0.0044101 0.0082476
## joke88 joke89 joke90 joke91 joke92 joke93 joke94 joke95
## 0.0100654 0.0187792 0.0144267 0.0092513 0.0034361 0.0137616 0.0166520 0.0086796
## joke96 joke97 joke98
## 0.0099924 0.0150308 0.0097355
plot(jester_maxp)
equalp.test(jester,startp=indep(jester_maxp))
##
## Constrained support maximization
##
## data: jester
## null hypothesis: joke01 = joke02 = joke05 = joke07 = joke08 = joke10 = joke11 = joke13 = joke14 = joke15 = joke16 = joke17 = joke18 = joke19 = joke20 = joke21 = joke22 = joke23 = joke24 = joke25 = joke26 = joke27 = joke28 = joke29 = joke31 = joke32 = joke34 = joke35 = joke36 = joke37 = joke38 = joke39 = joke40 = joke41 = joke42 = joke43 = joke44 = joke45 = joke46 = joke47 = joke48 = joke49 = joke50 = joke51 = joke52 = joke53 = joke54 = joke55 = joke56 = joke57 = joke58 = joke59 = joke60 = joke61 = joke62 = joke63 = joke64 = joke65 = joke66 = joke67 = joke68 = joke69 = joke70 = joke71 = joke72 = joke73 = joke74 = joke75 = joke76 = joke77 = joke78 = joke79 = joke80 = joke81 = joke82 = joke83 = joke84 = joke85 = joke86 = joke87 = joke88 = joke89 = joke90 = joke91 = joke92 = joke93 = joke94 = joke95 = joke96 = joke97 = joke98
## null estimate:
## joke01 joke02 joke05 joke07 joke08 joke10 joke11 joke13
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke14 joke15 joke16 joke17 joke18 joke19 joke20 joke21
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke22 joke23 joke24 joke25 joke26 joke27 joke28 joke29
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke31 joke32 joke34 joke35 joke36 joke37 joke38 joke39
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke40 joke41 joke42 joke43 joke44 joke45 joke46 joke47
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke48 joke49 joke50 joke51 joke52 joke53 joke54 joke55
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke56 joke57 joke58 joke59 joke60 joke61 joke62 joke63
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke64 joke65 joke66 joke67 joke68 joke69 joke70 joke71
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke72 joke73 joke74 joke75 joke76 joke77 joke78 joke79
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke80 joke81 joke82 joke83 joke84 joke85 joke86 joke87
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke88 joke89 joke90 joke91 joke92 joke93 joke94 joke95
## 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989 0.010989
## joke96 joke97 joke98
## 0.010989 0.010989 0.010989
## (argmax, constrained optimization)
## Support for null: -8352.6 + K
##
## alternative hypothesis: sum p_i=1
## alternative estimate:
## joke01 joke02 joke05 joke07 joke08 joke10 joke11 joke13
## 0.0142788 0.0103029 0.0083115 0.0066426 0.0046865 0.0029663 0.0108969 0.0046082
## joke14 joke15 joke16 joke17 joke18 joke19 joke20 joke21
## 0.0060810 0.0052928 0.0039109 0.0054114 0.0072860 0.0075893 0.0061192 0.0164679
## joke22 joke23 joke24 joke25 joke26 joke27 joke28 joke29
## 0.0448125 0.0066556 0.0144200 0.0084494 0.0145403 0.0145015 0.0106451 0.0189413
## joke31 joke32 joke34 joke35 joke36 joke37 joke38 joke39
## 0.0127089 0.0148355 0.0129722 0.0201271 0.0206970 0.0050763 0.0158294 0.0117546
## joke40 joke41 joke42 joke43 joke44 joke45 joke46 joke47
## 0.0095551 0.0057632 0.0112978 0.0076502 0.0049390 0.0070566 0.0133322 0.0080491
## joke48 joke49 joke50 joke51 joke52 joke53 joke54 joke55
## 0.0103991 0.0163926 0.0226519 0.0114100 0.0088279 0.0183376 0.0105216 0.0154454
## joke56 joke57 joke58 joke59 joke60 joke61 joke62 joke63
## 0.0131218 0.0049065 0.0054300 0.0107179 0.0071532 0.0159574 0.0159353 0.0132458
## joke64 joke65 joke66 joke67 joke68 joke69 joke70 joke71
## 0.0056265 0.0136660 0.0151816 0.0072873 0.0123982 0.0135280 0.0065231 0.0061857
## joke72 joke73 joke74 joke75 joke76 joke77 joke78 joke79
## 0.0180365 0.0103435 0.0059029 0.0082546 0.0070831 0.0109530 0.0129614 0.0035140
## joke80 joke81 joke82 joke83 joke84 joke85 joke86 joke87
## 0.0170439 0.0098552 0.0140657 0.0060767 0.0050320 0.0121069 0.0044100 0.0082477
## joke88 joke89 joke90 joke91 joke92 joke93 joke94 joke95
## 0.0100656 0.0187793 0.0144268 0.0092514 0.0034361 0.0137617 0.0166522 0.0086797
## joke96 joke97 joke98
## 0.0099925 0.0150308 0.0097240
## (argmax, free optimization)
## Support for alternative: -8004.5 + K
##
## degrees of freedom: 90
## support difference = 348.14
## p-value: 1.8878e-94
Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.
jester_table <- a
for(i in seq_len(nrow(jester_table))){
x <- jester_table[i,]
x[!is.na(x)] <- rank(-x[!is.na(x)],ties.method="first")
jester_table[i,] <- x
}
jester_table <- t(jester_table)
colnames(jester_table) <-paste("resp",seq_len(ncol(jester_table)),sep="_")
jester_table[1:6,1:10]
## resp_1 resp_2 resp_3 resp_4 resp_5 resp_6 resp_7 resp_8 resp_9 resp_10
## joke01 NA NA NA NA NA NA NA NA 5 NA
## joke02 NA NA NA 10 NA NA NA NA 3 NA
## joke05 17 25 2 17 16 20 4 3 15 13
## joke07 15 24 8 22 10 15 16 23 25 17
## joke08 1 20 15 12 14 24 14 26 32 2
## joke10 NA NA NA NA 22 NA NA NA NA NA
Above we see that respondent 1 ranked joke 8 as the funniest and joke 7 as the 15th funniest.
Comparing with formula1_2022.txt, for example, we see that respondents correspond to venues and jokes correspond to drivers. We have to be a little bit careful because NA means “not rated”, not “did not finish” as in the Formula 1 datasets.
plot(rowMeans(jester_table,na.rm=TRUE))
abline(v=17)
Following lines create jester.rda, residing in the data/ directory of the package.
save(jester,jester_table,jester_maxp,file="jester.rda")
### References
hyper2 Package: Likelihood Functions for Generalized Bradley-Terry Models.” The R Journal 9 (2): 429–39.