Ranking From Pairs and Triplets: Information Quality, Evaluation Methods and Query Complexity

Obtaining judgments from human raters is a vital part in the design of search engines’ evaluation. Today, there exists a discrepancy between judgment acquisition from raters (training phase) and use of the responses for retrieval evaluation (evaluation phase). This discrepancy is derived from the inconsistency between the representation of the information in both phases. During training, raters are requested to provide a relevance score for an individual result in the context of a query, whereas the evaluation is performed on ordered lists of search results, with the results’ relative position (compared to other results) is taken into account. As an alternative to the practice of learning to rank using relevance judgments for individual search results, more and more focus has recently been diverted to the theory and practice of learning from answers to combinatorial questions about sets of search results. That is, users, during training, are asked to rank small sets (typically pairs). We start by statistically comparing human rater response to questions on relevance of individual results versus questions on pairs of results. We empirically show that neither type of response can be deduced from the other, and that the added context created when results are shown together changes raters’ evaluation process. Since pairwise judgments are directly related to ranking, we conclude they are more accurate for that purpose. We go beyond pairs to show that triplets do not contain significantly much more information than pairs for the purpose of measuring statistical preference. These two results establish good stability properties of pairwise comparisons for the purpose of learning to rank. We further analyze different scenarios, in which results of varying quality are added as “decoy”. A recurring source of worry in papers focusing on pairwise comparison is the quadratic number of pairs in a set of results. Which preferences do we choose to solicit from paid raters? Can we provably eliminate a quadratic cost? We initiate a rigorous statistical learning theoretical study in the question of which and how many pairs we need to choose, and prove a result indicating that only O(n polylog n) pairs are required in order to obtain an almost perfect ordering. These pairs are chosen adaptively, and hence our scheme suggests an active learning algorithm.