E.Yılmaz; "Evaluation Metrics, Relevance Judgments...", 25.3., 13:40

Faculty of Engineering and Natural Sciences







Evaluation Metrics, Relevance Judgments and Learning to Rank



Emine Yılmaz


Microsoft Research Cambridge



Most current methods for building search engines rest on two assumptions: that a target evaluation metric measures the quality of the search engine with respect to an end user, and that, given some documents judged for relevance, the engine should be trained to optimize that metric. Treating the target evaluation metric and the relevance judgments as given, many approaches (e.g. LambdaRank, SoftRank, RankingSVM) have been proposed for optimizing retrieval metrics.



In the first half of this talk, I will discuss the effect of the target evaluation metric on learning to rank. In particular, I will question the current assumption that retrieval systems should be designed to directly optimize a metric that is assumed to measure user satisfaction. I will show that even if user satisfaction can be measured by a metric X, optimizing the engine on a training set for a more informative metric Y may yield better test performance according to X than optimizing the engine directly for X on the training set. I will also analyze when the difference between the two cases is significant, in terms of the amount of available training data and the number of dimensions of the feature space.



In the second half, I will focus on the expense of obtaining the relevance judgments needed to compute the value of the evaluation metric during learning to rank. I will describe a sampling-based method that reduces the number of judgments needed to compute the value of the target evaluation metric, and show that it is more effective than any method previously used for this purpose.



Short Bio:


Emine Yılmaz is a postdoctoral researcher in the Information Retrieval and Analysis group at MSR Cambridge. Her current work focuses mostly on the evaluation of search engine quality, learning to rank, and the effect of evaluation metrics on learning to rank. Before joining MSR, she was a PhD student at Northeastern University, where she developed sampling-based methods for efficient retrieval evaluation. Sampling has since become a standard evaluation method, used by most tracks in TREC (Text REtrieval Conference) and INEX (Initiative for the Evaluation of XML Retrieval).




March 25, 2009, 13:40, FENS G025