According to the paper, traditional evaluation measures in information retrieval (IR) mainly assess the quality and utility of search and recommendation results. However, traditional metrics such as relevance, diversity, and novelty do not take fairness of information exposure into account. Rather, they focus on maximizing the user's utility with respect to the user's information need.
Gao discusses prior research on fairness of exposure in IR, which focuses on exposure with respect to demographic attributes such as gender and race. Those studies propose systems in which items receive exposure proportional to their utility, according to whichever exposure distribution the designer considers fair.
The paper proposes a new metric called FAIR (Fairness-Aware IR), which unifies and balances traditional IR metrics and fairness measures. FAIR takes relevance, diversity, novelty, or search intent into account, discounted by position bias and diversity bias. The integrated approach can adopt distribution-based, ratio-based, or error-based fairness metrics. The paper also claims that unifying fairness and utility lets the two benefit each other during optimization instead of interfering with each other. Optimizing FAIR is therefore meant to balance any existing trade-off between utility and fairness and improve both. Conversely, a low FAIR score indicates that utility, fairness, or both are low.
2. Research Questions
What is an ideal metric that balances utility and fairness in information retrieval?
3. Data and Methods
- Google Search: Dataset containing 100 queries and 7,410 webpages crawled from a snapshot of Google Web search.
- ClueWeb09: Dataset used in the TREC 2009 Web Track, commonly used in diversity-related retrieval tasks. It contains roughly 1 billion web pages in multiple languages, approximately 25TB of uncompressed data, crawled from the Web during January and February 2009.
- MovieLens: Public benchmark dataset commonly used in the recommendation literature, especially in recent fair recommendation research.
Methods and Metrics
- Position-discounted KL-divergence
- Invariant Risk Minimization (IRM)
- Epsilon-Greedy Algorithm
- Mean Average Precision (MAP)
- Normalized Discounted Cumulative Gain (nDCG)
- Rank-Biased Precision (RBP)
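The utility metrics listed above have standard textbook definitions, and a position-discounted KL-divergence can be sketched in the same style. The block below is a minimal illustration, not the paper's implementation: the function names, the binary relevance assumption for RBP and average precision, the logarithmic discount, and the normalization in `discounted_kl` are all assumptions made here for the example.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain over the top-k graded relevance values."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg(gains, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

def rbp(rels, p=0.8):
    """Rank-biased precision with persistence p (binary relevance assumed)."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(rels))

def average_precision(rels):
    """Average precision for one query; MAP is the mean over queries.

    Assumption: every relevant item appears somewhere in `rels`.
    """
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) with a small epsilon for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def discounted_kl(groups, target, n_groups):
    """Position-discounted KL between top-k group shares and a target.

    `groups` holds the group id of the item at each rank. The log discount
    and the normalization by the sum of discounts are assumptions here.
    """
    counts = [0] * n_groups
    score, norm = 0.0, 0.0
    for k, g in enumerate(groups, start=1):
        counts[g] += 1
        share = [c / k for c in counts]   # group distribution in the top k
        w = 1 / math.log2(k + 1)          # position discount at rank k
        score += w * kl(share, target)
        norm += w
    return score / norm
```

For example, `discounted_kl([0, 0, 0, 0], [0.5, 0.5], 2)` is larger than `discounted_kl([0, 1, 0, 1], [0.5, 0.5], 2)`, since a ranking that shows only one group deviates more from an even target exposure than an alternating one. Lower values indicate a fairer ranking under this sketch.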
4. Summary of Findings
FAIR combines the essential factors in IR (relevance, diversity, and novelty), discounted by position bias and diversity bias. FAIR aims to jointly optimize and balance both the utility and the fairness of IR. The experiments in the paper showed that, compared to the baseline algorithm, the FAIR optimization algorithm achieved better utility as well as better fairness. FAIR rewards algorithms that perform well on both fairness and utility. Ideally, a high FAIR score should correspond to a ranking that delivers high user utility and exposes items fairly.
Experiments show that among the four datasets, MovieLens had the worst utility scores due to its extreme sparsity. FAIR captured this by assigning MovieLens the lowest score. ClueWeb09 had slightly better relevance scores than Google Search, yet much worse fairness scores. FAIR captured this as well, giving higher scores to Google and Google+DetCons than to ClueWeb09.
FAIR reflected that, for all datasets and the given ranking algorithms, both utility and fairness metric scores improved as k increased. Moreover, FAIR highlighted cases where both utility and fairness were good and gave higher weight to the dimension with higher performance. Experiments show that, because it captures both utility and fairness, the FAIR metric differs the most from individual utility metrics and individual fairness metrics at top-ranking positions. Also, as k increased, the reward for selecting relevant and fair items diminished due to the large position discount.
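To make the diminishing reward at deeper ranks concrete, the short sketch below prints a position discount at several ranks. It assumes a logarithmic discount of the kind used in nDCG; this summary does not state the exact discount FAIR uses, so the function `log_discount` is a hypothetical stand-in.

```python
import math

def log_discount(k):
    """nDCG-style position discount for 1-indexed rank k (assumed form)."""
    return 1 / math.log2(k + 1)

# Show how quickly the weight of a rank position shrinks as k grows.
for k in (1, 2, 5, 10, 50, 100):
    print(f"rank {k:>3}: discount {log_discount(k):.3f}")
```

Under this assumption the weight falls from 1.0 at rank 1 to roughly 0.29 at rank 10, so an equally relevant and fair item contributes far less when it is placed deep in the ranking.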