Metrics that matter for measuring search performance

Pivoting from being an SEO to being a product manager responsible for building a search engine made me really appreciate just how unassailably good Google is in its quality and sophistication. And yet, when building a search engine you will inevitably be benchmarked against Google, maybe internally but especially by users. After all, it is a technology almost everyone uses daily, and we have all grown accustomed to excellent search.

Evaluating how good a search engine is amounts to assessing how well it serves its users and how happy users are with the results. There are many avenues for measuring search success, and the key term here is the relevance of results. Relevance is a concept, not a metric; metrics exist to measure relevance. Relevance encapsulates how well a document answers the need of the searcher (the user intent).

Users are unfortunately not very skilled at expressing their intent. Search is subject to an inherent fuzziness because of homonyms (words that can mean different things, e.g. ‘bat’ or ‘match’) and because different queries evoke different user behaviors. While Google - through BERT and more recently MUM - is very effective at understanding the broader context of a query, that is a taller order for us mortal makers of search engines.

Recall and precision as the OG quality evaluation metrics in information retrieval

It is the role of the search engine to return all results that are relevant, and filter out anything that is not. The two most basic metrics for measuring this are precision and recall.

Recall describes the percentage of relevant documents inside the corpus that are returned in the results, while precision describes the percentage of relevant documents inside the returned results. Say you have 100 documents in total, and 25 of these are relevant for a query. Assume you returned all 25 of these documents correctly, but also 5 additional unrelated documents (30 in total). Your precision and recall would then be:

Precision: 25 retrieved relevant / 30 total retrieved ≈ 0.83

Recall: 25 retrieved relevant / 25 total relevant = 1
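
As a minimal sketch (with made-up document IDs), the arithmetic above looks like this in Python:

```python
# Hypothetical document-ID sets for the example above.
retrieved = set(range(1, 31))   # 30 retrieved documents
relevant = set(range(1, 26))    # 25 relevant documents, all of them retrieved

true_positives = len(retrieved & relevant)

precision = true_positives / len(retrieved)  # 25 / 30 ≈ 0.83
recall = true_positives / len(relevant)      # 25 / 25 = 1.0

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```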

A score of one is a perfect score. The two metrics are inversely related to one another, something described by the precision-recall curve. Why is that? Optimizing for high recall means retrieving more documents. Aiming for high precision means being rigorous in omitting documents, which in turn reduces recall.

Which metric you prioritize is therefore philosophically important for the construction of your search engine: do you only want to present users with documents whose relevance you are certain of (at the risk of not showing the complete picture), or do you want to give them a greater selection of documents through which they must navigate with greater effort?

Precision and recall were conceptualized for small IR databases, but are inadequate for databases with millions of documents. This also applies to the F-Measure, which combines precision and recall and is also a common metric.
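
For reference, a minimal sketch of the F1 score, the most common form of the F-Measure, using the precision and recall values from the example above:

```python
def f1_score(precision, recall):
    """F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.83, 1.0))  # ≈ 0.91
```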

Something you learn very fast in SEO is how little results past the first page matter. In fact, even the bottom half of the first page is of modest importance. The inference is that looking at the entire corpus, or even at anything past the first results of a set, is unnecessary, because the user has no need for it. This is the reason why most modern metrics include a damping factor to reduce the weight of later results - some go further and ignore anything past the first page entirely. This alleviates many pains, as there is no longer a need to judge every document/query pair in the corpus.

Precision@K: Confining measurement to an arbitrary number of documents K

Precision@K (with K an arbitrarily selected cutoff) allows you to evaluate the precision of the documents within that range. Very common are the first 5 or 10 results (P@5 and P@10 respectively). This approach builds on the observation stated above: top-ranked results matter the most.

Confining measurement to K means that if a highly relevant result sits at position 11, it may as well be at position 100, as it is not considered. And since most users never navigate past the first page, that's completely OK.
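
A minimal sketch of Precision@K, with a hypothetical ranking in which a relevant document at position 11 simply falls outside the window:

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Precision@K: share of relevant documents within the top K results."""
    top_k = ranked_results[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k

# Made-up ranking: the relevant document "d11" sits at position 11,
# so it has no influence on P@10 at all.
ranking = [f"d{i}" for i in range(1, 21)]
relevant = {"d2", "d5", "d11"}

print(precision_at_k(ranking, relevant, k=10))  # 0.2 - d11 is ignored
```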

While this is an excellent metric for informational or transactional queries, it is often the case that the user is searching for one specific result (i.e. conducting a navigational search). This means that there is only one relevant result and all others are immaterial. This is a drawback of precision and precision@K: the ranking of the documents within the range K is not considered, only their occurrence. If you want to measure how well ranked the best result is, mean reciprocal rank (MRR) is better suited.

Mean Reciprocal Rank (MRR) for cases in which there is one relevant result

MRR aims to discern how close to the top the most important result is - it is an order-aware metric. Mathematically, the reciprocal rank puts 1 in the numerator and the rank of the engaged search result in the denominator. So if your result for a navigational search is in the first position, you receive a perfect score, i.e. 1/1 = 1. Were the result in position 4, the reciprocal rank would be 1/4 = 0.25. The reciprocal ranks are then averaged across queries to produce the MRR.
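
A minimal sketch of MRR over a small set of hypothetical queries:

```python
def mean_reciprocal_rank(rankings, relevant_per_query):
    """MRR: average of 1/rank of the first relevant result per query."""
    reciprocal_ranks = []
    for ranking, relevant in zip(rankings, relevant_per_query):
        rr = 0.0
        for position, doc in enumerate(ranking, start=1):
            if doc in relevant:
                rr = 1.0 / position
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Two made-up navigational queries: the target sits at rank 1 and rank 4.
rankings = [["a", "b", "c", "d"], ["w", "x", "y", "z"]]
relevant = [{"a"}, {"z"}]
print(mean_reciprocal_rank(rankings, relevant))  # (1/1 + 1/4) / 2 = 0.625
```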

This approach is biased towards queries where there is a specific intent and the user is not looking for different options. For users exploring options - doing an informational search - mean average precision (MAP) is a better fit.

Mean Average Precision (MAP): calculating the mean precision across a query set

Most modern search metrics have some sort of damping factor that discounts later results. MAP is one of these metrics. Instead of - as is the case with MRR - checking whether that one relevant document is ranked highly, MAP describes whether all relevant documents are ranked highly. MAP effectively sets K to the position of the last relevant result in the result set. Accordingly, if the documents users interact with are in the top 10, the MAP will be very good. If the interacted-with results are results 60-70, the MAP will be poor.

Average precision checks whether the ground-truth relevant items are actually ranked towards the top, by computing precision at the position of each relevant result and averaging. The mean in MAP comes from the fact that average precision is then averaged across a set of queries.
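
A minimal sketch of average precision and MAP, with made-up document IDs:

```python
def average_precision(ranking, relevant):
    """AP: precision at each relevant result's rank, averaged over all relevant items."""
    hits, precisions = 0, []
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevant_per_query):
    """MAP: average precision averaged across a set of queries."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevant_per_query)]
    return sum(aps) / len(aps)

# Relevant documents near the top yield a high AP.
ranking = ["d1", "d2", "d3", "d4", "d5"]
print(average_precision(ranking, {"d1", "d3"}))  # (1/1 + 2/3) / 2 ≈ 0.83
```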

MAP is really good for understanding what the nature of a query is (informational or navigational), because user behavior will convey it to you (are they opening many documents, or just select ones?).

A glaring flaw in the way we have approached these metrics thus far is the assumption that documents fall into a binary category of relevant or not relevant.

Moving away from binary judgements of relevance

Graded relevance scales make these two judgments the extremes at the ends of a scale, with a gray area in between. Users can either be asked directly in offline evaluation, or user signals can be used to derive the relevance of a result, e.g. through their time-on-result or interactions (subscription, download or share).

Even Google makes use of an extensive network of quality raters to assess its search, and arms those raters with extensive documentation on how to do so. For offline evaluation this is indispensable, because annotators are merely a proxy for the user, not the user themselves. While you may attempt to reverse engineer a query back to the intent behind it, an annotator can never put themselves entirely in the shoes of the searcher. But this topic is a discussion for another day.

Adding more nuance: CG, DCG and nDCG are metrics that evolved to rate both relevance and order

Cumulative Gain (CG) takes a very simple approach: it simply sums the relevance scores of a result set. It assumes that every result in a set adds value.
Gain = the relevance score of each item
Cumulative = the sum of the gains of the items up to a certain rank K

What CG does not do is consider the rank of results. If the first-ranked result has a relevance score of 5 and the second-ranked result has a score of 2, the CG (7) would not change even if the two results were swapped. A vastly inferior ordering would therefore not be reflected in this metric - but that is exactly what we want our metrics to reflect. Items with high relevance should be the top results because they are engaged with more.

Discounted Cumulative Gain (DCG) achieves this. It introduces a log-based penalty to reduce the scores of items in lower positions, meaning a relevant result in position 7 is penalized to have a lower score than a relevant result in position 4.
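
A minimal sketch of DCG with the commonly used log2 discount (the graded relevance scores are made up); note how swapping the first two results now changes the score, unlike with plain CG:

```python
import math

def dcg(relevances):
    """DCG: sum of graded relevance scores, discounted by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

# Hypothetical graded relevance scores in ranked order.
print(dcg([5, 2, 0]))  # ≈ 6.26
print(dcg([2, 5, 0]))  # ≈ 5.15 - swapping the top two results lowers the score
```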

This simple discounting solves the immediate problem, but a glaring issue with DCG is that queries may have a different number of results, which means that queries with more results will achieve higher scores. This makes comparisons across queries difficult: you are comparing apples with oranges.

Normalized Discounted Cumulative Gain (nDCG) circumvents this by comparing the DCG to the ideal DCG (i.e. the score the results would receive if they had been ranked optimally). Therein lies the difficulty with nDCG: what is the ideal order?

If you are only looking at the result set @10 and have retrieved just 1 relevant item out of 10 possible ones, you could have a perfect nDCG (even if you have terrible precision!) by ranking this one result in the first position. If a second algorithm managed to retrieve all 10 relevant results but ordered them suboptimally, its nDCG would be lower than in the first case.
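
A minimal sketch of nDCG that illustrates this, assuming - as is often done in practice - that the ideal ordering is computed from the retrieved results only:

```python
import math

def dcg(relevances):
    """DCG: sum of graded relevance scores, discounted by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """nDCG: DCG divided by the DCG of the ideal reordering of the same results."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# One relevant item retrieved and ranked first: perfect nDCG despite poor recall.
print(ndcg([1, 0, 0, 0, 0, 0, 0, 0, 0, 0]))      # 1.0
# Relevant items retrieved but ranked at the bottom of the window: nDCG drops.
print(ndcg([0, 0, 0, 0, 0, 1, 1, 1, 1, 1]))      # ≈ 0.54
```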

The metrics that really matter rest on the search intent of your user

This post has hopefully given a good overview of the most common ways to measure search. What your north-star metric is hinges on what kind of user behavior you are expecting, and that depends on what task the user is pursuing. But single metrics should not be viewed in a vacuum. Even a metric like precision@K has its origins as a complementary metric at the TREC conferences. Taking an array of metrics into account makes sure you do not have any blind spots or a misguided view of how your users search.

More important than any of these metrics rooted in information retrieval are the metrics you have defined yourself that really generate business value. But that is a topic for another day.