Unsupervised search metrics to truly understand your users
The metrics I talked about in my previous post are rooted in information retrieval and all require human supervision: we need a labelled dataset and a human to judge results against it. Alternatively (or better: additionally) we can go to our users directly and treat their behavior as implicit feedback on the relevance of search.
Working with implicit feedback is cheaper and more time efficient
Working less with annotators and omitting the collection of explicit human judgements is obviously a big time saver. Moreover, using real-world metrics takes you out of the laboratory and to real users. That's a good thing - but we should be careful not to overvalue these metrics. More on that later.
Searchers invest their time to find good results. They expect a return on their investment. We can improve that return either by reducing the effort they need to invest, or by increasing the quality of the returns. Therein lies the objective of our search: creating a favorable return on investment (ROI) for our users.
Clicks are a good place to start - but should not be overestimated
We can start by looking at the fraction of searches that receive clicks and calculate a click-through rate. But remember: clicks, and all metrics you collect, are only a proxy for relevance. Looking at them in a vacuum can bias your evaluation.
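As a minimal sketch, click-through rate could be computed like this, assuming a search log where each record (the hypothetical `SearchEvent` below) stores a query and the IDs of results the user clicked:

```python
from dataclasses import dataclass, field

@dataclass
class SearchEvent:
    # Hypothetical log record: one search plus the results the user clicked.
    query: str
    clicked_result_ids: list = field(default_factory=list)

def clickthrough_rate(events):
    """Fraction of searches that received at least one click."""
    if not events:
        return 0.0
    with_click = sum(1 for e in events if e.clicked_result_ids)
    return with_click / len(events)

# Made-up log entries for illustration:
log = [
    SearchEvent("wireless headphones", ["sku-123"]),
    SearchEvent("wirless headphones"),           # misspelled query, no click
    SearchEvent("usb c cable", ["sku-77", "sku-80"]),
]
print(f"CTR: {clickthrough_rate(log):.2f}")      # -> CTR: 0.67
```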
Clicks have a tendency to be overvalued. A standalone click doesn't tell you much about how successful the user was. A user could click on an item only to discover that it is not what he is looking for at all. And yet, the click metric would chalk this user's journey up in the "success" category. It gets trickier: maybe the search did provide utility to the user, but the user decided to pause his research because he got distracted, or decided to buy the product elsewhere.
The stronger and more business-oriented metrics are conversions. For these to work, the role of the search inside the customer journey needs to be understood. If sales do not occur, or are not meant to occur, directly following a search, then conversion is a poor metric to measure effectiveness with. Even with well-defined conversion metrics, they still tend to underestimate the net positive influence of the search. What constitutes a good range for any metric depends entirely on the nature of the query and its intent. Conversions and clicks are often, but do not have to be, correlated. Users doing research and not converting, or learning something else on the page which makes them change their mind, are not uncommon occurrences. Informational queries may generate many clicks but not lead to conversions, because the sought information is not meant to convert the user.
While the returns clicks give are often overvalued, conversions are often underestimated. At first glance it may seem strange to use clicks at all when you have access to conversion data. Yet clicks remain very useful, because conversions are usually very sparse. Furthermore, the information that clicks are occurring but not leading to conversions is an informational asset in its own right. Both should be considered - but weighed differently, because conversions are clearly a much stronger signal than clicks.
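One way to weigh the two signals is a simple per-query aggregation. The sketch below assumes log entries with hypothetical `clicked` and `converted` flags, and the weights are purely illustrative - the point is only that conversions count more than clicks without discarding the latter:

```python
from collections import defaultdict

def per_query_signals(events, click_weight=1.0, conversion_weight=5.0):
    """Aggregate clicks and conversions per query and combine them into
    one weighted score. Conversions weigh more because they are the
    stronger signal, but clicks still contribute since conversions are sparse."""
    stats = defaultdict(lambda: {"searches": 0, "clicks": 0, "conversions": 0})
    for e in events:
        s = stats[e["query"]]
        s["searches"] += 1
        s["clicks"] += int(e["clicked"])
        s["conversions"] += int(e["converted"])

    scored = {}
    for query, s in stats.items():
        ctr = s["clicks"] / s["searches"]
        cvr = s["conversions"] / s["searches"]
        scored[query] = {
            "ctr": ctr,
            "conversion_rate": cvr,
            "weighted_score": click_weight * ctr + conversion_weight * cvr,
        }
    return scored

log = [
    {"query": "running shoes", "clicked": True, "converted": True},
    {"query": "running shoes", "clicked": True, "converted": False},
    {"query": "marathon training plan", "clicked": True, "converted": False},
]
print(per_query_signals(log))
```

Queries with many clicks but few conversions show up immediately in such a breakdown, which is exactly the "clicks without conversions" asset mentioned above.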
Taking care not to go overboard when optimizing certain metrics
It seems a plausible strategy to optimize the search by minimizing the amount of effort the user must invest for his return. A common measure for the searcher's effort is the length of the query. And yet, very short queries give very little information and are complex to optimize for. Blind optimization towards this sort of metric quickly loses sight of the utility of an optimization compared to the time invested in it. If your user has only typed in one character, you shouldn't feel compelled to try to serve this query with the correct items. We must expect something of our users. Again: these are all proxies for relevance. They all come with caveats, and there is no metric which can be used as a single source of truth.
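If you do track query length as an effort proxy, it helps to look at its distribution and to set ultra-short queries aside rather than chase them. A small sketch, where the `min_length` cutoff is an illustrative choice, not a recommendation:

```python
from collections import Counter

def query_length_histogram(queries, min_length=2):
    """Bucket queries by character length and count those too short
    to be worth optimizing for."""
    lengths = Counter(len(q.strip()) for q in queries)
    too_short = sum(count for length, count in lengths.items() if length < min_length)
    return lengths, too_short

queries = ["s", "shoes", "red running shoes", "tv"]
histogram, ignored = query_length_histogram(queries)
print(histogram)   # Counter({1: 1, 5: 1, 17: 1, 2: 1})
print(ignored)     # 1 query too short to evaluate meaningfully
```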
Looking only at quantitative data doesn't convey the full picture
We are looking at the success of search from a high level by using these metrics. They tell us what is happening, but not why. The foil against false quantitative conclusions is qualitative QA, which is why I argue that the human side of evaluation is not just a nice-to-have, but a must-have. The gist of annotation is the reasonable suspicion that our proxies can mislead us, and that quality raters give explicit feedback on results.
But we can also analyze the search journey quantitatively at a more granular level. We can measure the query and retrieval segment by observing the number of misspellings corrected, or how effective search suggestions are. We can then analyze how users interact with UI elements and investigate what their use means for the quality of results. These very in-the-trenches metrics help us pinpoint potential improvements in specific areas.
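As a sketch of such segment metrics, assuming each log entry carries hypothetical flags for whether a misspelling was auto-corrected and whether a search suggestion was shown and accepted:

```python
def segment_metrics(events):
    """Granular metrics for the query/retrieval segment:
    how often we correct spelling and how often suggestions are taken up."""
    total = len(events)
    if total == 0:
        return {}
    corrected = sum(1 for e in events if e.get("spell_corrected"))
    shown = [e for e in events if e.get("suggestion_shown")]
    taken = sum(1 for e in shown if e.get("suggestion_clicked"))
    return {
        "spell_correction_rate": corrected / total,
        "suggestion_acceptance_rate": taken / len(shown) if shown else 0.0,
    }

log = [
    {"spell_corrected": True, "suggestion_shown": True, "suggestion_clicked": True},
    {"spell_corrected": False, "suggestion_shown": True, "suggestion_clicked": False},
    {"spell_corrected": False, "suggestion_shown": False, "suggestion_clicked": False},
]
print(segment_metrics(log))
# {'spell_correction_rate': 0.333..., 'suggestion_acceptance_rate': 0.5}
```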
Measuring search success is a beast, but one which needs to be tamed. How else can you prove that what you are doing is providing value?