Lessons learned about building a re-ranking model that I wish I had known sooner

Search, also known as Information Retrieval, is a common area of Machine Learning (ML). It is employed mainly by large corporations because it takes vast amounts of data and solid infrastructure to build. Data and architecture are not the only burdens; data scientists also face several other challenges. Unfortunately, those issues are rarely discussed; most writing focuses on the ML models themselves. Our team at Pixta Inc. learned many lessons the hard way while working on one of Japan’s most well-known image stock platforms: Pixta Stock. This story is about the practical matters we addressed.
Bias Problems
Like many teams worldwide, we build a system to provide end users with the most relevant, fresh, and diversified results. The baseline model is simple and obvious: we take several features relevant to our business expectations and use them to train a tree-based model, XGBoost. Most of these features are behavioural in nature.
They include clicks, purchases, item positions, and whatever other actions we can legally collect from buyers. These signals helped the model work exceptionally well, as mentioned in Amazon’s publication:
“In product search, hundreds of products might share very similar descriptions and seem to be equally relevant to a particular query. But some of those products are more popular than others and should be ranked higher. That’s why behavioral features drive the rankings in Amazon Search to a much larger extent than they do in Web Search. Typically, they account for most of the variance reduction in gradient-boosted trees.”
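For illustration, here is a minimal sketch of such a baseline: an XGBoost ranker trained on a handful of behavioural features. The feature names and toy data are hypothetical, not our actual schema.

```python
# Minimal sketch of a behavioural-feature baseline ranker (hypothetical data).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Toy training set: 100 queries x 10 candidate images each.
n_queries, n_candidates = 100, 10
n_rows = n_queries * n_candidates

X = np.column_stack([
    rng.random(n_rows),            # historical click-through rate
    rng.random(n_rows),            # historical purchase rate
    rng.integers(1, 11, n_rows),   # position shown in past impressions
    rng.random(n_rows),            # text-match score
])
y = rng.integers(0, 3, n_rows)      # graded label (0 = skip, 1 = click, 2 = buy)
group = [n_candidates] * n_queries  # rows are grouped by query

ranker = xgb.XGBRanker(
    objective="rank:ndcg",
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
)
ranker.fit(X, y, group=group)

# Score the candidates of one query and sort them for display.
scores = ranker.predict(X[:n_candidates])
print(np.argsort(-scores))
```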
Still, these features are biased because they are derived from the position of items on the search page. Empirical evidence shows that users tend to click on or view whatever a ranking system reveals to them, yet, interestingly, those results are not always the most relevant. People interact with them simply because they are right in front of their eyes.
To give a sense of scale, more than 97% of users’ views land on the first page, and within a page, around 90% of viewing time goes to the top and bottom positions. A model trained with such biased features tends to depend heavily on them; in other words, their feature importance is much higher than that of the others. The bias becomes even stronger when behaviour is used both as input features and to derive the relevance labels.
As pointed out in Position Bias Estimation for Unbiased Learning to Rank in Personal Search, one way to address this problem is to make the ranking model aware of item position, for example by taking it as an input feature during training and neutralising it at serving time.
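A common way to apply this idea, sketched below under the assumption of the hypothetical baseline ranker above, is to keep position as a training feature and then fix it to a constant at serving time so every candidate is scored as if it were shown in the same slot.

```python
# Sketch: neutralise the position feature at inference (illustrative names).
import numpy as np

POSITION_COL = 2  # index of the "position shown" feature in the matrix


def debiased_scores(model, candidates: np.ndarray) -> np.ndarray:
    """Score candidates with the position feature fixed to the top slot."""
    neutral = candidates.copy()
    neutral[:, POSITION_COL] = 1  # pretend every item sat in slot 1
    return model.predict(neutral)
```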

Having Suitable Features Is Essential
It is hard to know in advance how many features are sufficient; some models contain 20, 50, or even 100 features. Using feature selection techniques, the team can then shortlist the proper features for training a ranking model. The first measurement is often against offline metrics, but an excellent offline signal does not mean the feature set is complete unless you verify it with an A/B test or the final business metrics and have a reasonable explanation for each feature. On that note, the book by Christoph Molnar offers a good guideline for interpreting a machine learning model.
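As a small illustration, assuming a fitted ranker and the hypothetical feature names from earlier, one quick sanity check is to list the model’s built-in feature importances and question any feature we cannot explain in business terms.

```python
# Sketch: inspect built-in importances of a fitted XGBoost ranker
# (feature names and model are assumed from the earlier snippet).
import pandas as pd


def explainable_features(fitted_ranker, feature_names, top_k=20):
    """Return the top features by built-in importance for manual review."""
    importances = pd.Series(fitted_ranker.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(top_k)

# A feature with high importance but no plausible business explanation is a
# candidate for removal, followed by a re-run of the offline evaluation.
```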
Metric Problems
Several metrics estimate the performance of a ranking model. Offline metrics like NDCG and MAP are based on historical data, which may not reflect future behaviour, especially when we must wait a while before deploying the final solution. Online metrics, based on users’ actions collected after the ranking module is applied in production, are therefore necessary. The best option, however, is to continuously measure the model’s performance against the business goals. Consider MLOps, a set of practices and tools that aim to automate and streamline the lifecycle of machine learning (ML) models, from development to deployment and maintenance.
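For example, a minimal offline check with NDCG@10 might look like the sketch below; the relevance grades and model scores are toy values standing in for held-out impressions.

```python
# Sketch: offline NDCG@10 on toy data.
import numpy as np
from sklearn.metrics import ndcg_score

# One row per query, one column per candidate item.
true_relevance = np.array([
    [2, 1, 0, 0, 1, 0, 0, 0, 0, 0],   # graded labels (2 = buy, 1 = click, 0 = skip)
    [0, 0, 2, 0, 0, 1, 0, 0, 0, 0],
])
model_scores = np.array([
    [0.9, 0.7, 0.3, 0.2, 0.8, 0.1, 0.4, 0.0, 0.2, 0.1],
    [0.2, 0.1, 0.4, 0.3, 0.9, 0.6, 0.1, 0.0, 0.3, 0.2],
])

print("NDCG@10:", ndcg_score(true_relevance, model_scores, k=10))
```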
Focus On Your Target
Suppose your strategy focuses on generating revenue and emphasises surfacing higher-priced items. That is, a photo that receives five $3 offers should be ranked lower than one that receives a single offer worth $30. In the past, we developed a model that optimised for click-through rate (CTR), believing that the more clicks photos receive, the higher the chance they get bought. Plus, CTR is often much higher than the conversion rate (CVR), which lets us end the A/B test sooner. The model was successful in terms of clicks but caused a 10% loss in revenue. Eyes opened, lesson learnt.
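Here is a sketch of what a revenue-aligned label could look like, using made-up numbers that mirror the example above (five $3 sales versus one $30 sale); the column names and log transform are illustrative assumptions, not our production labelling.

```python
# Sketch: a revenue-aware relevance label instead of a click-count label.
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "photo_id": ["A", "A", "A", "A", "A", "B"],
    "price":    [3, 3, 3, 3, 3, 30],
})

per_photo = sales.groupby("photo_id")["price"].agg(revenue="sum", n_sales="count")
# Click-style label: number of sales -> photo A wins (5 vs. 1).
# Revenue-style label: log-compressed revenue -> photo B wins ($30 vs. $15).
per_photo["label"] = np.log1p(per_photo["revenue"])
print(per_photo.sort_values("label", ascending=False))
```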
Negative vs. Positive Ratio
Engineers define an impression as an instance of a user viewing the search results. When generating training data, an impression without any user interaction is labelled negative, and one with interaction is labelled positive. Note that the ratio of click or purchase actions is minuscule: on an image stock platform, even with millions of user records per month, this rate is often 1% or less of total traffic, and purchase behaviour is far rarer than click behaviour. One valuable way to reduce the imbalance in training data is to downsample the negatives, which also speeds up training; this sub-sampling barely impacts the capacity of the ML model. However, keep the proportion of non-actions in the validation and test datasets as close to reality as possible; otherwise, you are giving the model an easy game and will obtain an unexpectedly high metric.
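A minimal sketch of that idea follows, with an assumed label column and an illustrative 10:1 negative-to-positive ratio; only the training split is downsampled.

```python
# Sketch: downsample negatives in the training split only (assumed schema).
import pandas as pd


def downsample_negatives(train: pd.DataFrame, ratio: int = 10, seed: int = 42) -> pd.DataFrame:
    """Keep all positives and at most `ratio` negatives per positive."""
    positives = train[train["label"] > 0]
    negatives = train[train["label"] == 0]
    n_keep = min(len(negatives), ratio * len(positives))
    negatives = negatives.sample(n=n_keep, random_state=seed)
    return pd.concat([positives, negatives]).sample(frac=1, random_state=seed)

# The validation and test splits keep their original negative ratio, so the
# offline metrics reflect the real (hard) distribution.
```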
Concept Drift and Data Drift
One major challenge for ranking systems is that user behaviour constantly changes. On top of that, new documents are continuously uploaded to the database. Data drift occurs when the distribution of the input features shifts, while concept drift occurs when the relationship between the inputs and the target changes, for instance when new trends emerge in user behaviour. Chip Huyen has written about these concepts in more detail.
In one of the systems we work on, we handle around 30,000 new content items daily, along with millions of user actions and thousands of metadata entries. With such a large and ever-changing data stream, our model quickly becomes outdated. Based on our experiments and the rapid pace of change, we have found it necessary to retrain the model frequently, at least once a month. Of course, the cost of processing data and retraining the model must be considered.
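As one illustration, a simple drift check before each scheduled retraining could compare a feature’s recent distribution against its training-time snapshot, for example with a two-sample Kolmogorov-Smirnov test; the data and threshold below are illustrative assumptions.

```python
# Sketch: a basic data-drift check on a single feature (toy data).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_snapshot = rng.normal(loc=0.0, scale=1.0, size=10_000)  # feature at training time
live_window = rng.normal(loc=0.3, scale=1.0, size=10_000)     # same feature, recent traffic

stat, p_value = ks_2samp(train_snapshot, live_window)
if p_value < 0.01:
    print(f"Drift suspected (KS statistic={stat:.3f}); consider retraining earlier.")
```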
Online learning is a more practical and scalable solution. Thanks to advances in MLOps, integrating online learning into our workflow has become much easier.
Balanced Criteria
That said, it’s essential to consider the key criteria a search engine should meet: fairness, freshness, diversity, and relevance. If these factors aren’t properly balanced, users might be drawn to your search engine initially but will likely lose interest over time due to repetitive or boring results. Adding more options to filter down to the desired results is a good rescue buoy, but the ranking model should do its best to balance the criteria.
Exploration versus exploitation. There is a trade-off between the two: showing only the most relevant items that already attract users’ interactions is pure exploitation, while giving new, unproven items a visible position is exploration. Balancing the two tends to keep end users engaged.
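One simple way to strike such a balance, sketched below with hypothetical items and an illustrative epsilon, is an epsilon-greedy re-rank that occasionally swaps a fresh item into the visible top-k.

```python
# Sketch: epsilon-greedy injection of fresh items into the top-k.
import random


def rerank_with_exploration(ranked_items, fresh_items, epsilon=0.1, top_k=20, seed=None):
    """Mostly exploit the model's order, occasionally surface a fresh item."""
    rng = random.Random(seed)
    results = list(ranked_items[:top_k])
    pool = [item for item in fresh_items if item not in results]
    for slot in range(len(results)):
        if pool and rng.random() < epsilon:
            results[slot] = pool.pop(rng.randrange(len(pool)))
    return results


print(rerank_with_exploration(list(range(100)), ["new-1", "new-2", "new-3"], seed=7))
```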
Business Alignment
It’s crucial to align your ranking model’s training objectives with overall business goals. In many cases, a single purpose isn’t sufficient. You might have a primary target, such as revenue, along with secondary or guardrail targets to ensure balanced performance.
Running an A/B test on a specific user segment is often necessary to validate whether your chosen objective drives business value. I’ve written a detailed article on A/B testing, “Mastering A/B Testing: Unlock Success in Data-Driven Projects”, for your reference; feel free to check it out. It also includes additional insights on how these objectives connect to business outcomes.
Even though an A/B test is a “lazy” way to verify your system, please take the time to analyse the collected data; some hidden gems will help you reach a sound conclusion.
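For instance, part of that analysis can be as simple as a two-proportion z-test on conversion rate between control and treatment; the counts below are made up for illustration.

```python
# Sketch: two-proportion z-test on conversion rate (made-up counts).
from math import sqrt

from scipy.stats import norm

conversions = {"control": 480, "treatment": 540}
visitors = {"control": 50_000, "treatment": 50_000}

p1 = conversions["control"] / visitors["control"]
p2 = conversions["treatment"] / visitors["treatment"]
pooled = (conversions["control"] + conversions["treatment"]) / (visitors["control"] + visitors["treatment"])
se = sqrt(pooled * (1 - pooled) * (1 / visitors["control"] + 1 / visitors["treatment"]))
z = (p2 - p1) / se
p_value = 2 * norm.sf(abs(z))
print(f"CVR control={p1:.4%}, treatment={p2:.4%}, z={z:.2f}, p={p_value:.4f}")
```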
References
- Learning2Rank: A primer
- Practical Lessons from Predicting Clicks on Ads on Facebook
- Position Bias Estimation for Unbiased Learning to Rank in Personal Search
- Metric Learning to Rank
- Interpretable Machine Learning — A Guide for Making Black Box Models Explainable
- https://en.wikipedia.org/wiki/Feature_selection
- Machine Learning Design Interview: Machine Learning System Design Interview
- https://applyingml.com/resources/discovery-system-design/
- Learning to Rank: A Complete Guide to Ranking using Machine Learning
- https://applyingml.com/papers/#search–ranking
- Mastering A/B Testing: Unlock Success in Data-Driven Projects