Lessons learned about building a re-ranking model that I wish I had known sooner

Search, also known as Information Retrieval, is a common area of Machine Learning (ML). It is employed mainly by large corporations because it takes vast amounts of data and solid infrastructure to build. Data and architecture are not the only burdens; data scientists also face several other challenges. Unfortunately, those issues are rarely discussed; most writing focuses on the ML models themselves. Our team at Pixta Inc. learned many lessons the hard way while working on one of Japan’s most well-known image stock platforms: Pixta Stock. This story is about the practical matters we addressed.
Bias Problems
Like many teams worldwide, we build a system to provide end users with the most relevant, fresh, and diversified results. The baseline model is simple and obvious: we take several features relevant to our business expectations and use them to train a tree-based model, XGBoost. Most of these features are behavioural in nature.
They include clicks, purchases, item positions, and whatever other actions we can legally collect from buyers. These signals helped the model work exceptionally well, as mentioned in Amazon’s publication:
“In product search, hundreds of products might share very similar descriptions and seem to be equally relevant to a particular query. But some of those products are more popular than others and should be ranked higher. That’s why behavioral features drive the rankings in Amazon Search to a much larger extent than they do in Web Search. Typically, they account for most of the variance reduction in gradient-boosted trees.”
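For illustration, here is a minimal sketch of such a baseline: an XGBoost ranker trained on a handful of behavioural features. The feature names and toy data are hypothetical, not our actual schema.

```python
# Minimal sketch of a behavioural-feature baseline ranker (hypothetical data).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Toy training set: 100 queries x 10 candidate images each.
n_queries, n_candidates = 100, 10
n_rows = n_queries * n_candidates

X = np.column_stack([
    rng.random(n_rows),            # historical click-through rate
    rng.random(n_rows),            # historical purchase rate
    rng.integers(1, 11, n_rows),   # position shown in past impressions
    rng.random(n_rows),            # text-match score
])
y = rng.integers(0, 3, n_rows)      # graded label (0 = skip, 1 = click, 2 = buy)
group = [n_candidates] * n_queries  # rows are grouped by query

ranker = xgb.XGBRanker(
    objective="rank:ndcg",
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
)
ranker.fit(X, y, group=group)

# Score the candidates of one query and sort them for display.
scores = ranker.predict(X[:n_candidates])
print(np.argsort(-scores))
```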
Still, these features are biased because they are derived from the position of items on the search page. Empirical evidence shows that users tend to click on or view whatever a ranking system reveals to them, yet, interestingly, those results are not always the most relevant. People interact with them simply because they are right in front of their eyes.
To give a sense of scale, more than 97% of users’ views land on the first page, and within a page, around 90% of viewing time goes to the top and bottom positions. A model trained with such biased features tends to depend heavily on them; in other words, their feature importance is much higher than that of the others. The bias becomes even stronger when behaviour is used both as input features and to derive the relevance labels.
As pointed out in Position Bias Estimation for Unbiased Learning to Rank in Personal Search, one way to address this problem is to make the ranking model aware of item position, for example by taking it as an input feature during training and neutralising it at serving time.
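A common way to apply this idea, sketched below under the assumption of the hypothetical baseline ranker above, is to keep position as a training feature and then fix it to a constant at serving time so every candidate is scored as if it were shown in the same slot.

```python
# Sketch: neutralise the position feature at inference (illustrative names).
import numpy as np

POSITION_COL = 2  # index of the "position shown" feature in the matrix


def debiased_scores(model, candidates: np.ndarray) -> np.ndarray:
    """Score candidates with the position feature fixed to the top slot."""
    neutral = candidates.copy()
    neutral[:, POSITION_COL] = 1  # pretend every item sat in slot 1
    return model.predict(neutral)
```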

Having Suitable Features Is Essential
It is hard to know in advance how many features are sufficient; some models contain 20, 50, or even 100 features. Using feature selection techniques, the team can then shortlist the proper features for training a ranking model. The first measurement is often against offline metrics, but an excellent offline signal does not mean the feature set is complete unless you verify it with an A/B test or the final business metrics and have a reasonable explanation for each feature. On that note, the book by Christoph Molnar offers a good guideline for interpreting a machine learning model.
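As a small illustration, assuming a fitted ranker and the hypothetical feature names from earlier, one quick sanity check is to list the model’s built-in feature importances and question any feature we cannot explain in business terms.

```python
# Sketch: inspect built-in importances of a fitted XGBoost ranker
# (feature names and model are assumed from the earlier snippet).
import pandas as pd


def explainable_features(fitted_ranker, feature_names, top_k=20):
    """Return the top features by built-in importance for manual review."""
    importances = pd.Series(fitted_ranker.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(top_k)

# A feature with high importance but no plausible business explanation is a
# candidate for removal, followed by a re-run of the offline evaluation.
```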
Metric Problems
Several metrics estimate the performance of a ranking model. Offline metrics like NDCG and MAP are based on historical data, which may not reflect future behaviour, especially when we must wait a while before deploying the final solution. Online metrics, based on users’ actions collected after the ranking module is applied in production, are therefore necessary. The best option, however, is to continuously measure the model’s performance against the business goals. Consider MLOps, a set of practices and tools that aim to automate and streamline the lifecycle of machine learning (ML) models, from development to deployment and maintenance.
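For example, a minimal offline check with NDCG@10 might look like the sketch below; the relevance grades and model scores are toy values standing in for held-out impressions.

```python
# Sketch: offline NDCG@10 on toy data.
import numpy as np
from sklearn.metrics import ndcg_score

# One row per query, one column per candidate item.
true_relevance = np.array([
    [2, 1, 0, 0, 1, 0, 0, 0, 0, 0],   # graded labels (2 = buy, 1 = click, 0 = skip)
    [0, 0, 2, 0, 0, 1, 0, 0, 0, 0],
])
model_scores = np.array([
    [0.9, 0.7, 0.3, 0.2, 0.8, 0.1, 0.4, 0.0, 0.2, 0.1],
    [0.2, 0.1, 0.4, 0.3, 0.9, 0.6, 0.1, 0.0, 0.3, 0.2],
])

print("NDCG@10:", ndcg_score(true_relevance, model_scores, k=10))
```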
Focus On Your Target
Suppose your strategy focuses on generating revenue and emphasises surfacing higher-priced items. That is, a photo that receives five $3 offers should be ranked lower than one that receives a single offer worth $30. In the past, we developed a model that optimised for click-through rate (CTR), believing that the more clicks photos receive, the higher the chance they get bought. Plus, CTR is often much higher than the conversion rate (CVR), which lets us end the A/B test sooner. The model was successful in terms of clicks but caused a 10% loss in revenue. Eyes opened, lesson learnt.
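Here is a sketch of what a revenue-aligned label could look like, using made-up numbers that mirror the example above (five $3 sales versus one $30 sale); the column names and log transform are illustrative assumptions, not our production labelling.

```python
# Sketch: a revenue-aware relevance label instead of a click-count label.
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "photo_id": ["A", "A", "A", "A", "A", "B"],
    "price":    [3, 3, 3, 3, 3, 30],
})

per_photo = sales.groupby("photo_id")["price"].agg(revenue="sum", n_sales="count")
# Click-style label: number of sales -> photo A wins (5 vs. 1).
# Revenue-style label: log-compressed revenue -> photo B wins ($30 vs. $15).
per_photo["label"] = np.log1p(per_photo["revenue"])
print(per_photo.sort_values("label", ascending=False))
```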
Negative vs. Positive Ratio
Engineers define an impression as an instance of a user viewing the search results. When generating training data, an impression without any user interaction is labelled negative, and one with interaction is labelled positive. Note that the ratio of click or purchase actions is minuscule: on an image stock platform, even with millions of user records per month, this rate is often 1% or less of total traffic, and purchase behaviour is far rarer than click behaviour. One valuable way to reduce the imbalance in training data is to downsample the negatives, which also speeds up training; this sub-sampling barely impacts the capacity of the ML model. However, keep the proportion of non-actions in the validation and test datasets as close to reality as possible; otherwise, you are giving the model an easy game and will obtain an unexpectedly high metric.
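A minimal sketch of that idea follows, with an assumed label column and an illustrative 10:1 negative-to-positive ratio; only the training split is downsampled.

```python
# Sketch: downsample negatives in the training split only (assumed schema).
import pandas as pd


def downsample_negatives(train: pd.DataFrame, ratio: int = 10, seed: int = 42) -> pd.DataFrame:
    """Keep all positives and at most `ratio` negatives per positive."""
    positives = train[train["label"] > 0]
    negatives = train[train["label"] == 0]
    n_keep = min(len(negatives), ratio * len(positives))
    negatives = negatives.sample(n=n_keep, random_state=seed)
    return pd.concat([positives, negatives]).sample(frac=1, random_state=seed)

# The validation and test splits keep their original negative ratio, so the
# offline metrics reflect the real (hard) distribution.
```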
Concept Drift and Data Drift
One major challenge for ranking systems is that user behaviour constantly changes. On top of that, new documents are continuously uploaded to the database. Data drift occurs when the distribution of the input features shifts, while concept drift occurs when the relationship between the inputs and the target changes, for instance when new trends emerge in user behaviour. Chip Huyen has written about these concepts in more detail.
In one of the systems we work on, we handle around 30,000 new content items daily, along with millions of user actions and thousands of metadata entries. With such a large and ever-changing data stream, our model quickly becomes outdated. Based on our experiments and the rapid pace of change, we have found it necessary to retrain the model frequently, at least once a month. Of course, the cost of processing data and retraining the model must be considered.
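As one illustration, a simple drift check before each scheduled retraining could compare a feature’s recent distribution against its training-time snapshot, for example with a two-sample Kolmogorov-Smirnov test; the data and threshold below are illustrative assumptions.

```python
# Sketch: a basic data-drift check on a single feature (toy data).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_snapshot = rng.normal(loc=0.0, scale=1.0, size=10_000)  # feature at training time
live_window = rng.normal(loc=0.3, scale=1.0, size=10_000)     # same feature, recent traffic

stat, p_value = ks_2samp(train_snapshot, live_window)
if p_value < 0.01:
    print(f"Drift suspected (KS statistic={stat:.3f}); consider retraining earlier.")
```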
Online learning is a more practical and scalable solution. Thanks to advances in MLOps, integrating online learning into our workflow has become much easier.
Balanced Criteria
That said, it’s essential to consider the key criteria a search engine should meet: fairness, freshness, diversity, and relevance. If these factors aren’t properly balanced, users might be drawn to your search engine initially but will likely lose interest over time due to repetitive or boring results. Adding more options to filter down to the desired results is a good rescue buoy, but the ranking model should do its best to balance the criteria.
Exploration versus exploitation. There is a trade-off between the two: showing only the most relevant items that already attract users’ interactions is pure exploitation, while giving new, unproven items a visible position is exploration. Balancing the two tends to keep end users engaged.
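One simple way to strike such a balance, sketched below with hypothetical items and an illustrative epsilon, is an epsilon-greedy re-rank that occasionally swaps a fresh item into the visible top-k.

```python
# Sketch: epsilon-greedy injection of fresh items into the top-k.
import random


def rerank_with_exploration(ranked_items, fresh_items, epsilon=0.1, top_k=20, seed=None):
    """Mostly exploit the model's order, occasionally surface a fresh item."""
    rng = random.Random(seed)
    results = list(ranked_items[:top_k])
    pool = [item for item in fresh_items if item not in results]
    for slot in range(len(results)):
        if pool and rng.random() < epsilon:
            results[slot] = pool.pop(rng.randrange(len(pool)))
    return results


print(rerank_with_exploration(list(range(100)), ["new-1", "new-2", "new-3"], seed=7))
```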
Business Alignment
It’s crucial to align your ranking model’s training objectives with overall business goals. In many cases, a single purpose isn’t sufficient. You might have a primary target, such as revenue, along with secondary or guardrail targets to ensure balanced performance.
Running an A/B test on a specific user segment is often necessary to validate whether your chosen objective drives business value. I’ve written a detailed article on A/B testing, “Mastering A/B Testing: Unlock Success in Data-Driven Projects”, for your reference; feel free to check it out. It also includes additional insights on how these objectives connect to business outcomes.
Even though an A/B test is a “lazy” way to verify your system, please take the time to analyse the collected data; some hidden gems will help you reach a sound conclusion.
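For instance, part of that analysis can be as simple as a two-proportion z-test on conversion rate between control and treatment; the counts below are made up for illustration.

```python
# Sketch: two-proportion z-test on conversion rate (made-up counts).
from math import sqrt

from scipy.stats import norm

conversions = {"control": 480, "treatment": 540}
visitors = {"control": 50_000, "treatment": 50_000}

p1 = conversions["control"] / visitors["control"]
p2 = conversions["treatment"] / visitors["treatment"]
pooled = (conversions["control"] + conversions["treatment"]) / (visitors["control"] + visitors["treatment"])
se = sqrt(pooled * (1 - pooled) * (1 / visitors["control"] + 1 / visitors["treatment"]))
z = (p2 - p1) / se
p_value = 2 * norm.sf(abs(z))
print(f"CVR control={p1:.4%}, treatment={p2:.4%}, z={z:.2f}, p={p_value:.4f}")
```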
References
- Learning2Rank: A primer
- Practical Lessons from Predicting Clicks on Ads on Facebook
- Position Bias Estimation for Unbiased Learning to Rank in Personal Search
- Metric Learning to Rank
- Interpretable Machine Learning — A Guide for Making Black Box Models Explainable
- https://en.wikipedia.org/wiki/Feature_selection
- Machine Learning Design Interview: Machine Learning System Design Interview
- https://applyingml.com/resources/discovery-system-design/
- Learning to Rank: A Complete Guide to Ranking using Machine Learning
- https://applyingml.com/papers/#search–ranking
- Mastering A/B Testing: Unlock Success in Data-Driven Projects