My name is Kirill Khrylchenko, and I lead the RecSys R&D team at Yandex. One of our goals is to develop transformer technologies for recommender systems, an objective we've been pursuing for five years now. Not long ago, we reached a new milestone in the development of recommendation technology, which I'd like to share with you in this article.
The relevance of recommender systems is easy to justify: the amount of content is growing so fast that it is impossible to view it all, and we need recommender systems to cope with the information overload. Music, movies, books, products, videos, posts, friends: it's also important to remember that these services benefit not only users but also content creators, who need to find their target audience.
We've deployed a new generation of transformer recommenders in several services and are actively integrating them into others. The quality of recommendations has improved significantly across the board.
If you're an ML engineer working on recommendations, this article will give you some ideas on how to implement a similar approach in your own recommender system. And if you're a user, you'll have a chance to learn more about how that very recommender system works.
How Recommender Systems Work
The recommendation problem itself has a simple mathematical definition: for each user, we want to select items (objects, documents, products) that they are likely to enjoy.
But there's a catch:
- Item catalogs are huge (up to billions of items).
- There is a vast number of users, and their interests are constantly shifting.
- Interactions between users and items are very sparse.
- It's unclear how to define actual user preferences.
To solve the recommendation problem effectively, we need non-trivial models built with machine learning.
Neural networks are a powerful machine learning tool, especially when there is a large amount of unstructured data, such as text or images. Whereas traditional machine learning requires expert domain knowledge and considerable manual work (feature engineering), neural networks can extract complex relationships and patterns from raw data almost automatically.
In the RecSys domain, we have a large amount of mostly unstructured data (literally trillions of anonymized user-item interactions), as well as entities that are content-based (items consist of titles, descriptions, images, videos, and audio; users can be represented as sequences of events). Moreover, it's crucial that the recommender system performs well for new items and cold users, and encoding users and items through content helps achieve this.
The time we have to generate recommendations for a user is strictly limited. Every millisecond counts! Moreover, we don't have infinite hardware resources, and the catalogs we need to recommend from are quite large. This is why recommendations are usually formed in several stages:
- First, we select a relatively small set of candidates from the full catalog using lightweight models (the retrieval stage).
- Then, we run these candidates through more complex models that use additional information and heavier computations per candidate (the ranking stage).
Architecturally, models vary considerably between stages, making it difficult to discuss any aspect without referring to a specific stage of the recommender system.

The two-tower neural network architecture is very popular for the retrieval stage. Users and items (in information retrieval, these would be queries and documents) are independently encoded into vector representations, and the dot product is used to calculate the similarity between them.
You could also say that such models "embed" users and items into a shared "semantic space", where "semantic" reflects the fact that the closer a user-item pair is in vector space, the more similar they are.
Two-tower models are very fast. Suppose a user requests recommendations. The two-tower model then needs to compute:
- The "user tower" once per request.
- Vectors of all candidate items for which we want to calculate user-item affinity.
- The dot products.
You don't even need to recalculate candidate item vectors for every user query, because they are the same for all users and rarely change; for instance, we don't expect a movie or a music track to change its title often. In practice, we often recalculate item vectors for the entire catalog offline (for example, daily) and upload them either to the service where we need to compute the dot product or to another service that we query online to retrieve the item vectors we need.
But that's a use case where you have some reasonably small number of candidates to calculate user-item affinities for, which is true for the ranking stage. At the candidate generation stage, however, the problem becomes more complicated: we need to compute affinities for all items in the catalog, select the top N (where N is usually in the hundreds or thousands) with the highest affinity values, and forward them to the subsequent stages.
This is where two-tower models are invaluable: we can quickly generate an approximate top N by dot product, even for enormous catalogs, using approximate search methods. We build a special "index" (often a graph structure, as in the HNSW method) over the set of precomputed item vectors; the index is stored in the service, user vectors are fed into it, and an approximate top is extracted for each of them.
Building this index is difficult and time-consuming (with the separate challenge of quickly updating and rebuilding it). That said, it can still be done offline, after which the binary and the index can be uploaded to the service, where we search for candidates at runtime.
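The factorized computation described above can be sketched in a few lines. Here is a minimal illustration with exact brute-force top-N retrieval in NumPy; in production, an approximate index such as HNSW replaces the full scan, but it approximates exactly this operation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Precomputed offline: one vector per catalog item (here a toy 10k-item catalog, dim 64).
item_vectors = rng.standard_normal((10_000, 64)).astype(np.float32)

def retrieve_top_n(user_vector: np.ndarray, n: int = 100) -> np.ndarray:
    """Score every item with a dot product and return indices of the top n."""
    scores = item_vectors @ user_vector          # one dot product per catalog item
    top = np.argpartition(-scores, n)[:n]        # unordered top-n candidates, O(catalog)
    return top[np.argsort(-scores[top])]         # sort only the n candidates

# Computed once per request: the "user tower" output.
user_vector = rng.standard_normal(64).astype(np.float32)
candidates = retrieve_top_n(user_vector, n=100)
```

The key point is the asymmetry: the catalog-wide work happens against precomputed item vectors, while only one user vector is computed online per request.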

How Do We Encode a User Into a Vector?
Classical algorithms solved this problem quite simply: in matrix factorization methods (like ALS), the user vector was "trainable", represented by model parameters and determined during the optimization procedure. In user-item collaborative filtering methods, a user was assigned a vector of catalog dimensionality in which the i-th coordinate corresponded to a particular item and reflected how the user interacted with that item (e.g., how frequently they bought it or how they rated it).
The modern approach is to encode users with transformers. We take the user's anonymized history, that is, a sequence of events, encode those events into vectors, and then apply a transformer. In the most basic case, the events are purchases or likes; in other cases, it could be the entire history of interactions within a company's ecosystem.
Initially, when transformers were first applied to recommendations, researchers drew analogies with NLP: a user is like a sentence, and the words in it are purchases, likes, and other interactions.

Another type of neural recommender model is the early-fusion model. These models don't separate user and item information into two towers but process all of it jointly: we fuse all information about the user, the item, and their interaction at an early stage. Two-tower models, by contrast, are said to perform late fusion via the dot product. Early-fusion models are more expressive than two-tower models: they can capture more complex signals and learn more non-trivial dependencies.
However, they are difficult to apply outside the ranking stage because of their computational cost and the need to recompute the entire model for every user query and every candidate. Unlike two-tower models, they don't allow the computation to be factorized.
We use various architecture types, including two-tower models with transformers and models with early fusion. We use two-tower architectures more often because they are highly efficient, suitable for all stages at once, and still yield good quality gains with considerably fewer resources.
We used to train two-tower models in two stages:
- Pre-training with contrastive learning. We train the model to align users with their positive user-item interactions using contrastive learning.
- Task-specific fine-tuning. As in NLP, fine-tuning is task-specific. If the model is to be used for ranking, we train it to accurately rank the recommendations shown to the user: given two items, one of which the user liked and the other disliked, we want to rank them in the same order. For retrieval, the task resembles pre-training but employs additional techniques that improve the recall of the candidates.
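The contrastive pre-training step above can be sketched as an in-batch softmax loss over user and item embeddings. This is an illustrative formulation, not our production code: row k of the item batch is the positive for user k, and all other rows in the batch serve as negatives.

```python
import numpy as np

def in_batch_contrastive_loss(user_emb: np.ndarray, item_emb: np.ndarray) -> float:
    """Align each user with their positive item against in-batch negatives.

    user_emb, item_emb: (B, d) arrays; row k of item_emb is the positive
    for row k of user_emb.
    """
    logits = user_emb @ item_emb.T                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # NLL of the true pairs

# Toy batch: 8 users, dim 16, with perfectly aligned positives.
user_emb = 3.0 * np.eye(8, 16)
item_emb = 3.0 * np.eye(8, 16)
loss = in_batch_contrastive_loss(user_emb, item_emb)     # near zero for aligned pairs
```

With orthogonal, aligned embeddings the loss approaches zero; with uninformative embeddings it sits at chance level, log(B).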
In the next section, we'll explore how this process has changed with our newer models.
Scaling Recommender Systems
Is there a limit to the size of recommender models beyond which we no longer see size-related improvements in recommendation quality?
For a long time, our recommender models (and not just ours, but models across industry and academia) were very small, which suggested that the answer to this question was "yes".
However, deep learning has the scaling hypothesis, which states that as models grow larger and the amount of data increases, model quality should improve significantly.
Much of the progress in deep learning over the past decade can be attributed to this hypothesis. Even the earliest successes in deep learning were built on scaling, with the emergence of a large image classification dataset, ImageNet, and the strong performance of neural networks (AlexNet) on it.
The scaling hypothesis is even more evident in language models and natural language processing (NLP): you can predict how quality improves with the amount of computation and express the corresponding scaling laws.

What do I mean when I say recommender models can be made bigger?
There are as many as four different axes to scale along.
Embeddings. We have a wide variety of information about users and items, so we have access to a large number of features, and a large portion of them are categorical. Examples of categorical features are item ID, artist ID, genre, or language.
Categorical features have very high cardinality (number of unique values), reaching into the billions, so if you create large trainable embeddings (vector representations) for them, you get enormous embedding matrices.
That said, embeddings are the bottleneck between the input data and the model, so you need to make them large for good quality. For example, Meta* has embedding matrices ranging from 675 billion to 13 trillion parameters, while Google reported at least 1 billion parameters in YoutubeDNN back in 2016. Even Pinterest, which had long promoted inductive graph embeddings with PinSage [1, 2], has recently started using large embedding matrices.
Context length. For decades, recommender system engineers have been busy engineering features. In modern ranking systems, the number of features can reach hundreds or even thousands, and Yandex's services are no exception.
Another example of "context" in a model is the user's history in a transformer. Here, the size of the context is determined by the length of the history. In both industry and academia, this number tends to be very small: a few hundred events at best.
Training dataset size. I already mentioned that we have a lot of data. Recommender systems produce many datasets comparable in size to the GPT-3 training dataset.
The industry offers several published examples of large datasets with billions of training examples: 2 billion, 2.1 billion, 3 billion, 60 billion, 100 billion, 146 billion, 500 billion.
Encoder size. The standard for early-fusion models is in the millions or tens of millions of parameters. According to Google's papers, "simplified" versions of their Wide&Deep models had 1 to 68 million parameters in the experiments [1, 2]. And if we use a two-layer DCN-v2 (a popular neural network layer for early-fusion models) over a thousand continuous features, we get no more than 10 million parameters.
Two-tower models most often use tiny transformers to encode the user: for example, two transformer blocks with hidden dimensionality not exceeding a couple of hundred. Such a configuration has at most a few million parameters.
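As a sanity check on that "few million" figure, here is a back-of-the-envelope parameter count for a hypothetical two-block encoder with hidden size 256. The numbers and the counting convention are my own illustration, not a description of any specific production model:

```python
def transformer_encoder_params(d_model: int, n_blocks: int, ffn_mult: int = 4) -> int:
    """Rough parameter count of a transformer encoder (weight matrices only,
    ignoring embeddings, biases, and layer norms).

    Per block: 4 * d^2 for the Q/K/V/output projections in attention,
    plus 2 * ffn_mult * d^2 for the two feed-forward matrices.
    """
    per_block = 4 * d_model**2 + 2 * ffn_mult * d_model**2
    return n_blocks * per_block

small = transformer_encoder_params(d_model=256, n_blocks=2)  # 1,572,864 ≈ 1.6M
```

So two blocks at hidden size 256 land around 1.6M parameters, squarely in the "few million" range.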
And while the sizes of embedding matrices and training datasets are already quite large, scaling the length of user history and the capacity of the encoder part of the model remains an open question. Is there any significant scaling along these axes or not?
This was the question on our minds in February 2024. Then a paper from researchers at Meta, titled "Actions Speak Louder than Words", cheered us all up a bit.
The authors presented a new encoder architecture called HSTU and formulated both the ranking problem and the candidate generation problem as generative modeling. The model had a very long history length (8,000 events!) along with a large training dataset (100 billion examples), and the user history encoder was much bigger than the previous few million parameters. However, even here, the largest encoder configuration mentioned has only 176 million parameters, and it's unclear whether they deployed it (judging by subsequent articles, they didn't).
Are 176 million parameters in an encoder a lot or a little? If we look at language models, the answer is clear: an LLM with 176 million parameters will be vastly inferior in capability and problem-solving quality to modern SOTA models with billions or even trillions of parameters.
Why, then, do we have such small models in recommender systems?
Why can't we achieve a similar leap in quality if we replace natural language texts with anonymized user histories in which actions act as words? Have recommender models already hit the ceiling of their baseline quality, leaving us only small incremental improvements from tweaking features and target values?
These were the existential questions we asked ourselves when designing our own new approach, ARGUS.
RecSys × LLM × RL
After plowing through the extensive literature on scaling, we found that three main conditions determine the success of neural network scaling:
- A lot of data.
- A sufficiently expressive architecture with large model capacity.
- The most fundamental, general learning task possible.
For example, LLMs are very expressive and powerful transformers that learn from virtually all the data on the internet. Moreover, the task of predicting the next word is a fundamental task that, in practice, decomposes into various tasks across different fields, including grammar, erudition, mathematics, physics, and programming. All three conditions are met!
If we look at recommender systems:
- We also have a lot of data: trillions of interactions between users and items.
- We can just as easily use transformers.
- We just need to find the right learning task to scale the recommender model.
That's what we did.

There's an interesting aspect of pre-training large language models. If you simply ask a pre-trained LLM about something, it will give an average answer: the most likely answer it has encountered in the training data. That answer won't necessarily be good or correct.
But if you add a prompt before the question, like "Imagine you are an expert in X", it will start providing much more relevant and correct answers.
That's because LLMs don't just learn to imitate answers from the internet; they also acquire a more general understanding of the world in an attempt to compress all the information in the training set. They learn patterns and abstractions. And it's precisely because an LLM knows many answers and also possesses a general understanding of the world that we can obtain good answers from it.

We tried to apply this logic to recommender systems. First, you need to express recommendation as a reinforcement learning task:
- The recommender system is an agent.
- Actions are recommendations. In the most basic case, the recommender system recommends one item at a time (for example, one music track in the music streaming app each time).
- The environment is the users: their behaviors, patterns, preferences, and interests.
- The policy is a probability distribution over items.
- The reward is the user's positive feedback in response to a recommendation.

There's a direct analogy to the LLM example. "Answers from the internet" are the actions of past recommender systems (logging policies), and general knowledge about the world is an understanding of users, their patterns, and preferences. We want our new model to be able to:
- Imitate the actions of past recommender systems.
- Understand the users.
- Adjust its actions to achieve a better outcome.
Before we move on to our new approach, let's examine the most popular setup for training recommendation transformers: next-item prediction. The SASRec model is very representative here. The system accumulates a user's history of positive interactions with the service (for example, purchases), and the model learns to predict which purchase is likely to come next in the sequence. That is, instead of next-token prediction, as in NLP, we do next-item prediction.

This approach (SASRec and standard next-item prediction) is not consistent with the philosophy I described earlier, which focused on adjusting the logging policy based on general knowledge of the world. It would seem that to predict what the user will buy next, the model should operate according to this philosophy:
- It should understand what could have been shown to the user by the recommender system that was in production at the time for which the prediction is made. That is, it should have a model of the logging policy's behavior (i.e., a model it can use for imitation).
- It needs to understand what the user might have liked among the things shown by the past recommender system, which means it needs to understand their preferences, that is, the general knowledge about the world.
But models like SASRec don't explicitly model either of these things. They lack full information about past logging policies (we only see recommendations with positive outcomes), and we also don't learn to replicate those logging policies: there's no way to know what the past recommender system could have offered. At the same time, we don't fully capture the model of the world or of the user: we ignore all negative feedback and consider only positive feedback.
ARGUS: AutoRegressive Generative User Sequential Modeling
AutoRegressive Generative User Sequential modeling (ARGUS) is our new approach to training recommendation transformers.
First, we look at the entire anonymized user history, including not only positive interactions but all other interactions as well. We capture the essence of the interaction context: the time it occurred, the device used, the product page the user was on, their My Vibe personalization settings, and other relevant details.

User history is a sequence of triples (context, item, feedback), where context refers to the interaction context, item is the object the user interacts with, and feedback denotes the user's response to the interaction (such as whether the user liked the item, bought it, etc.).
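In code, such a history is simply an ordered list of triples. A minimal sketch (the field names and context keys are my own illustration, not the production schema):

```python
from dataclasses import dataclass

@dataclass
class Event:
    context: dict   # interaction context, e.g. {"surface": "my_vibe", "ts": ...}
    item_id: int    # the object the user interacted with
    feedback: dict  # the user's response, e.g. {"like": 1, "listen_fraction": 0.93}

# A user history: a chronological sequence of (context, item, feedback) triples.
history = [
    Event({"surface": "search", "ts": 1}, item_id=42,
          feedback={"like": 0, "listen_fraction": 0.30}),
    Event({"surface": "my_vibe", "ts": 2}, item_id=7,
          feedback={"like": 1, "listen_fraction": 1.00}),
]
```

Note that the first event here is organic (the user found the track via search), while the second came from a recommendation surface; the context field is what lets the model tell these apart.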
Next, we define two new learning tasks, both of which go beyond the conventional next-item prediction widely used in industry and academia.
Next item prediction
Our first task is also called next item prediction. Given the history and the current interaction context, we predict which item will be interacted with: P(item | history, context).

- If the history contains only recommendation traffic (events generated directly by the recommender system), then the model learns to imitate the logging policy (recommendations from the past recommender system).
- If there is also organic traffic (any traffic other than recommendation traffic, such as traffic from search engines, or the user visiting their library and listening to a favorite track), we also gain more general knowledge about the user, unrelated to the logging policy.
Important: although this task has the same name as in SASRec (next item prediction), it is not the same task at all. We predict not only positive but also negative interactions, and we also take the current context into account. The context helps us understand whether the action is organic or not, and if it's a recommendation, which surface it is on (position, page, or carousel). It also generally reduces the noise level during model training.
Context is especially important for music recommendations: the user's mood and current situation have a big impact on the kind of music they want to listen to.
The task of predicting an element from a set is usually expressed as a classification problem, where the elements of the original set serve as classes. We then train with a cross-entropy loss, where the softmax function is applied to the logits (the unnormalized outputs of the neural network). Computing the softmax requires summing the exponents of the logits across all classes.
While LLM vocabularies reach hundreds of thousands of items at worst, so softmax computation is not a significant problem there, it becomes a concern in recommender systems. Here, catalogs consist of millions or even billions of items, and computing the full softmax is infeasible. This is a topic for a separate long article, but ultimately we have to use a tricky loss function called "sampled softmax" with a logQ correction:


For a positive item p, the loss has the form
L = −log [ exp(s(u, p)/T − logQ(p)) / Σ_{i ∈ {p} ∪ N} exp(s(u, i)/T − logQ(i)) ]
where s(u, i) is the dot product between user u and item i, and:
- N is a mixture of in-batch and uniform negatives.
- logQ(n) is the logQ correction: the log-probability of item n being drawn as a negative.
- T is the temperature, a trained parameter clipped to [0.01, 100].
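A minimal sketch of sampled softmax with a logQ correction, in NumPy for illustration. Here the temperature is a fixed scalar rather than a trained parameter, and scores/sampling probabilities are passed in directly:

```python
import numpy as np

def sampled_softmax_logq_loss(pos_score, neg_scores, log_q_pos, log_q_neg,
                              temperature=1.0):
    """Sampled-softmax NLL of the positive item against sampled negatives.

    Each logit is score / T minus log Q(item), where Q is the probability of
    that item being drawn as a negative; subtracting log Q de-biases the
    sampled loss toward the full-softmax objective.
    """
    t = np.clip(temperature, 0.01, 100.0)   # same clipping range as in the text
    logits = np.concatenate(([pos_score / t - log_q_pos],
                             neg_scores / t - log_q_neg))
    logits -= logits.max()                  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

loss = sampled_softmax_logq_loss(
    pos_score=2.0,
    neg_scores=np.array([0.5, -1.0, 0.2]),
    log_q_pos=np.log(0.25),                  # uniform sampling over 4 items
    log_q_neg=np.log(np.array([0.25, 0.25, 0.25])),
)
```

With a uniform sampling distribution Q the correction is a constant shift and cancels inside the softmax; the correction only changes the loss when negatives are drawn non-uniformly (e.g., in-batch negatives, which over-sample popular items).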
Feedback prediction
Feedback prediction is the second learning task. Given the history, the current context, and the item, we predict the user's feedback: P(feedback | history, context, item).

The first task, next item prediction, teaches the model to imitate logging policies (and to understand users, if there is organic traffic). The feedback prediction task, on the other hand, is focused entirely on acquiring general knowledge about users, their preferences, and interests.
It is very similar to how the ranking variant of the model from "Actions Speak Louder than Words" is trained on a sequence of pairs (item, action). However, here the context token is treated separately, and there are more than just recommendation contexts present.
Feedback can have several components: whether a track was liked, disliked, added to a playlist, and what portion of the track was listened to. We predict all types of feedback by decomposing them into individual loss functions. Any loss function can be used for a particular component, including cross-entropy or regression. For example, binary cross-entropy is sufficient to predict whether a like occurred.
Although some feedback is more frequent than others (there are usually far fewer likes than long listens), the model does a good job of learning to predict all the signals. The larger the model, the easier it is to learn all the tasks at once, without conflicts. Moreover, frequent feedback (listens) actually helps the model learn to predict rare, sparse feedback (likes).
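The decomposition into per-signal losses can be sketched like this. It's illustrative only: binary cross-entropy for the like signal, squared error for the listened fraction, and the signal names are my own:

```python
import numpy as np

def feedback_loss(pred: dict, target: dict) -> float:
    """Sum of per-component losses for one (item, feedback) pair.

    Each feedback component gets its own loss: BCE for the binary like
    signal, mean squared error for the listened fraction.
    """
    p = np.clip(pred["like_prob"], 1e-7, 1 - 1e-7)
    bce = -(target["like"] * np.log(p) + (1 - target["like"]) * np.log(1 - p))
    mse = (pred["listen_fraction"] - target["listen_fraction"]) ** 2
    return float(bce + mse)

loss = feedback_loss(
    pred={"like_prob": 0.9, "listen_fraction": 0.8},
    target={"like": 1, "listen_fraction": 1.0},
)
```

Additional components (dislike, add-to-playlist, ...) extend the sum with their own terms in the same way.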

If we combine all this into a single learning task, we get the following:
- Build user histories from triples (context, item, feedback).
- Apply the transformer.
- Predict the next item from the hidden state of the context token.
- Predict the user's feedback after interacting with the item from the hidden state of the item token.

Let me also comment on how this differs from HSTU. In "Actions Speak Louder than Words", the authors train two separate models for candidate generation and ranking. The candidate generation model consumes the full history, but, like SASRec, it models only positive interactions and skips the loss where the interaction is negative. The ranking model, as mentioned earlier, is trained on a task similar to our feedback prediction.
Our solution offers a more comprehensive next item prediction task and a more comprehensive feedback prediction task, and the model learns both at once.
Simplified ARGUS
Our approach has one big drawback: we inflate the user's history. Because each interaction with an item is represented by three tokens (context, item, feedback), we would have to feed almost 25,000 tokens into the transformer to analyze a user's 8,192 most recent listens.

One might argue that this is still not much and that context lengths are far longer in LLMs; however, that isn't quite accurate. LLM training sequences are, on average, much shorter, typically hundreds of tokens, especially during pre-training.
In contrast, on our music streaming platform, for example, users often have thousands or even tens of thousands of events. We already deal with much longer context lengths, and inflating them by a factor of three hurts training speed all the more. To address this, we created a simplified version of the model in which each triple (context, item, feedback) is compressed into a single vector. In terms of input format, it resembles our earlier generations of transformer models, but we keep the same two learning tasks: next item prediction and feedback prediction.
To predict the next item, we take the transformer's hidden state corresponding to the triple (c, i, f) at the previous point in time, concatenate the current context vector to it, compress the result to a lower dimension with an MLP, and then use sampled softmax to learn to predict the next item.
To predict the feedback, we additionally concatenate the vector of the current item and then use an MLP to predict all the required targets. In terms of recommendation transformer architectures, our model becomes less target-aware and less context-aware; however, it still performs well and gives a three-fold speedup.
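Here is a sketch of how the two heads consume the compressed per-triple hidden state. The dimensions are toy values and the "MLPs" are single linear maps; the real model's shapes and depths are of course different:

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_ctx, d_item = 64, 16, 32

h_prev = rng.standard_normal(d_hidden)    # hidden state of the previous (c, i, f) triple
ctx_now = rng.standard_normal(d_ctx)      # current context vector
item_now = rng.standard_normal(d_item)    # current item embedding

# Next-item head: [h_prev ; ctx_now] -> query vector, scored via sampled softmax.
W_nip = rng.standard_normal((d_item, d_hidden + d_ctx)) * 0.1
query = W_nip @ np.concatenate([h_prev, ctx_now])            # (d_item,)

# Feedback head: [h_prev ; ctx_now ; item_now] -> one output per feedback signal.
W_fp = rng.standard_normal((2, d_hidden + d_ctx + d_item)) * 0.1
like_logit, listen_frac_pred = W_fp @ np.concatenate([h_prev, ctx_now, item_now])
```

The feedback head sees the current item, while the next-item head by construction does not; that is exactly the target-awareness trade-off mentioned above.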
ARGUS Implementation
A model trained in this two-headed mode on both tasks simultaneously (next item prediction and feedback prediction) can be deployed as is: the next item prediction (NIP) head handles candidate selection, and the feedback prediction (FP) head handles final ranking.
But we didn't want to do that, at least not for our first deployment:
- Our goal was to deploy a very large model, so we initially focused on offline deployment. With offline deployment, user and item vectors are recalculated daily in a separate batch process, and only the dot product needs to be computed at runtime.
- The pre-trained version of ARGUS assumes access to the user's history without any delay: at prediction time, it sees all events in the history up to the current moment. That means it would have to be applied at runtime.
- The NIP head predicts all user interactions, whereas candidate-generation models are usually trained to predict only future positive interactions. But predicting positive interactions is a heuristic, a surrogate learning task. It might even be better to use a head that predicts all interactions, because it learns to be consistent with the ranker: if an item was recommended, it means the ranker liked it. In this case, though, we weren't ready to experiment with that and preferred to follow the well-trodden path.
- The FP head is trained with pointwise losses: whether a track will be liked, what portion of the track will be listened to, and so on. But we still often train ranking models with pairwise losses: we learn to rank items that were recommended "next to each other" and received different feedback. Some argue that pointwise losses are sufficient for training ranking models, but in this case we aren't replacing the entire ranking stack. Instead, we aim to add a new, powerful, neural-network-based feature to the final ranking model. If the final ranking model is trained for a particular task (such as pairwise ranking), then the neural network that generates the feature is best trained for that same task; otherwise, the final model will rely less on our feature. Accordingly, we wanted to pre-train ARGUS for the same task as the original ranking model, allowing us to use it in ranking.
There are other deployment use cases beyond the classic candidate generation and ranking stages, and we're actively researching those as well. However, for our first deployment, we went with an offline two-tower ranker:
- We decided to fine-tune ARGUS so that it could be used as an offline two-tower model. We use it to recalculate user and item vectors daily, while user preferences are estimated via the dot product between the user vector and the item vectors.
- We pre-trained ARGUS for a pairwise ranking task similar to the one the final ranking model is trained on. That is, we somehow select pairs of tracks that the user listened to and rated differently in terms of positive feedback, and we want to learn to rank them correctly.
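As a sketch of what such a pairwise objective can look like (the exact loss ARGUS uses isn't specified here, so a logistic pairwise loss is an assumption): the score of the better-rated track is pushed above the score of the other track in the pair.

```python
import numpy as np

def pairwise_logistic_loss(s_pref: np.ndarray, s_other: np.ndarray) -> np.ndarray:
    """log(1 + exp(-(s_pref - s_other))), computed in a numerically
    stable form: penalizes ranking the less-preferred track higher."""
    margin = s_pref - s_other
    return np.log1p(np.exp(-np.abs(margin))) + np.maximum(-margin, 0.0)

# the larger the margin in favor of the preferred track, the smaller the loss
losses = pairwise_logistic_loss(np.array([2.0, 0.0]), np.array([0.0, 2.0]))
print(losses)
```

The stable form avoids overflow for large negative margins while staying exactly equal to the textbook expression.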
We build models like these all the time: they're easy to train and cheap to deploy in terms of resources and development costs. However, our earlier models were significantly smaller and trained differently: not with the ARGUS procedure, but first with the usual contrastive learning between users and positives, and then fine-tuned for the task.
Our previous contrastive pre-training procedure involved compiling multiple training examples per user: if a user had n purchases, there would be n samples in the dataset. We didn't use autoregressive learning; that is, we ran the transformer n times during training. This approach let us be very flexible in creating (user, item) pairs for training, use any history format, encode context along with the user, and account for lags. When predicting likes, we can use a one-day lag in the user's history. However, it all ran quite slowly.
ARGUS pre-training employs autoregressive learning, where we learn from all events in the user's history simultaneously, in a single transformer run. This is a massive speedup that allowed us to train much larger models with the same resources.
During fine-tuning, we also used to run the transformer many times per user. This is the impression-level learning that Meta used before HSTU: if a user is shown an item at a specific moment, we generate a sample of the form (user, item). The dataset can contain a large number of such impressions for a single user, and we rerun the transformer for each of them. For pairwise ranking, we considered triples of the form (user, item1, item2). That's what we used before.
Seeing the speedup at the pre-training stage, we decided to take a similar approach to fine-tuning: we developed a fine-tuning procedure that teaches the two-tower model to rank while running the transformer only once.

Let's say we have the user's entire history for a year, along with all the recommendations shown to the user over the same period. By running a transformer with a causal mask over the entire history, we get vector representations of the user for every moment in that year at once, so we can:
- Separately compute the vectors of the shown items.
- Compare the timestamps and map recommendation impressions to the user vectors corresponding to the required lag in the user history.
- Compute all the required dot products and all the terms of the loss function.
And all of this at once, for the entire year, in a single transformer run.
Previously, we would rerun the transformer for every pair of impressions; now we process all the impressions at once, in a single run. This is a huge speedup: by a factor of tens, hundreds, or even thousands. To use a two-tower model like this, we simply take the user's vector representation at the last moment in time (corresponding to the last event in the history) as the current user vector. For the items, we use the encoder that encoded the impressions during training. In training, we simulate a one-day lag in the user history, and then run the model offline, recalculating user vectors daily.
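Here is a minimal sketch of the matching step described above, with stand-in arrays in place of the transformer and item encoder outputs (all names, shapes, and the searchsorted-based lag lookup are illustrative assumptions, not the production code):

```python
import numpy as np

DIM = 16
rng = np.random.default_rng(0)

# stand-ins for the causal transformer outputs and the item encoder outputs
event_times = np.array([0, 5, 12, 30])       # history event timestamps (days)
user_states = rng.normal(size=(4, DIM))      # one user vector per history event
imp_times = np.array([7, 31])                # recommendation impression timestamps
item_vecs = rng.normal(size=(2, DIM))        # encoded impressed items
LAG = 1                                      # simulated one-day history lag

# for each impression, pick the latest user state that is at least LAG days old
idx = np.searchsorted(event_times, imp_times - LAG, side="right") - 1
print(idx)  # [1 3]

# one dot product per impression: every loss term comes from these scores
scores = (user_states[idx] * item_vecs).sum(axis=1)
print(scores.shape)  # (2,)
```

All impressions are scored against lagged user states in one vectorized pass, which is exactly why a single transformer run suffices.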
When I say that we process the user's entire year of history in a single transformer run, I'm being somewhat misleading. In reality, we enforce a limit on the maximum history length, and a single user can contribute several samples, or chunks, to a dataset. For pre-training, these chunks don't overlap.
During fine-tuning, however, there are limits not only on the maximum history length but also on its minimum length, as well as on the maximum number of recommendation impressions in a single training example.
Results
We chose our music streaming service as the first one to experiment with. Recommendations are crucial there, and the service has a large number of active users. We built a huge training dataset with over 300 billion listens from millions of users. That's tens or even hundreds of times larger than the training datasets we'd used before.
What does a triple (context, item, feedback) look like in a music streaming service?
- Context: whether the current interaction is recommendation-driven or organic. If it's a recommendation: which surface it's on, and if it's My Vibe, what the settings are.
- Item: a music track. The most important feature for item encoding is the item ID. We use unified embeddings to encode high-cardinality features. In this case, we take three 512K hashes per item. In our experiments, we use a fixed unified embedding matrix with 130 million parameters.
- User feedback: whether the track was liked, and what portion of the track was listened to.
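A sketch of how such unified hash embeddings can work (the 512K table size and three hashes per item follow the text; the embedding dimension, hash multipliers, and sum-pooling are assumptions):

```python
import numpy as np

NUM_ROWS, DIM = 512_000, 64
# odd multipliers act as cheap stand-in hash functions (illustrative only)
MULTS = np.array([1_000_003, 2_000_029, 3_000_073])

rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(NUM_ROWS, DIM))  # shared embedding table

def item_embedding(item_ids: np.ndarray) -> np.ndarray:
    """Map each raw item ID through three hashes into the shared table
    and sum the looked-up rows into one item vector."""
    buckets = (item_ids[:, None] * MULTS) % NUM_ROWS   # (batch, 3)
    return table[buckets].sum(axis=1)                  # (batch, DIM)

vecs = item_embedding(np.array([42, 123_456_789]))
print(vecs.shape)  # (2, 64)
```

Multiple hashes into one shared table keep collisions rare without storing a full row per raw ID, which is what makes billion-scale catalogs tractable.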
For offline quality evaluation, we use data from the week following the training period, via a global temporal split.
To assess the quality of the pre-trained model, we examine the loss values on the pre-training tasks: next item prediction and feedback prediction. That is, we measure how well the model learned to solve the tasks we set for it. The lower the value, the better.
Important: we consider the user's history over a long period, but the loss is only computed for events that occur within the test period.
During fine-tuning, we learn to correctly rank pairs of items based on user feedback, which makes PairAccuracy (a metric measuring the share of pairs the model orders correctly) a suitable offline metric for us. In practice, we reweight pairs slightly based on the feedback: for example, pairs where the user liked one track and skipped the other carry more weight than pairs where the user listened to one track and skipped the other.
Our deployment scenario involves adding a strong new feature to the final ranker. For this reason, we measure the relative increase in PairAccuracy for the final ranker with the new feature added, compared to the final ranker without it. The final ranker in our music streaming service is gradient boosting.
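An unweighted sketch of PairAccuracy (the production metric additionally reweights pairs by feedback type, which is omitted here):

```python
import numpy as np

def pair_accuracy(scores: np.ndarray, feedback: np.ndarray) -> float:
    """Fraction of pairs with different feedback that are ranked correctly:
    the item with better feedback should receive the higher score."""
    correct = total = 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if feedback[i] > feedback[j]:  # item i should outrank item j
                total += 1
                correct += scores[i] > scores[j]
    return correct / total

# toy check: scores perfectly aligned with feedback give 1.0
print(pair_accuracy(np.array([0.9, 0.2, 0.5]), np.array([2, 0, 1])))  # 1.0
```

Only pairs with different feedback count; ties in feedback say nothing about ordering and are skipped.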
A/B Test Results and Measurements
Our primary goal was to scale recommendation transformers. To test scaling, we selected four transformer configurations of different sizes, ranging from 3.2 million to 1.007 billion parameters.

We also decided to test the HSTU architecture. In "Actions Speak Louder than Words," the authors proposed a new encoder architecture that is quite different from the transformer. According to the authors' experiments, this architecture outperforms transformers on recommendation tasks.

Scaling works! Each jump in model size yields a quality gain, both in pre-training and in fine-tuning.
HSTU proved to be no better than transformers. We used the largest configuration mentioned by the authors of "Actions Speak Louder than Words." It has one and a half times more parameters than our medium transformer, with roughly the same quality.

Let's plot the metrics from the table as a graph. We can then observe a scaling law across our four points: quality appears to depend linearly on the logarithm of the parameter count.
We conducted a small ablation study to find out whether we could simplify the model or remove any components from training.

If you remove pre-training, the model's quality drops.

If you shorten fine-tuning, the drop becomes even more pronounced.

At the beginning of this article, I mentioned that the authors of "Actions Speak Louder than Words" trained a model with a history length of 8,000 items. We decided to give it a try: it turns out that handling such a deep musical history yields a noticeable improvement in recommendations. Previously, our models used at most 1,500–2,000 events; this was the first time we were able to cross that threshold.
Deployment Results
We've been developing transformers for music recommendations for about three years now, and we've come a long way. Here's everything we've learned and how our transformer-based models for music recommendations have progressed over that time.

- Our first three transformers were all offline. User and item vectors were recalculated daily; user vectors were loaded into a key-value store, item vectors were kept in the service's RAM, and only the dot product was computed at runtime. We used some of these models not just for ranking but also for candidate generation (we tend to build multi-head models that handle both tasks). In those cases, the HNSW index from which candidates are retrieved also resides in the service's RAM.
- The first model only had a signal for likes, the second had a signal for listens (including skips), and in the third we combined both signal types (explicit and implicit).
- The v4 model is an adaptation of v3 that runs at runtime with a slight lag in the user history; its encoder is 6x smaller than that of the v3 model.
- The new ARGUS model has eight times the user history length and ten times the encoder size. It also uses the new training procedure I described earlier.

TLT is total listening time. The "like" probability is the chance that a user likes a recommendation when it's shown to them. Each rollout produced a boost in metrics for our personalized recommendations, and the first ARGUS delivered roughly the same metric gains as all the previous rollouts combined!

My Vibe also has a special setting, Unfamiliar, for which we run a separate ranking stack. We shipped a separate ARGUS for this setting, achieving a 12% increase in total listening time and a 10% growth in like probability. The Unfamiliar setting is used by people interested in discovering new recommendations, so a large increase in this category confirms that ARGUS handles non-trivial scenarios more effectively.
We deployed ARGUS in music scenarios on smart devices and increased the total time users spend with an active speaker by 0.75%. Here, the final ranker is not a gradient boosting model but a full-scale ranking neural network. Thanks to this, we could feed it not just a single scalar feature from ARGUS but the full user and item vectors as input. Compared to a single scalar feature, this increased the quality gain by another one and a half to two times.
ARGUS has already been deployed not only as a ranking feature but also for candidate generation. The team has adapted the offline ARGUS into a runtime version. These rollouts yielded significant gains in key metrics. Neural networks are the future of recommender systems, but there's still a long road ahead.
Thanks for reading.
