LinkedIn this week published a detailed technical post describing how it rebuilt the recommendation system underpinning its Feed, deploying large language models, transformer-based architectures, and clusters of H100 GPUs to replace what had previously been a fragmented set of independent retrieval pipelines. The blog post, authored by Hristo Danchev and published on March 12, 2026, is the most granular disclosure LinkedIn has made about the mechanics of content ranking on its platform, which now reaches more than 1.3 billion professionals.

The disclosure matters for the marketing community for a specific reason: the Feed is not merely a social tool. It is the primary surface on which organic and paid content compete for attention on LinkedIn, a platform that accounted for 41% of total B2B paid media budgets in 2025, according to Dreamdata's most recent benchmarks. How the Feed ranks content - what it surfaces and what it suppresses - has direct implications for content strategy, audience targeting, and ultimately return on ad spend.

The old architecture and its limits

Until this rebuild, according to the announcement, LinkedIn's Feed retrieval relied on what the company describes as a "heterogeneous architecture." When a member opened the Feed, content arrived from multiple separate systems running in parallel: a chronological index of network activity, trending posts filtered by geography, collaborative filtering based on similar members' interests, industry-specific trending content, and several embedding-based retrieval systems. Each source maintained its own infrastructure, its own index, and its own optimization logic. The architecture was functional. It produced diverse results. But it carried, according to LinkedIn, "substantial maintenance costs" and made holistic optimization difficult because no single team could tune across all sources simultaneously.

The ranking layer compounded the problem. According to the announcement, the previous approach "treated each impression independently" - meaning that when the system evaluated whether to show a post, it made that judgment in isolation, with no reference to what the member had just read, or what trajectory their interests appeared to be following.

A unified retrieval system

The new system replaces the multi-source architecture with a single, unified retrieval pipeline built around LLM-generated embeddings. The core idea is that a sufficiently capable language model, fine-tuned on LinkedIn engagement data, can represent both posts and member profiles as vectors in a shared embedding space - and that semantic proximity in that space is a better signal of relevance than any combination of keyword matching, collaborative filtering, or geographic trending alone.
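The shared-embedding-space idea can be illustrated with toy vectors (these are hand-picked 3-d stand-ins, not real LLM embeddings): a member and two posts are represented as vectors, and cosine similarity measures semantic proximity.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Semantic proximity of two vectors in a shared embedding space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d embeddings standing in for LLM-generated vectors.
member = np.array([0.9, 0.1, 0.3])   # e.g. a grid-optimization engineer
post_a = np.array([0.8, 0.2, 0.4])   # renewable-infrastructure post
post_b = np.array([0.1, 0.9, 0.1])   # unrelated post

# The semantically closer post scores higher and would be retrieved first.
print(cosine(member, post_a), cosine(member, post_b))
```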

According to the post, the practical difference becomes visible in what the company calls "cold-start" scenarios. When a new member joins with only a profile headline and a job title, a traditional embedding system can identify shallow correlations - power, energy, electronics. A model trained on a large pretraining corpus understands deeper relationships: that an electrical engineer who mentions grid optimization likely has latent interest in renewable energy infrastructure and small modular reactors, even if no engagement history yet exists to confirm it.

From structured data to LLM-readable prompts

One of the more technically specific disclosures in the post concerns how LinkedIn converts structured profile and post data into text that the LLM can process. The company built what it calls a "prompt library" - a system of templates that transforms features into templated sequences. For posts, this includes format, author headline, company, industry, engagement counts, and post text. For members, it includes work history, skills, education, and - critically - a chronologically ordered sequence of posts the member has previously engaged with.
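A template of this kind might look like the following sketch. The field names, ordering, and output format here are illustrative assumptions, not LinkedIn's actual prompt library:

```python
def post_to_prompt(post: dict) -> str:
    """Render a post's structured features as an LLM-readable text prompt.

    Hypothetical template: field names and layout are invented for
    illustration, not taken from LinkedIn's prompt library.
    """
    lines = [
        f"format: {post['format']}",
        f"author_headline: {post['author_headline']}",
        f"company: {post['company']}",
        f"industry: {post['industry']}",
        f"engagement: views={post['views']} likes={post['likes']}",
        f"text: {post['text']}",
    ]
    return "\n".join(lines)

prompt = post_to_prompt({
    "format": "article",
    "author_headline": "Electrical Engineer",
    "company": "Acme Grid Co",
    "industry": "Energy",
    "views": 12345,
    "likes": 87,
    "text": "Notes on grid optimization and small modular reactors.",
})
print(prompt)
```

A member prompt would be assembled the same way, with the engagement history appended as a chronologically ordered list of previously rendered posts.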

The team ran into an unexpected problem with numerical features. Raw engagement counts, passed into the prompt as plain text - "views:12345" - were treated by the model as arbitrary text tokens, with no ordinal meaning. The result was near-zero correlation (-0.004) between item popularity and the cosine similarity scores between member and item embeddings. Popularity is, in practice, a strong relevance signal. The team needed the model to understand it.

The solution was percentile bucketing. Instead of passing a raw count, the system now converts it to a percentile rank, wrapped in special tokens: <view_percentile>71</view_percentile>. This tells the model that the post sits in the 71st percentile of view counts - above average but not exceptional. Crucially, most percentile values between 1 and 100 tokenize as a single token, making the representation stable and learnable. The result was a 30x improvement in correlation between popularity features and embedding similarity, and a 15% improvement in recall@10 - meaning the top 10 retrieved posts were measurably more relevant.
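The bucketing step itself is simple to sketch. The following assumes percentiles are computed against a sorted population of counts for the same feature; how the production system estimates that population is not disclosed:

```python
import bisect

def to_percentile_token(value: int, sorted_population: list) -> str:
    """Convert a raw count to a percentile rank wrapped in special tokens.

    `sorted_population` is an ascending list of counts for the same
    feature across the corpus -- an illustrative stand-in for however
    the production system estimates percentiles.
    """
    rank = bisect.bisect_right(sorted_population, value)
    pct = max(1, min(100, round(100 * rank / len(sorted_population))))
    return f"<view_percentile>{pct}</view_percentile>"

population = sorted(range(0, 10000, 100))   # 100 toy view counts
print(to_percentile_token(7100, population))
```

The model never sees the raw count, only a bounded 1-100 value that (for most percentiles) tokenizes as a single token.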

Training dual encoders

The retrieval model uses a dual encoder architecture. A shared LLM processes both member prompts and item prompts, generating embeddings compared via cosine similarity. Training used InfoNCE loss with two types of negative examples: easy negatives (randomly sampled posts the member never saw) and hard negatives (posts that were shown but received no engagement). The distinction matters. Hard negatives force the model to learn nuanced distinctions between content that is nearly relevant and content that is genuinely valuable.

Adding just two hard negatives per member improved recall@10 by 3.6%, relative to a baseline using only easy negatives. One hard negative per member produced a 2.0% improvement. The marginal value of the second hard negative - an additional 1.6 percentage points - was substantial enough to justify the engineering cost involved.

A second key finding concerned the composition of the member's interaction history. Initially, the team included all impressed posts, both those the member engaged with and those they scrolled past. This hurt performance and increased compute costs. GPU compute scales quadratically with context length. By filtering to include only posts that received positive engagement, the team achieved a 37% reduction in per-sequence memory footprint, the ability to process 40% more training sequences per batch, and a 2.6x faster training iteration - all at equivalent or better model quality. The training configuration used 8 H100 GPUs.
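The filtering step is conceptually a one-liner. Action names here are illustrative (the post lists passive and active engagement tasks; the exact labels used for filtering are an assumption):

```python
# Illustrative: keep only positively engaged posts in the history sequence.
history = [
    {"post_id": "a", "action": "like"},
    {"post_id": "b", "action": "impression"},   # scrolled past
    {"post_id": "c", "action": "comment"},
    {"post_id": "d", "action": "impression"},
    {"post_id": "e", "action": "share"},
]
POSITIVE = {"like", "comment", "share", "long_dwell", "click"}  # assumed set
filtered = [h for h in history if h["action"] in POSITIVE]

# Shorter sequences pay off quadratically in self-attention:
# cost ratio ~ (len(filtered) / len(history)) ** 2
print(len(filtered), len(history))
```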

Sequential ranking with the Generative Recommender

Retrieval determines which posts reach the ranking stage. What a member actually sees is determined by a separate model LinkedIn calls the Generative Recommender (GR). Unlike the previous approach - which scored each post independently - the GR model treats a member's Feed interaction history as a sequence. It processes more than 1,000 historical interactions to understand temporal patterns and long-term interest trajectories.

The architecture uses transformer layers with causal attention, meaning each position in the sequence can only attend to previous positions. This mirrors the actual temporal flow of how a member experiences content. A member who engaged with machine learning content on Monday, then distributed systems on Tuesday, is not displaying two random preferences. According to the post, "a sequential model understands the trajectory."
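Causal attention is implemented with a lower-triangular mask. A minimal sketch (the mask shape and usage follow the standard transformer construction, not any disclosed LinkedIn code):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular boolean mask: position i may attend only to
    positions j <= i, mirroring the temporal order of interactions."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
print(mask.astype(int))
# Row i marks which past positions interaction i can attend to;
# the upper triangle (the future) is masked out.
```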

After the transformer layers, the model uses a technique called late fusion: the transformer output is concatenated with per-timestep context features - device type, member profile embeddings, aggregated affinity scores - before being passed through a Multi-gate Mixture-of-Experts (MMoE) prediction head. Passive tasks (click, skip, long-dwell) and active tasks (like, comment, share) receive specialized gating while sharing the same underlying sequential representations.
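The late-fusion-plus-MMoE step can be sketched as follows. Dimensions, the task list, and all weights are random placeholders; only the structure (concatenate after the transformer, then per-task softmax gates over shared experts) reflects the description:

```python
import numpy as np

rng = np.random.default_rng(0)
D_SEQ, D_CTX, D_EXP, N_EXPERTS = 16, 6, 8, 3
TASKS = ["click", "long_dwell", "like", "comment"]   # illustrative task set

def mmoe_head(seq_out, ctx_feats, params):
    """Late fusion followed by a Multi-gate Mixture-of-Experts head.

    `seq_out` is the transformer output for one timestep; `ctx_feats`
    are the per-timestep context features fused *after* the transformer.
    """
    x = np.concatenate([seq_out, ctx_feats])           # late fusion
    experts = np.stack([np.tanh(W @ x) for W in params["experts"]])
    preds = {}
    for task in TASKS:
        logits = params["gates"][task] @ x             # per-task gate
        gate = np.exp(logits - logits.max())
        gate /= gate.sum()
        mixed = gate @ experts                         # weighted expert mix
        preds[task] = 1 / (1 + np.exp(-(params["heads"][task] @ mixed)))
    return preds

params = {
    "experts": [rng.normal(size=(D_EXP, D_SEQ + D_CTX)) for _ in range(N_EXPERTS)],
    "gates": {t: rng.normal(size=(N_EXPERTS, D_SEQ + D_CTX)) for t in TASKS},
    "heads": {t: rng.normal(size=D_EXP) for t in TASKS},
}
preds = mmoe_head(rng.normal(size=D_SEQ), rng.normal(size=D_CTX), params)
```

All tasks read the same expert outputs, but each gate learns its own mixture - which is how passive and active engagement signals get specialized treatment over a shared sequential representation.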

The decision to fuse count and affinity features after, rather than inside, the transformer sequence is a deliberate computational trade-off. Including them in the self-attention pathway would inflate cost quadratically - self-attention already scales quadratically with sequence length. Since these features provide independent signal strength rather than sequential interaction value, late fusion delivers their benefit without the cost penalty.

Serving at scale

The engineering challenge is not only building a capable model. It is serving that model at thousands of queries per second with sub-second latency. Traditional ranking models at LinkedIn ran on CPUs. Transformers are different. Self-attention scales quadratically with sequence length, and parameter counts in the billions require the high-bandwidth memory available only on GPUs.

LinkedIn's solution involves a set of custom infrastructure choices. A disaggregated architecture separates CPU-bound feature processing from GPU-heavy model inference. A shared context batching approach computes the history representation once, then scores all candidates in parallel using custom attention masks. A custom Flash Attention variant - called GRMIS (Generative Recommender Multi-Item Scoring) - delivers an additional 2x speedup over PyTorch's standard scaled dot-product attention implementation.
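The shared-context idea is easiest to see stripped of the attention machinery: the expensive history representation is computed once per request, and all candidates are then scored against it in one batched operation. This is a brute-force stand-in for the masked-attention approach described, not GRMIS itself:

```python
import numpy as np

def score_candidates(history_repr: np.ndarray, candidate_embs: np.ndarray) -> np.ndarray:
    """Shared-context scoring: one matrix multiply scores every
    candidate against the member's (already computed) history
    representation in parallel."""
    return candidate_embs @ history_repr

rng = np.random.default_rng(0)
history_repr = rng.normal(size=32)        # computed once per request
candidates = rng.normal(size=(500, 32))   # all ranking candidates
scores = score_candidates(history_repr, candidates)
top10 = np.argsort(-scores)[:10]          # best-scoring candidates first
```

The custom attention masks in the real system achieve the same sharing inside the transformer: candidate positions attend to the shared history without recomputing it per candidate.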

On the training side, a custom C++ data loader eliminates Python's multiprocessing overhead by fusing padding, batching, and packing at the native code level. Custom CUDA kernels for multi-label AUC computation reduce metric calculation from a significant bottleneck to negligible overhead.

The result, according to the post, is sub-50ms retrieval latency across an index containing millions of posts. The system retrieves the top nearest-neighbor candidates by running a k-nearest-neighbor search against a GPU-accelerated index of item embeddings. Embeddings are refreshed through three continuous nearline pipelines - prompt generation, embedding inference, and index updates - each optimized independently. New posts receive embeddings in near-real time. Existing posts gaining engagement are dynamically refreshed.
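The retrieval query itself reduces to a k-nearest-neighbor search over item embeddings. A brute-force NumPy sketch (the production system uses a GPU-accelerated index; index size, dimension, and k here are illustrative):

```python
import numpy as np

def knn_retrieve(member_emb: np.ndarray, item_index: np.ndarray, k: int = 10) -> np.ndarray:
    """Cosine-similarity k-nearest-neighbor search over an item index.

    Brute-force stand-in for a GPU-accelerated ANN index; normalizing
    both sides makes dot products equal cosine similarity.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    sims = norm(item_index) @ norm(member_emb)
    top = np.argpartition(-sims, k)[:k]        # unordered top-k
    return top[np.argsort(-sims[top])]         # top-k item ids, best first

rng = np.random.default_rng(1)
index = rng.normal(size=(100_000, 64))   # millions of posts in production
member = rng.normal(size=64)
candidate_ids = knn_retrieve(member, index, k=10)
```

The three nearline pipelines keep `item_index` fresh: new posts get rows appended in near-real time, and rows for posts gaining engagement are recomputed.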

Why this matters for marketing professionals

The implications extend beyond organic reach. The Feed is the primary placement for LinkedIn's sponsored content formats, and the same ranking logic that determines which organic posts a member sees operates alongside paid placement decisions. LinkedIn's ad platform has grown significantly in recent years, with B2B return on ad spend reaching 121% in 2025, according to Dreamdata's March 10, 2026 report. A Feed that better models professional interest trajectories - rather than scoring posts in isolation - changes the competitive dynamics for both organic and paid content.

Marketers producing content aimed at professionals in adjacent or emerging fields may find the new system more receptive to their material, particularly if those audiences have demonstrated latent interest signals even without direct engagement history. The cold-start handling - inferring interests from profile data and world knowledge embedded in the LLM - is relevant for campaigns targeting professionals who are new to a category or role.

The question of how LinkedIn surfaces content has become more pressing as the platform's weight in B2B media plans has grown. At 41% of B2B paid media budgets in 2025, LinkedIn is the single largest line item for many advertisers - larger than any individual Google product. Understanding the mechanics of what the Feed prioritizes is not an academic exercise.

The LLM-based approach also introduces a different kind of content competition. Under keyword-based or shallow-embedding systems, a post about "data security" competes primarily with other posts about "data security." Under an LLM-based system that understands semantic relationships, the same post competes with content about regulatory compliance, cloud infrastructure, and operational risk - because the model understands those topics as interconnected for certain professional profiles. For content strategists, the breadth of effective competition expands considerably.

LinkedIn has previously disclosed its broader AI-driven content strategy, including how it restructured its own marketing operations after B2B non-brand traffic fell by up to 60% on its owned web properties. The March 12 engineering disclosure is a separate but related thread - it describes the internal recommendation infrastructure rather than the platform's external SEO posture, but both reflect the same underlying shift: the platform's content surfaces are increasingly mediated by LLM-based reasoning rather than simpler signal aggregation.

Responsible AI considerations are mentioned in the announcement. According to the post, LinkedIn "regularly and rigorously" audits models to check that posts from different creators compete on equal footing, and that the scrolling experience is consistent across audience groups. The ranking model relies on professional signals and engagement patterns, and specifically excludes demographic attributes.

Summary

Who: LinkedIn's AI Modeling, Product Engineering, and Infrastructure teams, led by engineer Hristo Danchev.

What: A complete rebuild of LinkedIn's Feed recommendation system, replacing a multi-source retrieval architecture with a unified LLM-based dual encoder for retrieval and a transformer-based Generative Recommender (GR) model for ranking. Key technical specifics include percentile-bucketed numerical features, hard negative sampling with a 3.6% recall gain, 8 H100 GPUs for training, sub-50ms retrieval latency, and a custom Flash Attention variant delivering 2x additional speedup.

When: The engineering blog was published March 12, 2026, though the system has been rolling out as an ongoing deployment. Related infrastructure work, including SGLang-based LLM serving, was documented in a February 20, 2026 post.

Where: The changes affect the LinkedIn Feed globally, serving content to all 1.3 billion members of the platform. The infrastructure runs on GPU clusters, with nearline pipelines refreshing embeddings and indices continuously.

Why: LinkedIn's previous heterogeneous retrieval architecture created maintenance complexity and made holistic optimization difficult. The ranking model treated each impression independently, missing sequential patterns in professional content consumption. The rebuild aims to surface more relevant content - including from outside a member's immediate network - by modeling both semantic understanding and temporal interest trajectories.
