How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo

February 27, 2025


Welcome to part 2 of my LLM deep dive. If you’ve not read Part 1, I highly encourage you to check it out first.

Previously, we covered the first two major stages of training an LLM:

Pre-training: Learning from massive datasets to form a base model.
Supervised fine-tuning (SFT): Refining the model with curated examples to make it useful.

Now, we’re diving into the next major stage: Reinforcement Learning (RL). While pre-training and SFT are well-established, RL is still evolving but has become a critical part of the training pipeline.

I’ve drawn on Andrej Karpathy’s widely popular 3.5-hour YouTube video. Andrej is a founding member of OpenAI, so his insights are gold. You get the idea.

Let’s go 🚀

What’s the purpose of reinforcement learning (RL)?

Humans and LLMs process information differently. What’s intuitive for us, like basic arithmetic, may not be for an LLM, which only sees text as sequences of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training.

This difference in cognition makes it challenging for human annotators to provide the “perfect” set of labels that consistently guide an LLM toward the right answer.

RL bridges this gap by allowing the model to learn from its own experience.

Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback (reward signals) on which outputs are most useful. Over time, it learns to align better with human intent.

Intuition behind RL

LLMs are stochastic, meaning their responses aren’t fixed. Even with the same prompt, the output varies because it’s sampled from a probability distribution.
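To make the sampling idea concrete, here is a minimal sketch in plain Python. The vocabulary, the prompt, and the scores (logits) are made-up values for illustration; a real LLM produces scores over tens of thousands of tokens.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, rng):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical next-token scores for the prompt "The sky is"
vocab = ["blue", "clear", "falling", "grey"]
logits = [2.0, 1.0, -1.0, 0.5]

rng = random.Random(0)
samples = [sample_token(vocab, logits, rng) for _ in range(10)]
print(samples)  # the prompt never changes, yet the drawn tokens can differ
```

High-probability tokens show up most often, but nothing stops a less likely token from being drawn, which is why repeated runs of the same prompt can diverge.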

We can harness this randomness by generating thousands or even millions of possible responses in parallel. Think of it as the model exploring different paths, some good, some bad. Our goal is to encourage it to take the better paths more often.

To do this, we train the model on the sequences of tokens that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, reinforcement learning allows the model to learn from itself.

The model discovers which responses work best, and after each training step, we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future.
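Here is a toy sketch of that loop, under heavy assumptions: the "model" is just three adjustable scores over three candidate answers, the reward is 1 for the correct answer and 0 otherwise, and the update is a basic REINFORCE-style rule that I picked for illustration. Real LLM training is far more involved.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate answers to "17 + 25 = ?"
candidates = ["41", "42", "43"]
correct = "42"
logits = [0.0, 0.0, 0.0]  # the toy model's parameters
learning_rate = 0.5
rng = random.Random(0)

for step in range(50):
    probs = softmax(logits)
    # Sample a batch of responses from the current model
    batch = rng.choices(range(len(candidates)), weights=probs, k=16)
    for i in batch:
        reward = 1.0 if candidates[i] == correct else 0.0
        # Gradient of log p(i) with respect to the logits is onehot(i) - probs
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += learning_rate * reward * grad / len(batch)

final = softmax(logits)
print({c: round(p, 3) for c, p in zip(candidates, final)})
```

Only sampled responses that earned a reward move the parameters, so probability mass drifts toward the answer that works.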

But how do we determine which responses are best? And how much RL should we do? The details are tricky, and getting them right is not trivial.

RL is not “new”: it can surpass human expertise (AlphaGo, 2016)

A great example of RL’s power is DeepMind’s AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play.

In the 2016 Nature paper (graph below), when a model was trained purely by SFT (giving the model tons of good examples to imitate), the model was able to reach human-level performance, but never surpass it.

The dotted line represents the performance of Lee Sedol, one of the best Go players in the world.

This is because SFT is about replication, not innovation: it doesn’t allow the model to discover new strategies beyond human knowledge.

However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately exceed human expertise (blue line).

Image taken from the AlphaGo 2016 paper

RL represents an exciting frontier in AI, where models can explore strategies beyond human imagination when we train them on a diverse and challenging pool of problems to refine their thinking strategies.

RL foundations recap

Let’s quickly recap the key components of a typical RL setup:

Agent: The learner or decision maker. It observes the current situation (state), chooses an action, and then updates its behaviour based on the outcome (reward).
Environment: The external system in which the agent operates.
State: A snapshot of the environment at a given step t.

At each timestep, the agent performs an action in the environment that changes the environment’s state to a new one. The agent also receives feedback indicating how good or bad the action was.

This feedback is called a reward, and is represented in numerical form. A positive reward encourages that behaviour, and a negative reward discourages it.

By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximise the total reward over time.
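The agent, environment, state, action and reward loop can be sketched in a few lines. The corridor environment below is something I made up for illustration (walk right to reach a goal cell); the reward values are arbitrary, and no learning happens here, only the interaction loop.

```python
import random

class Corridor:
    """Hypothetical environment: start at cell 0, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """action is -1 (left) or +1 (right). Returns (state, reward, done)."""
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small penalty for every wasted step
        return self.state, reward, done

def run_episode(env, policy, max_steps=100):
    """Let the agent act until the goal is reached or the step limit is hit."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            break
    return total_reward

rng = random.Random(0)

def random_policy(state):
    return rng.choice([-1, 1])

def always_right(state):
    return 1

print("random policy:", run_episode(Corridor(), random_policy))
print("always right: ", run_episode(Corridor(), always_right))
```

The policy that always walks right collects the highest possible total reward in this toy setup, which is exactly the kind of behaviour an RL algorithm would try to discover from the reward signal alone.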

Policy

The policy is the agent’s strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps.

In mathematical terms, it’s a function that determines the probability of different outputs for a given state: πθ(a|s).

Value function

An estimate of how good it is to be in a certain state, considering the long-term expected reward. For an LLM, the reward might come from human feedback or a reward model.
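One common way to turn "long-term expected reward" into a number is the discounted return, averaged over sampled episodes. The sketch below assumes a discount factor gamma of 0.9 and three made-up reward sequences, all starting from the same state; neither comes from the sources discussed here.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward k steps away is weighted by gamma**k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_value(episodes, gamma=0.9):
    """Monte Carlo estimate of V(s): the average return over sampled episodes."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)

# Hypothetical reward sequences observed after starting in the same state
episodes = [
    [0.0, 0.0, 1.0],  # reward arrives after three steps
    [0.0, 1.0],       # reward arrives after two steps
    [0.0, 0.0, 0.0],  # no reward at all
]
value = estimate_value(episodes)
print(round(value, 3))
```

Rewards that arrive sooner count for more, so a state from which good outcomes are both likely and close gets a higher value.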

Actor-Critic architecture

It’s a popular RL setup that combines two components:

Actor: Learns and updates the policy (πθ), deciding which action to take in each state.
Critic: Evaluates the value function (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes.

How it works:

The actor picks an action based on its current policy.
The critic evaluates the outcome (reward + next state) and updates its value estimate.
The critic’s feedback helps the actor refine its policy so that future actions lead to higher rewards.
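The three steps above can be sketched on the smallest possible problem: one state, two actions, and an episode that ends after a single step. Everything here (the rewards, the learning rates, the use of a one-step TD error as the critic's feedback) is an assumption chosen for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
logits = [0.0, 0.0]    # actor: preferences over two hypothetical actions
value = 0.0            # critic: estimate of V(s) for the single state
actor_lr, critic_lr = 0.1, 0.1
rewards = [0.0, 1.0]   # action 1 is the better choice

for _ in range(500):
    probs = softmax(logits)
    # 1. The actor picks an action from its current policy
    action = rng.choices([0, 1], weights=probs, k=1)[0]
    reward = rewards[action]
    # 2. The critic compares the outcome with its expectation.
    #    The episode ends after one step, so the next-state value is 0.
    td_error = reward - value
    value += critic_lr * td_error
    # 3. The actor shifts probability toward actions that beat expectations
    for a in range(2):
        grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += actor_lr * td_error * grad_log_prob

probs = softmax(logits)
print("P(action 1) =", round(probs[1], 3), "| V(s) =", round(value, 3))
```

A positive TD error means "better than the critic expected", so the chosen action becomes more likely; a negative one makes it less likely.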

Putting it all together for LLMs

The state can be the current text (prompt or conversation), and the action can be the next token to generate. A reward signal (e.g. from human feedback or a reward model) tells the model how good or bad its generated text is.

The policy is the model’s strategy for picking the next token, while the value function estimates how beneficial the current text context is in terms of eventually producing high-quality responses.

DeepSeek-R1 (published 22 Jan 2025)

To highlight RL’s importance, let’s explore DeepSeek-R1, a reasoning model achieving top-tier performance while remaining open-source. The paper introduced two models: DeepSeek-R1-Zero and DeepSeek-R1.

DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT).
DeepSeek-R1 builds on it, addressing the challenges encountered.

Deepseek R1 is one of the most amazing and impressive breakthroughs I’ve ever seen — and as open source, a profound gift to the world. 🤖🫡

— Marc Andreessen 🇺🇸 (@pmarca) January 24, 2025

Let’s dive into some of these key points.

1. RL algo: Group Relative Policy Optimisation (GRPO)

One key game-changing RL algorithm is Group Relative Policy Optimisation (GRPO), a variant of the widely popular Proximal Policy Optimisation (PPO). GRPO was introduced in the DeepSeekMath paper in Feb 2024.

Why GRPO over PPO?

PPO struggles with reasoning tasks due to:

Dependency on a critic model.
– PPO needs a separate critic model, effectively doubling memory and compute.
– Training the critic can be complex for nuanced or subjective tasks.
High computational cost, as RL pipelines demand substantial resources to evaluate and optimise responses.
Absolute reward evaluations
– When you rely on an absolute reward, meaning there’s a single standard or metric to judge whether an answer is “good” or “bad”, it can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains.

How GRPO addressed these challenges:

GRPO eliminates the critic model by using relative evaluation: responses are compared within a group rather than judged by a fixed standard.

Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers, learning from one another. Over time, performance converges toward higher quality.
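In code, "compared within a group" boils down to normalising each response's reward against the group's mean and standard deviation. The sketch below follows that idea; the reward values and the small eps constant (added to avoid dividing by zero) are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses for the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one question
# (1.0 = correct, 0.0 = wrong)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Answers that beat the group average get a positive advantage and the rest get a negative one, with no critic model needed to supply a baseline. If every answer in the group earns the same reward, all advantages are zero and that group contributes no learning signal.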

How does GRPO fit into the whole training process?

GRPO modifies how loss is calculated while keeping other training steps unchanged:

Gather data (queries + responses)
– For LLMs, queries are like questions
– The old policy (an older snapshot of the model) generates several candidate answers for each query
Assign rewards: each response in the group is scored (the “reward”).
Compute the GRPO loss
– Traditionally, you would compute a loss that shows the deviation between the model prediction and the true label.
– In GRPO, however, you measure:
a) How likely is the new policy to produce past responses?
b) Are those responses relatively better or worse?
c) Apply clipping to prevent extreme updates.
– This yields a scalar loss.
Backpropagation + gradient descent
– Backpropagation calculates how each parameter contributed to the loss
– Gradient descent updates those parameters to reduce the loss
– Over many iterations, this gradually shifts the new policy to prefer higher-reward responses
Update the old policy occasionally to match the new policy.
– This refreshes the baseline for the next round of comparisons.
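The loss step (a, b, c) can be sketched as a clipped, advantage-weighted probability ratio. This is a simplified illustration, not the DeepSeek implementation: it works on one log-probability per whole response rather than per token, it leaves out the KL penalty term used in the paper, and the clip range of 0.2 is an assumed value.

```python
import math

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss over a group of responses.

    logp_new / logp_old: log-probability of each response under the new
    and old policy. advantages: group-relative score of each response.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        # a) How much more (or less) likely is the response under the new policy?
        ratio = math.exp(ln - lo)
        # c) Clip the ratio so that a single update cannot move too far
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # b) Weight by whether the response was relatively better or worse
        terms.append(min(ratio * adv, clipped * adv))
    # Negate because optimisers minimise, while we want to maximise the objective
    return -sum(terms) / len(terms)

# Hypothetical numbers: the new policy has become twice as likely to produce
# a good response (advantage +1) and half as likely to produce a bad one.
loss = grpo_loss(
    logp_new=[math.log(2.0), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
print(round(loss, 3))
```

Once the ratio leaves the clip range in the direction the advantage favours, the term stops improving, which removes the incentive to push the policy further in a single update.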

2. Chain of thought (CoT)

Traditional LLM training follows pre-training → SFT → RL. However, DeepSeek-R1-Zero skipped SFT, allowing the model to directly explore CoT reasoning.

Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI’s o1 model also leverages this, as noted in its September 2024 report: o1’s performance improves with more RL (train-time compute) and more reasoning time (test-time compute).

DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning.

A key graph (below) in the paper showed increased thinking during training, leading to longer (more tokens), more detailed and better responses.

Without explicit programming, it began revisiting past reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training.

The model also had an “aha moment” (below), a fascinating example of how RL can lead to unexpected and sophisticated outcomes.

Note: Unlike DeepSeek-R1, OpenAI does not show the full, exact reasoning chains of thought in o1, as they are concerned about a distillation risk, where someone comes in and recovers much of the reasoning performance simply by imitating those reasoning traces. Instead, o1 shows only summaries of these chains of thought.

Reinforcement Learning with Human Feedback (RLHF)

For tasks with verifiable outputs (e.g., math problems, factual Q&A), AI responses can be easily evaluated. But what about areas like summarisation or creative writing, where there’s no single “correct” answer?

This is where human feedback comes in, but naive RL approaches are unscalable.

Let’s look at the naive approach with some arbitrary numbers.

That’s one billion human evaluations needed! This is too costly, slow and unscalable. Hence, a smarter solution is to train an AI “reward model” to learn human preferences, dramatically reducing human effort.
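For a feel of how quickly the numbers blow up, here is one breakdown that lands on one billion. The three figures are my own assumption, picked only because they multiply to that total; any similar combination makes the same point.

```python
# Assumed, purely illustrative figures
updates = 1_000               # number of training updates
prompts_per_update = 1_000    # prompts used in each update
rollouts_per_prompt = 1_000   # responses sampled for each prompt

# Every sampled response would need a human to score it
human_evaluations = updates * prompts_per_update * rollouts_per_prompt
print(f"{human_evaluations:,}")  # 1,000,000,000
```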

Ranking responses is also easier and more intuitive than absolute scoring.
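A common way to train a reward model from rankings is a pairwise loss: the loss is small when the preferred response gets a higher score than the rejected one. Nothing above says which loss is used, so treat this Bradley-Terry style sketch, the helper names, and the example scores as assumptions.

```python
import math

def pairwise_ranking_loss(score_preferred, score_rejected):
    """-log(sigmoid(preferred - rejected)), in a numerically stable form."""
    diff = score_preferred - score_rejected
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def ranking_to_pairs(ranked_items):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return [
        (ranked_items[i], ranked_items[j])
        for i in range(len(ranked_items))
        for j in range(i + 1, len(ranked_items))
    ]

# Hypothetical reward-model scores for three summaries, ranked by a human
# from best to worst
ranked_scores = [1.5, 0.2, -0.7]
pairs = ranking_to_pairs(ranked_scores)
total_loss = sum(pairwise_ranking_loss(p, r) for p, r in pairs)
print(len(pairs), "pairs, total loss", round(total_loss, 3))
```

A single ranking of n responses yields n(n-1)/2 training pairs, so one quick human judgement produces several training signals.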

Upsides of RLHF

Can be applied to any domain, including creative writing, poetry, summarisation, and other open-ended tasks.
Ranking outputs is much easier for human labellers than generating creative outputs themselves.

Downsides of RLHF

The reward model is an approximation: it may not perfectly reflect human preferences.
RL is good at gaming the reward model: if run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores.

Do note that RLHF is not the same as traditional RL.

For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences.

Conclusion

And that’s a wrap! I hope you enjoyed Part 2 🙂 If you haven’t already read Part 1, do check it out here.

Got questions or ideas for what I should cover next? Drop them in the comments. I’d love to hear your thoughts. See you in the next article!
