Large language models don’t come “aligned” out of the box. To steer them toward desired behaviors, we use a mix of tools: supervised learning, preference modeling, reinforcement learning, and safety constraints like KL penalties and clipping.
This post gives you a quick tour of the main building blocks (SFT, DPO, KTO, ORPO, PPO, GRPO/RLOO, reward models), a shared notation for their moving parts, and a worked example of a custom RL trainer built from the same pieces.

Goal: By the end, you'll understand how these components fit together and you'll be ready to design your own custom loss function to align a language model toward your goals.
The reusable building blocks, at a glance. Throughout, let Δ = log π_θ(y|x) − log π_ref(y|x) be the log-ratio of the policy against the reference:

- **SFT** (teacher forcing)
- **DPO** (paired preferences, fixed reference): compares Δ on the chosen vs. the rejected response (a code sketch follows this list)
- **KTO** (unpaired ± with margins on Δ)
- **ORPO** (positives-only, odds tilt beyond SFT)
- **PPO** (RL with clipping, token-wise)
- **GRPO / RLOO** (group baselines, no critic): for a prompt, sample K responses with rewards r_i, then use A_i = r_i − mean of the other rewards as the advantage
- **Reward Model** (pairwise Bradley–Terry)
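To make the preference-based blocks concrete, here is a minimal sketch of a DPO-style loss written directly in terms of Δ. It assumes you have already computed summed per-sequence log-probs for the chosen and rejected responses under π_θ and π_ref; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_chosen, logp_theta_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """Paired-preference (DPO-style) loss from summed sequence log-probs.

    Each argument is a 1-D tensor with one entry per preference pair;
    beta is the preference temperature from the notation list below.
    """
    # Δ for the chosen and the rejected response
    delta_chosen = logp_theta_chosen - logp_ref_chosen
    delta_rejected = logp_theta_rejected - logp_ref_rejected
    # Push the margin (Δ_chosen − Δ_rejected) up through a logistic link
    return -F.logsigmoid(beta * (delta_chosen - delta_rejected)).mean()

# toy usage with random log-probs for four pairs
args = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*args).item())
```

KTO works with the same Δ but scores good and bad examples independently against margins instead of requiring pairs.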
Notation cheat sheet:

- π_θ(a|s) — the policy (your model).
- π_ref — fixed reference (SFT/base). We do not change this model while training.
- ρ_t — importance ratio = π_θ/π_old at token t; PPO clips it.
- r, r_t — reward (terminal RM score, graded rules, PRM steps, safety bonuses, KL shaping).
- A, Â_t — advantage (how much better than expected). Token-wise via GAE or sequence-wise via group baselines.
- V, V̂ — value and its target (critic).
- H[π] — entropy (small bonus for exploration/diversity).
- KL(π_θ || π_ref) — proximity to reference (guardrail).
- β, λ, α, ε — strength knobs (preference temp, KL weight, odds/margin slopes, PPO clip).

Idea: Both token-wise and sequence-wise recipes raise probability where behavior is good and lower it where it's bad. They differ in credit granularity and scaffolding.
- **Token-wise (PPO):** credit each token with its advantage Â_t (GAE) × ratio ρ_t; PPO clips ρ_t (the clipped surrogate is sketched in code below). Needs a learned critic V. Normalize Â_t, watch KL.
- **Sequence-wise (GRPO / RLOO):** score whole completions with rewards r_i → group advantage A_i = r_i − mean_others; apply A_i to all tokens.

Quick choose: need fine-grained, per-token credit and can afford a critic? Go token-wise. Happy with one completion-level reward and a group baseline? Go sequence-wise.
Stability tips: keep KL‑to‑ref, small entropy; clip ratios (token/sequence); length‑normalize if needed.
Log: reward, KL, entropy, length; (token‑wise) value loss & explained var; (sequence‑wise) win‑rate and group variance.
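For the token-wise branch, the heart of PPO is the clipped surrogate applied per token. A minimal sketch, assuming you already have per-token log-probs under π_θ and π_old, GAE advantages, and a response mask (all tensor names are illustrative):

```python
import torch

def ppo_clipped_surrogate(logp_new, logp_old, advantages, mask, eps=0.2):
    """Token-wise PPO objective term (to be maximized).

    logp_new, logp_old: (batch, seq) per-token log-probs under π_θ and π_old.
    advantages:         (batch, seq) GAE advantages Â_t.
    mask:               (batch, seq) 1 for response tokens, 0 for prompt/padding.
    """
    rho = torch.exp(logp_new - logp_old)                    # importance ratio ρ_t
    unclipped = rho * advantages
    clipped = torch.clamp(rho, 1 - eps, 1 + eps) * advantages
    per_token = torch.minimum(unclipped, clipped)           # pessimistic bound
    return (per_token * mask).sum() / mask.sum()            # masked mean over tokens

# toy usage
b, t = 2, 5
logp_new, logp_old = torch.randn(b, t), torch.randn(b, t)
adv, mask = torch.randn(b, t), torch.ones(b, t)
print(ppo_clipped_surrogate(logp_new, logp_old, adv, mask).item())
```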
Either way, keep the ratio ρ under control, and stay close to π_ref (safety/fluency) or to π_old (proximal).

Now let's actually build something cool with this stuff — through the story of a math student we want to coach.
Our student (the current model) can handle easy algebra but struggles with harder ideas. If we only reward final correctness, he might just memorize surface patterns … see sin 45°
often enough and he blurts the right number without understanding. We don’t want a pattern‑guesser; we want a reasoner.
So we'll coach him like a real tutor would:

- grade the outcome of each full attempt (completion-level reward; a toy grader is sketched below),
- let him try a few approaches per problem and compare them against each other (group baseline, so no critic is needed),
- nudge him to explore a little (small entropy bonus),
- keep him from drifting away from the teacher's style (KL to the reference).

Put together, this is a custom RL trainer that fits our story: completion-level rewards for the outcome, entropy for gentle exploration, and KL to keep the student aligned with the teacher — using a group baseline so we don't need a critic.
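As a concrete stand-in for "grade the outcome", here is a toy completion-level reward that only checks the final numeric answer. The signature and the answer extraction are illustrative; a real grader (unit tests, rubric scores, a PRM) would slot in instead.

```python
import re

def reward(prompt: str, completion: str, expected_answer: str) -> float:
    """Toy completion-level reward: 1.0 if the last number in the completion
    matches the expected answer, else 0.0. Real graders (unit tests, rubric
    scores) would replace this."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == expected_answer else 0.0

# toy usage
print(reward("What is 12*3?", "12*3 = 36, so the answer is 36", "36"))  # 1.0
print(reward("What is 12*3?", "The answer is 35", "36"))                # 0.0
```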
Group-relative advantage for K samples on the same problem:

A_i = r_i − (1/(K−1)) · Σ_{j≠i} r_j

Sequence ratio (proximal control):

ρ_i = π_θ(y_i | x) / π_old(y_i | x)

Proximity & exploration terms (token-wise expectations):

KL_ref = mean_t KL(π_θ(·|s_t) || π_ref(·|s_t)),   H = mean_t H[π_θ(·|s_t)]

**Objective (maximize):**

J = mean_i [ min(ρ_i · A_i, clip(ρ_i, 1−ε, 1+ε) · A_i) ] − λ · KL_ref + c_H · H

**Training loss (minimize):** loss = −J
Notes: ε keeps updates safe; tune λ to a target per-token KL (≈ 0.02–0.1); set a small c_H (≈ 1e−3) so the student explores without getting noisy.
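The note about tuning λ toward a target per-token KL can be automated. One simple option, not part of the recipe above but a common trick, is a proportional controller that nudges λ after each step; kl_target and the gain below are illustrative defaults.

```python
def update_kl_coeff(lam, measured_kl, kl_target=0.05, gain=0.1,
                    lam_min=1e-3, lam_max=10.0):
    """Nudge λ up when the measured per-token KL overshoots the target,
    down when it undershoots, keeping it inside a safe range."""
    error = (measured_kl - kl_target) / kl_target
    lam = lam * (1.0 + gain * error)
    return max(lam_min, min(lam_max, lam))

# toy usage: KL drifting above the 0.05 target pushes λ up
lam = 0.1
for measured in (0.02, 0.08, 0.12):
    lam = update_kl_coeff(lam, measured)
    print(round(lam, 4))
```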
```python
# Pseudocode sketch: sample, reward, logp, clip, mean_tokenwise_KL,
# mean_tokenwise_entropy, contexts_from and maybe_update_sampling_policy
# are placeholders for your own generation, grading and scoring utilities.
for x in batch_prompts:
    # Sample K candidates from π_old (the student before this step)
    Y = [sample(pi_old, x, max_new_tokens) for _ in range(K)]

    # Grade each completion (exact answer / unit tests / graded score)
    R = [reward(x, y) for y in Y]  # in [0, 1] or graded

    # Group-relative advantages: each sample vs. the mean of the others
    mean_others = [(sum(R) - R[i]) / (K - 1) for i in range(K)]
    A = [R[i] - mean_others[i] for i in range(K)]

    # Sequence ratios ρ_i and the clipped actor objective
    rho = [exp(logp(pi_theta, x, y) - logp(pi_old, x, y)) for y in Y]
    J_actor = sum(
        min(rho[i] * A[i], clip(rho[i], 1 - eps, 1 + eps) * A[i])
        for i in range(K)
    ) / K

    # Proximity to the reference and exploration bonus (token-wise means)
    KL_ref = mean_tokenwise_KL(pi_theta, pi_ref, contexts_from(Y))
    H = mean_tokenwise_entropy(pi_theta, contexts_from(Y))

    # Full objective (maximize) and training loss (minimize)
    J = J_actor - lam * KL_ref + c_H * H  # lam is the KL weight λ, c_H the entropy weight
    loss = -J
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Occasionally (e.g., every N steps) refresh the sampling policy
    pi_old = maybe_update_sampling_policy(pi_theta, every_n_steps)
```
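The loop leaves two helpers abstract: mean_tokenwise_KL and mean_tokenwise_entropy. One way they could look in PyTorch, assuming you run π_θ and π_ref over the sampled contexts and collect per-token logits (so the signatures here take logits and a mask rather than the models themselves):

```python
import torch
import torch.nn.functional as F

def mean_tokenwise_KL(logits_theta, logits_ref, mask):
    """Mean per-token KL(π_θ || π_ref) over response tokens.

    logits_theta, logits_ref: (batch, seq, vocab) logits for the same contexts.
    mask:                     (batch, seq) 1 for response tokens, 0 otherwise.
    """
    logp_theta = F.log_softmax(logits_theta, dim=-1)
    logp_ref = F.log_softmax(logits_ref, dim=-1)
    kl = (logp_theta.exp() * (logp_theta - logp_ref)).sum(-1)  # (batch, seq)
    return (kl * mask).sum() / mask.sum()

def mean_tokenwise_entropy(logits_theta, mask):
    """Mean per-token entropy H[π_θ] over response tokens."""
    logp = F.log_softmax(logits_theta, dim=-1)
    ent = -(logp.exp() * logp).sum(-1)  # (batch, seq)
    return (ent * mask).sum() / mask.sum()

# toy shapes
b, t, v = 2, 6, 32
lt, lr = torch.randn(b, t, v), torch.randn(b, t, v)
m = torch.ones(b, t)
print(mean_tokenwise_KL(lt, lr, m).item(), mean_tokenwise_entropy(lt, m).item())
```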
Milu, Catalin. Friendly Introduction to LLM Alignment With Reusable Building Blocks. 2025. https://1y33.github.io/blog/alignment_of_llms/