Large language models don’t come “aligned” out of the box. To steer them toward desired behaviors, we use a mix of tools: supervised learning, preference modeling, reinforcement learning, and safety constraints like KL penalties and clipping.
This post gives you a quick tour of the main building blocks (SFT, DPO, KTO, ORPO, PPO, GRPO/RLOO, reward models), a shared notation for their moving parts, and a worked example of a custom RL trainer built from the same pieces.

Goal: By the end, you'll understand how these components fit together and you'll be ready to design your own custom loss function to align a language model toward your goals.
The reusable building blocks, at a glance. Throughout, let Δ = log π_θ(y|x) − log π_ref(y|x) be the log-ratio of the policy against the reference:

- **SFT** (teacher forcing)
- **DPO** (paired preferences, fixed reference): compares Δ on the chosen vs. the rejected response (a code sketch follows this list)
- **KTO** (unpaired ± with margins on Δ)
- **ORPO** (positives-only, odds tilt beyond SFT)
- **PPO** (RL with clipping, token-wise)
- **GRPO / RLOO** (group baselines, no critic): for a prompt, sample K responses with rewards r_i, then use A_i = r_i − mean of the other rewards as the advantage
- **Reward Model** (pairwise Bradley–Terry)
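To make the preference-based blocks concrete, here is a minimal sketch of a DPO-style loss written directly in terms of Δ. It assumes you have already computed summed per-sequence log-probs for the chosen and rejected responses under π_θ and π_ref; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_chosen, logp_theta_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """Paired-preference (DPO-style) loss from summed sequence log-probs.

    Each argument is a 1-D tensor with one entry per preference pair;
    beta is the preference temperature from the notation list below.
    """
    # Δ for the chosen and the rejected response
    delta_chosen = logp_theta_chosen - logp_ref_chosen
    delta_rejected = logp_theta_rejected - logp_ref_rejected
    # Push the margin (Δ_chosen − Δ_rejected) up through a logistic link
    return -F.logsigmoid(beta * (delta_chosen - delta_rejected)).mean()

# toy usage with random log-probs for four pairs
args = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*args).item())
```

KTO works with the same Δ but scores good and bad examples independently against margins instead of requiring pairs.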
Notation cheat sheet:

- π_θ(a|s) — the policy (your model).
- π_ref — fixed reference (SFT/base). We do not change this model while training.
- ρ_t — importance ratio = π_θ/π_old at token t; PPO clips it.
- r, r_t — reward (terminal RM score, graded rules, PRM steps, safety bonuses, KL shaping).
- A, Â_t — advantage (how much better than expected). Token-wise via GAE or sequence-wise via group baselines.
- V, V̂ — value and its target (critic).
- H[π] — entropy (small bonus for exploration/diversity).
- KL(π_θ || π_ref) — proximity to reference (guardrail).
- β, λ, α, ε — strength knobs (preference temp, KL weight, odds/margin slopes, PPO clip).

Idea: Both token-wise and sequence-wise recipes raise probability where behavior is good and lower it where it's bad. They differ in credit granularity and scaffolding.
- **Token-wise (PPO):** credit each token with its advantage Â_t (GAE) × ratio ρ_t; PPO clips ρ_t (the clipped surrogate is sketched in code below). Needs a learned critic V. Normalize Â_t, watch KL.
- **Sequence-wise (GRPO / RLOO):** score whole completions with rewards r_i → group advantage A_i = r_i − mean_others; apply A_i to all tokens.

Quick choose: need fine-grained, per-token credit and can afford a critic? Go token-wise. Happy with one completion-level reward and a group baseline? Go sequence-wise.
Stability tips: keep KL‑to‑ref, small entropy; clip ratios (token/sequence); length‑normalize if needed.
Log: reward, KL, entropy, length; (token‑wise) value loss & explained var; (sequence‑wise) win‑rate and group variance.
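For the token-wise branch, the heart of PPO is the clipped surrogate applied per token. A minimal sketch, assuming you already have per-token log-probs under π_θ and π_old, GAE advantages, and a response mask (all tensor names are illustrative):

```python
import torch

def ppo_clipped_surrogate(logp_new, logp_old, advantages, mask, eps=0.2):
    """Token-wise PPO objective term (to be maximized).

    logp_new, logp_old: (batch, seq) per-token log-probs under π_θ and π_old.
    advantages:         (batch, seq) GAE advantages Â_t.
    mask:               (batch, seq) 1 for response tokens, 0 for prompt/padding.
    """
    rho = torch.exp(logp_new - logp_old)                    # importance ratio ρ_t
    unclipped = rho * advantages
    clipped = torch.clamp(rho, 1 - eps, 1 + eps) * advantages
    per_token = torch.minimum(unclipped, clipped)           # pessimistic bound
    return (per_token * mask).sum() / mask.sum()            # masked mean over tokens

# toy usage
b, t = 2, 5
logp_new, logp_old = torch.randn(b, t), torch.randn(b, t)
adv, mask = torch.randn(b, t), torch.ones(b, t)
print(ppo_clipped_surrogate(logp_new, logp_old, adv, mask).item())
```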
Either way, keep the ratio ρ under control, and stay close to π_ref (safety/fluency) or to π_old (proximal).

Now let's actually build something cool with this stuff — through the story of a math student we want to coach.
Our student (the current model) can handle easy algebra but struggles with harder ideas. If we only reward final correctness, he might just memorize surface patterns … see sin 45°
often enough and he blurts the right number without understanding. We don’t want a pattern‑guesser; we want a reasoner.
So we'll coach him like a real tutor would:

- grade the outcome of each full attempt (completion-level reward; a toy grader is sketched below),
- let him try a few approaches per problem and compare them against each other (group baseline, so no critic is needed),
- nudge him to explore a little (small entropy bonus),
- keep him from drifting away from the teacher's style (KL to the reference).

Put together, this is a custom RL trainer that fits our story: completion-level rewards for the outcome, entropy for gentle exploration, and KL to keep the student aligned with the teacher — using a group baseline so we don't need a critic.
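As a concrete stand-in for "grade the outcome", here is a toy completion-level reward that only checks the final numeric answer. The signature and the answer extraction are illustrative; a real grader (unit tests, rubric scores, a PRM) would slot in instead.

```python
import re

def reward(prompt: str, completion: str, expected_answer: str) -> float:
    """Toy completion-level reward: 1.0 if the last number in the completion
    matches the expected answer, else 0.0. Real graders (unit tests, rubric
    scores) would replace this."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == expected_answer else 0.0

# toy usage
print(reward("What is 12*3?", "12*3 = 36, so the answer is 36", "36"))  # 1.0
print(reward("What is 12*3?", "The answer is 35", "36"))                # 0.0
```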
Group-relative advantage for K samples on the same problem:

A_i = r_i − (1/(K−1)) · Σ_{j≠i} r_j

Sequence ratio (proximal control):

ρ_i = π_θ(y_i | x) / π_old(y_i | x)

Proximity & exploration terms (token-wise expectations):

KL_ref = mean_t KL(π_θ(·|s_t) || π_ref(·|s_t)),   H = mean_t H[π_θ(·|s_t)]

**Objective (maximize):**

J = mean_i [ min(ρ_i · A_i, clip(ρ_i, 1−ε, 1+ε) · A_i) ] − λ · KL_ref + c_H · H

**Training loss (minimize):** loss = −J
Notes: ε keeps updates safe; tune λ to a target per-token KL (≈ 0.02–0.1); set a small c_H (≈ 1e−3) so the student explores without getting noisy.
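The note about tuning λ toward a target per-token KL can be automated. One simple option, not part of the recipe above but a common trick, is a proportional controller that nudges λ after each step; kl_target and the gain below are illustrative defaults.

```python
def update_kl_coeff(lam, measured_kl, kl_target=0.05, gain=0.1,
                    lam_min=1e-3, lam_max=10.0):
    """Nudge λ up when the measured per-token KL overshoots the target,
    down when it undershoots, keeping it inside a safe range."""
    error = (measured_kl - kl_target) / kl_target
    lam = lam * (1.0 + gain * error)
    return max(lam_min, min(lam_max, lam))

# toy usage: KL drifting above the 0.05 target pushes λ up
lam = 0.1
for measured in (0.02, 0.08, 0.12):
    lam = update_kl_coeff(lam, measured)
    print(round(lam, 4))
```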
```python
# Pseudocode sketch: sample, reward, logp, clip, mean_tokenwise_KL,
# mean_tokenwise_entropy, contexts_from and maybe_update_sampling_policy
# are placeholders for your own generation, grading and scoring utilities.
for x in batch_prompts:
    # Sample K candidates from π_old (the student before this step)
    Y = [sample(pi_old, x, max_new_tokens) for _ in range(K)]

    # Grade each completion (exact answer / unit tests / graded score)
    R = [reward(x, y) for y in Y]  # in [0, 1] or graded

    # Group-relative advantages: each sample vs. the mean of the others
    mean_others = [(sum(R) - R[i]) / (K - 1) for i in range(K)]
    A = [R[i] - mean_others[i] for i in range(K)]

    # Sequence ratios ρ_i and the clipped actor objective
    rho = [exp(logp(pi_theta, x, y) - logp(pi_old, x, y)) for y in Y]
    J_actor = sum(
        min(rho[i] * A[i], clip(rho[i], 1 - eps, 1 + eps) * A[i])
        for i in range(K)
    ) / K

    # Proximity to the reference and exploration bonus (token-wise means)
    KL_ref = mean_tokenwise_KL(pi_theta, pi_ref, contexts_from(Y))
    H = mean_tokenwise_entropy(pi_theta, contexts_from(Y))

    # Full objective (maximize) and training loss (minimize)
    J = J_actor - lam * KL_ref + c_H * H  # lam is the KL weight λ, c_H the entropy weight
    loss = -J
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Occasionally (e.g., every N steps) refresh the sampling policy
    pi_old = maybe_update_sampling_policy(pi_theta, every_n_steps)
```
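The loop leaves two helpers abstract: mean_tokenwise_KL and mean_tokenwise_entropy. One way they could look in PyTorch, assuming you run π_θ and π_ref over the sampled contexts and collect per-token logits (so the signatures here take logits and a mask rather than the models themselves):

```python
import torch
import torch.nn.functional as F

def mean_tokenwise_KL(logits_theta, logits_ref, mask):
    """Mean per-token KL(π_θ || π_ref) over response tokens.

    logits_theta, logits_ref: (batch, seq, vocab) logits for the same contexts.
    mask:                     (batch, seq) 1 for response tokens, 0 otherwise.
    """
    logp_theta = F.log_softmax(logits_theta, dim=-1)
    logp_ref = F.log_softmax(logits_ref, dim=-1)
    kl = (logp_theta.exp() * (logp_theta - logp_ref)).sum(-1)  # (batch, seq)
    return (kl * mask).sum() / mask.sum()

def mean_tokenwise_entropy(logits_theta, mask):
    """Mean per-token entropy H[π_θ] over response tokens."""
    logp = F.log_softmax(logits_theta, dim=-1)
    ent = -(logp.exp() * logp).sum(-1)  # (batch, seq)
    return (ent * mask).sum() / mask.sum()

# toy shapes
b, t, v = 2, 6, 32
lt, lr = torch.randn(b, t, v), torch.randn(b, t, v)
m = torch.ones(b, t)
print(mean_tokenwise_KL(lt, lr, m).item(), mean_tokenwise_entropy(lt, m).item())
```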
Milu, Catalin. Friendly Introduction to LLM Alignment With Reusable Building Blocks. 2025. https://1y33.github.io/blog/alignment_of_llms/