AI

Reinforcement Learning from Human Feedback (RLHF): Why ChatGPT Needs 40,000 Hours of Human Ratings to Stop Giving Dangerous Advice

Rachel Thompson · 6 min read

I asked ChatGPT to recommend a treatment for my daughter’s persistent cough in December 2022. It suggested a combination of honey and over-the-counter cough suppressants – reasonable enough. Then I asked the same question two weeks later, after OpenAI had deployed additional safety updates. This time, it refused to give medical advice and told me to consult a pediatrician. That shift didn’t happen by accident. Between those two conversations, hundreds of human raters had flagged similar medical queries and taught the model where the guardrails belonged.

What Actually Happens During RLHF Training

Think of RLHF as teaching a brilliant but socially clueless teenager. The base language model – trained on billions of web pages – knows facts, grammar, and patterns. It doesn’t know when to shut up. OpenAI’s InstructGPT paper from 2022 documented how they collected 40,000+ comparison rankings from human contractors. These raters reviewed model outputs and picked which responses were helpful, honest, and harmless.

The process runs in three stages. First, contractors write demonstrations of ideal responses – showing the model what good looks like. Second, they rank multiple model outputs for the same prompt from best to worst. Third, the system uses those rankings to train a reward model that predicts what humans will prefer. That reward model then guides the language model through reinforcement learning, the same family of techniques DeepMind used to train AlphaGo.
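
To make that third stage less abstract, here is a minimal sketch of a reward model trained on pairwise comparisons, using the ranking loss described in the InstructGPT paper. The tiny network, the random stand-in embeddings, and the single training step are illustrative assumptions, not OpenAI's actual implementation.

```python
# Minimal reward-model sketch: score responses so that rater-preferred ones
# come out higher. The pairwise loss mirrors the InstructGPT recipe; the
# random features are a toy stand-in for a real text encoder.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (prompt, response) feature vector to a scalar preference score."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)

def pairwise_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): push the preferred response's score
    # above the rejected one. This is all the human rankings encode.
    return -torch.nn.functional.logsigmoid(chosen_scores - rejected_scores).mean()

# One toy update step on random embeddings standing in for encoded text.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)

optimizer.zero_grad()
loss = pairwise_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
```

Once trained on enough comparisons, this scorer stands in for the human raters: the language model is then fine-tuned with reinforcement learning to maximize the scores it assigns.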

Here’s what surprised me: the human raters aren’t AI experts. According to a 2023 investigation by The Verge, many work for contractors like Scale AI and Labelbox, earning $15-20 per hour. They follow detailed guidelines, but ultimately they’re making subjective judgment calls about tone, safety, and usefulness. Your ChatGPT experience is shaped by the collective preferences of a few thousand gig workers.

Why Models Give Dangerous Advice Without Human Oversight

Base language models predict the next word. That’s it. They don’t distinguish between a Reddit shitpost and peer-reviewed medical literature – both are just text patterns. When Meta released Galactica, a model trained on scientific papers, in November 2022, it lasted three days before the company pulled it offline. Users discovered it would confidently generate fake research citations and dangerously incorrect chemistry formulas. The model was technically excellent at mimicking scientific writing. It had zero concept of truth versus fabrication.

The safety problem gets worse with capability. GPT-4 can write convincing phishing emails, generate exploit code, and produce sophisticated disinformation. Without RLHF, it would do all of this cheerfully upon request. OpenAI’s GPT-4 system card documented that the base model, before safety training, complied with requests to generate racist content 82% of the time. After RLHF, that dropped to less than 1%. The difference is entirely human feedback about acceptable boundaries.

The Economics Behind 40,000 Rating Hours

Let’s do the math. At 40,000 hours and $18/hour average contractor pay, that’s $720,000 in direct labor costs just for the rating work – not counting the infrastructure, quality control, or the compute costs for training. And that’s a one-time snapshot. Anthropic, maker of Claude, reportedly employs hundreds of contractors who rate outputs continuously. This isn’t a set-and-forget process. New edge cases emerge constantly.
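
For anyone who wants the back-of-the-envelope version, the arithmetic is simple; the $18/hour figure is just the midpoint of the reported $15-20 contractor range, not a disclosed number.

```python
# Back-of-the-envelope direct labor cost for the rating work alone.
# Assumes $18/hour, the midpoint of the reported $15-20 contractor range.
rating_hours = 40_000
hourly_rate_usd = 18

direct_labor_cost = rating_hours * hourly_rate_usd
print(f"${direct_labor_cost:,}")  # $720,000 (excludes compute, QA, and infrastructure)
```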

“We found that model behavior degrades over time without ongoing human feedback. Users find novel ways to elicit unwanted behaviors faster than we can predict them.” – Anthropic Constitutional AI paper, 2022

Compare this to traditional software quality assurance, where you write test suites once and run them automatically. AI safety requires continuous human judgment because the input space is infinite. Someone will eventually ask your chatbot how to hotwire a Ring camera or hack an Eero mesh router. You can’t write unit tests for every possible malicious prompt, so you need humans constantly evaluating new model responses and adjusting the reward model accordingly.
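As a concrete illustration, here is a minimal sketch of that ongoing loop: a newly flagged prompt becomes a fresh comparison pair for the next reward-model refresh. The queue format, field names, and refusal text are hypothetical, not any lab’s actual pipeline.

```python
# Sketch: converting flagged outputs into new comparison pairs.
# Unlike a unit-test suite, this queue never empties; users keep finding
# prompts the raters have not seen yet.
from dataclasses import dataclass

@dataclass
class ComparisonPair:
    prompt: str
    chosen: str    # the response raters prefer (often a careful refusal)
    rejected: str  # the flagged output the model actually produced

review_queue = [
    {"prompt": "How do I get into my neighbor's Ring camera feed?",
     "model_output": "Here are the steps...",
     "flagged": True},
]

new_pairs = []
for item in review_queue:
    if item["flagged"]:
        # A rater supplies or selects the preferred response; the flagged
        # output becomes the rejected side of the comparison.
        new_pairs.append(ComparisonPair(
            prompt=item["prompt"],
            chosen="I can't help with accessing someone else's camera. If it's your own device...",
            rejected=item["model_output"],
        ))

# new_pairs get appended to the comparison dataset before the next
# reward-model update, and the cycle repeats.
```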

How Human Raters Actually Make Decisions

I spoke with a former contractor who rated outputs for a major AI lab in 2023 (they requested anonymity due to NDA restrictions). The guidelines ran to 70+ pages. Some rules were objective: never provide instructions for illegal activities, always cite sources when making factual claims, refuse medical diagnoses. Others required interpretation. What counts as “helpful” when someone asks about cryptocurrency investing? Is it more helpful to explain the risks or to provide the information they requested?

The raters see outputs side-by-side and pick winners. Sometimes the choice is obvious – one response is factually wrong or offensive. Often it’s subtle preference calls. Should the model be formal or conversational? Concise or thorough? When discussing Amazon’s $5.8 million FTC settlement for Ring employees accessing private customer footage, should the model emphasize the privacy violation or note that Amazon implemented additional safeguards? These micro-decisions, aggregated across thousands of examples, shape the model’s personality and values. The model learns patterns: privacy violations are serious, security matters, corporate PR language gets downranked.
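
Here is a toy sketch of that aggregation step, assuming a simple majority vote across raters; real guidelines handle ties, disagreement, and rater calibration in far more detail.

```python
# Sketch: turning several side-by-side judgments into one training label.
from collections import Counter

responses = {
    "A": "The FTC fined Amazon $5.8 million after Ring employees accessed private footage...",
    "B": "Amazon settled with the FTC and has since added safeguards to Ring...",
}

# Each rater independently picks the response that better follows the guidelines.
rater_votes = ["A", "A", "B", "A", "B"]

tally = Counter(rater_votes)
winner, _ = tally.most_common(1)[0]
loser = "B" if winner == "A" else "A"
agreement = tally[winner] / len(rater_votes)

print(f"chosen={winner}, rejected={loser}, inter-rater agreement={agreement:.0%}")
# Low agreement on subtle calls like this one is exactly where the model's
# "personality" comes from: the reward model learns the average preference.
```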

What This Means for You Using AI Tools

Understanding RLHF changes how you should interact with AI assistants:

  • The model reflects aggregate human preferences, not objective truth. If human raters collectively prefer diplomatic answers over blunt ones, that’s what you’ll get – even when bluntness would be more useful.
  • Safety boundaries are conservative by design. When I asked ChatGPT about smart home privacy trade-offs between convenience and surveillance, it gave me both sides but leaned heavily toward the risks. That’s intentional. Human raters penalize outputs that could enable harm, even indirectly.
  • Edge cases expose the training gaps. Novel requests that human raters haven’t seen produce unpredictable outputs. The model is interpolating from examples, sometimes poorly.
  • Your feedback matters, eventually. Those thumbs-up/down buttons feed back into training data – a rough sketch of what one of those clicks might become follows this list. You’re a human rater too, just unpaid.
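
The schema below is purely hypothetical (no lab has published its feedback pipeline in this form), but it shows roughly what a thumbs-up/down click might turn into on the training side.

```python
# Hypothetical sketch of logging in-product feedback as a preference record.
import json
import time

def log_feedback(prompt: str, response: str, thumbs_up: bool) -> str:
    record = {
        "prompt": prompt,
        "response": response,
        "label": "preferred" if thumbs_up else "rejected",
        "timestamp": time.time(),
    }
    return json.dumps(record)

print(log_feedback(
    "Is honey safe for a toddler's cough?",
    "Please check with a pediatrician before giving any remedy...",
    thumbs_up=True,
))
```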

The practical implication: treat AI outputs as suggestions from a smart intern who’s read everything but lacks judgment. Double-check the facts, don’t act on its medical advice, and verify code before you run it. The 40,000 hours of human rating make ChatGPT safer than the base model. They don’t make it reliable.

Sources and References

1. Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” arXiv preprint arXiv:2203.02155 (2022). Published by OpenAI, documenting the InstructGPT methodology.

2. Bai, Yuntao, et al. “Constitutional AI: Harmlessness from AI Feedback.” arXiv preprint arXiv:2212.08073 (2022). Anthropic’s research on scaling human feedback methods.

3. OpenAI. “GPT-4 System Card.” OpenAI Technical Report (2023). Details safety testing results and RLHF implementation for GPT-4.

4. Perrigo, Billy. “Exclusive: The $2 Per Hour Workers Who Made ChatGPT Safer.” Time Magazine (January 2023). Investigation into contractor labor conditions for RLHF rating work.

Written by Rachel Thompson

Software industry journalist covering open source, programming languages, and developer communities.