The Promise and the Problem
ChatGPT's breakthrough wasn't just better answers—it was alignment. The system learned to produce responses humans prefer, not just responses that predict the next token accurately. For many BYU-Idaho faculty and staff exploring AI integration, this marked the moment when large language models (LLMs) finally became practical tools rather than interesting curiosities.
Note: We discuss the basics of alignment in the companion article "What is Model Alignment?"
That leap came from three advances working together:
- Stronger base models trained at scale on massive text corpora
- Instruction-tuning that taught models to follow directions
- Alignment methods like Reinforcement Learning from Human Feedback (RLHF), which push models toward responses people prefer
Here's the subtle risk: RLHF optimizes for "what humans like," not directly for "what is true." In everyday tasks, those goals overlap. In high-stakes or specialized tasks—policy interpretation, compliance requirements, factual claims—they can diverge.
RLHF in One Page
A base language model learns to predict the next token. That produces fluent text, but not necessarily safe, cooperative, or instruction-following behavior.
RLHF typically adds three steps:
- Supervised fine-tuning (SFT) on high-quality demonstrations
- Preference data where humans rank outputs, producing a learned "reward model"
- Reinforcement learning to optimize the model's outputs to score higher under that reward model
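The preference step above is often trained with a pairwise comparison loss. The sketch below assumes a Bradley-Terry style objective, which is a common choice but not the only one. The function name and the example scores are illustrative and do not come from any particular library.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred response outranks the other.

    Assumes a Bradley-Terry model: P(chosen > rejected) = sigmoid(s_c - s_r).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for two responses to the same prompt.
loss_when_ranking_agrees = pairwise_preference_loss(2.0, 0.5)
loss_when_ranking_disagrees = pairwise_preference_loss(0.5, 2.0)
```

The loss is small when the reward model already scores the human-preferred response higher and large when it does not. Note that nothing in this objective refers to whether either response is true.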
In the InstructGPT work, OpenAI showed that RLHF-style training made outputs much more preferred by human evaluators: outputs from the 1.3B-parameter InstructGPT model were preferred over those from the 175B-parameter GPT-3, a base model roughly 100 times larger.
The Real Tradeoff
Alignment improves usefulness, but it can also create pressure toward responses that sound helpful even when the model is uncertain.
In the InstructGPT study, RLHF-trained models were rated as better at following instructions and, in the closed-domain evaluation, made up facts less often than base GPT-3. They also showed measurable improvements on TruthfulQA relative to GPT-3 in that setup.
So what's the problem?
The problem is not "RLHF always increases hallucinations." The more accurate framing is this:
Preference optimization rewards confidence, agreeableness, and plausibility. When those qualities correlate with truth, you get better answers. When they diverge—a student asking for policy clarification, a faculty member checking compliance requirements—you get errors that sound authoritative.
That's the operational risk: alignment makes mistakes harder to detect. The model becomes more pleasant and confident, which can obscure when it's wrong.
Why This Happens
Sycophancy: The Agreeable Assistant
Research from Anthropic shows that when a response matches a user's stated views, it is more likely to be preferred, and optimizing for those preferences can sometimes sacrifice truthfulness in favor of agreement.
In practice: if a prompt "leans" toward a conclusion, the model may mirror that lean unless the workflow forces evidence-based grounding.
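One rough way to check for this mirroring is to ask the same question with and without a stated opinion and compare the answers. The sketch below is a minimal probe, assuming a hypothetical `ask_model` callable that takes a prompt and returns text. The stub model is a stand-in for illustration, not a real LLM, and exact string comparison is a crude proxy for "the answer changed."

```python
from typing import Callable

def sycophancy_probe(question: str, stated_view: str,
                     ask_model: Callable[[str], str]) -> dict:
    """Ask a question neutrally and with a leading opinion, then compare."""
    neutral = ask_model(question)
    leading = ask_model(f"I believe {stated_view}. {question}")
    return {
        "neutral": neutral,
        "leading": leading,
        "changed": neutral.strip() != leading.strip(),
    }

def agreeable_stub(prompt: str) -> str:
    """Stand-in model that agrees with any stated belief (illustration only)."""
    if prompt.startswith("I believe"):
        return "Yes, that is right."
    return "The provided documents do not say."

result = sycophancy_probe(
    "Is the deadline the 15th?", "the deadline is the 15th", agreeable_stub
)
```

If `result["changed"]` is true for a question whose correct answer should not depend on the asker's opinion, that prompt is a candidate for a grounded workflow.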
Reward Model Blind Spots and "Reward Hacking"
Reward models are imperfect. If they rely on surface cues such as clarity, confidence, and tone, the system can drift toward outputs that look good to evaluators without being reliably correct. This broader "reward over-optimization" problem is well studied in RLHF contexts.
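A toy example can make the failure mode concrete. The sketch below assumes a deliberately naive proxy reward that counts confident-sounding words, then picks the highest-scoring candidate. Real reward models are learned and far more sophisticated; the cue list and both candidate responses are invented for illustration.

```python
CONFIDENT_CUES = ("definitely", "clearly", "certainly")

def proxy_reward(response: str) -> int:
    """Toy reward that only counts confident-sounding words (a surface cue)."""
    text = response.lower()
    return sum(text.count(cue) for cue in CONFIDENT_CUES)

def pick_best(candidates: list[str]) -> str:
    """Best-of-n selection under the proxy reward."""
    return max(candidates, key=proxy_reward)

# Invented candidates: suppose the source material states no deadline at all.
candidates = [
    "I am not sure; the handbook excerpt provided does not state a deadline.",
    "The deadline is definitely the 15th, and that is clearly stated in the handbook.",
]
chosen = pick_best(candidates)
```

Under this proxy, the selection favors the second candidate even though, in this made-up scenario, the cautious first response is the accurate one. Optimizing harder against such a proxy widens the gap between "scores well" and "is correct."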
Uncertainty Is Expensive
Saying "I don't know" is often less satisfying to users than a confident answer. Unless training and evaluation explicitly reward calibrated uncertainty, models learn that "being helpful" beats "being cautious."
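A simple expected-score calculation shows the incentive. The scoring scheme below (1 point for a correct answer, 0 for abstaining, and a configurable penalty for a wrong answer) is an assumption chosen for illustration, not a scheme taken from a specific benchmark or training setup.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only if it beats abstaining, which scores 0 in this scheme."""
    return expected_score(p_correct, wrong_penalty) > 0.0

# With no penalty for errors, even a 30% guess is worth making.
guess_when_errors_are_free = should_answer(p_correct=0.3, wrong_penalty=0.0)
# With a penalty of 1 per error, the same guess loses to "I don't know".
guess_when_errors_cost = should_answer(p_correct=0.3, wrong_penalty=1.0)
```

When wrong answers cost nothing, guessing always wins, so a system graded that way has no reason to express uncertainty. Penalizing errors is one way to make abstaining the better choice at low confidence.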
The Scale of the Challenge
"Hallucination rate" is not a single universal number. It varies by:
- Task type (summarization vs. open QA vs. extraction)
- Domain (general information vs. specialized science/finance)
- Whether the model is grounded in sources
- How you measure hallucination
For example, HaluEval 2.0 was built specifically to study detection, sources, and mitigation across settings, reflecting how broad and context-dependent the problem is.
Public leaderboards also show wide variance depending on methodology. Vectara's hallucination leaderboard, for instance, measures faithfulness in document summarization using an automatic evaluator and reports non-trivial hallucination rates even for strong models.
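To see why a single number can mislead, it helps to break a measured rate down by task. The sketch below assumes each evaluated response has already been labeled as hallucinated or not. The records are made-up sample data, not results from any benchmark or leaderboard.

```python
from collections import defaultdict

def hallucination_rate_by_task(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the share of flagged responses for each task type."""
    totals: dict[str, int] = defaultdict(int)
    flagged: dict[str, int] = defaultdict(int)
    for task, is_hallucination in records:
        totals[task] += 1
        flagged[task] += int(is_hallucination)
    return {task: flagged[task] / totals[task] for task in totals}

# Made-up labels: (task type, was the response flagged as hallucinated?)
records = [
    ("summarization", False), ("summarization", False),
    ("summarization", True), ("summarization", False),
    ("open_qa", True), ("open_qa", False),
]
rates = hallucination_rate_by_task(records)
overall = sum(flag for _, flag in records) / len(records)
```

In this invented sample the overall rate sits between the two per-task rates and describes neither of them well, which is why the task mix matters when comparing reported numbers.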
Practical Takeaways for BYU-Idaho
This is the part that matters operationally: how to get real value while staying truthful and safe.
1. Use LLMs Differently Depending on Stakes
If the output will inform policy, grades, medical advice, compliance, or public communication, treat the model as a drafting assistant, not a source of truth.
A reliable default rule:
- Low-stakes writing tasks (rewrite, tone, brainstorming): LLM-first is fine
- Factual claims (numbers, dates, citations, "what does policy say"): source-first, LLM-second
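The default rule above can be written down as a small routing function. This is a minimal sketch: the topic labels mirror the high-stakes categories named in this section, while the function name, the label spellings, and the two workflow names are assumptions for illustration.

```python
# Topic labels taken from the high-stakes categories listed above.
HIGH_STAKES_TOPICS = {
    "policy", "grades", "medical", "compliance", "public_communication",
}

def choose_workflow(topic: str, contains_factual_claims: bool) -> str:
    """Return 'source_first' when stakes or factual content call for it."""
    if topic in HIGH_STAKES_TOPICS or contains_factual_claims:
        return "source_first"
    return "llm_first"

brainstorm_route = choose_workflow("brainstorming", contains_factual_claims=False)
policy_route = choose_workflow("policy", contains_factual_claims=False)
```

In practice a person, not the model, should decide which topic label applies and whether a request involves factual claims.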
2. Require Grounding When Accuracy Matters
For any task where truth matters, use one of these patterns:
- RAG + "cite-or-refuse": "Use only the provided documents. If the answer isn't in them, say so."
- Quote-then-summarize: Ask for short quotes/snippets first, then a summary based only on those quotes
- Structured outputs + validation: Force the model into fields (Answer, Evidence, Unknowns, Next Steps) so uncertainty can't hide in prose
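The patterns above can be combined into a prompt builder and a validator. The sketch below assumes the model returns a dictionary with the four fields named above and that each evidence item is meant to be a verbatim quote. Function names and the sample document are hypothetical; the sample text is invented and is not real BYU-Idaho policy.

```python
REQUIRED_FIELDS = ("answer", "evidence", "unknowns", "next_steps")

def build_grounded_prompt(question: str, documents: list[str]) -> str:
    """Assemble a cite-or-refuse prompt from the question and source documents."""
    numbered = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Use only the provided documents. If the answer isn't in them, say so.\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )

def validate_response(response: dict, documents: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the response passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in response]
    if problems:
        return problems
    if not response["evidence"] and not response["unknowns"]:
        problems.append("no evidence and no stated unknowns")
    for quote in response["evidence"]:
        # Exact substring match: paraphrased "quotes" are rejected on purpose.
        if not any(quote in doc for doc in documents):
            problems.append(f"quote not found in documents: {quote!r}")
    return problems

# Invented sample document and responses, for illustration only.
documents = ["Add/drop requests must be submitted by the end of the first week."]
grounded = {"answer": "Requests are due by the end of the first week.",
            "evidence": ["submitted by the end of the first week"],
            "unknowns": [], "next_steps": []}
ungrounded = {"answer": "Requests are due on the 15th.",
              "evidence": ["due on the 15th"],
              "unknowns": [], "next_steps": []}
```

Here `validate_response(grounded, documents)` returns an empty list, while the ungrounded response is flagged because its quote does not appear in the source. Exact matching is strict about whitespace and punctuation, so a production version might normalize text first.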
3. Add a Second Pass That Is Allowed to Disagree
A simple workflow that works well in practice:
- Draft answer
- Critique pass: "List every claim that might be false or needs a source"
- Verify pass: Check against docs, links, or a human reviewer
Even without specialized tooling, this workflow can catch many confident but unsupported claims before they reach a reader.
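The three passes can be sketched as one function. This is a minimal outline, assuming a hypothetical `ask_model` callable that returns text and a `verify_claim` callable that stands in for checking documents, links, or a human reviewer. The stub model and the verified-claims set are invented for illustration.

```python
from typing import Callable

def draft_critique_verify(question: str,
                          ask_model: Callable[[str], str],
                          verify_claim: Callable[[str], bool]) -> dict:
    """Run a draft pass, a critique pass, and a verification pass."""
    draft = ask_model(f"Answer the question.\n\nQuestion: {question}")
    critique = ask_model(
        "List every claim that might be false or needs a source, one per line.\n\n"
        f"Answer to review:\n{draft}"
    )
    # Treat each non-empty critique line as one claim, dropping bullet dashes.
    claims = [line.strip().lstrip("-").strip()
              for line in critique.splitlines() if line.strip()]
    unverified = [claim for claim in claims if not verify_claim(claim)]
    return {"draft": draft, "claims": claims, "unverified": unverified}

def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (illustration only)."""
    if prompt.startswith("List every claim"):
        return "- The deadline is the 15th\n- Late fees are waived"
    return "The deadline is the 15th and late fees are waived."

# Pretend a reviewer has confirmed only one of the claims.
verified_claims = {"The deadline is the 15th"}
result = draft_critique_verify(
    "When is the deadline?", stub_model, lambda claim: claim in verified_claims
)
```

Anything left in `result["unverified"]` should block publication until a person resolves it. Keeping the critique as a separate call is a design choice: the second pass is asked only to find problems, so it is not anchored to defending the draft.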
4. Teach the "Politeness Trap" Explicitly
Students and staff should learn one mental model:
Fluent language is not evidence.
If a response includes facts, require:
- A source
- A quote
- A link to the system of record
5. Decide Ahead of Time What "Good" Looks Like
Before deploying an AI workflow, write down:
- What errors are unacceptable?
- What must always be cited?
- When should the assistant refuse?
- Who is accountable for final approval?
That turns "AI risk" into a manageable QA checklist.
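One lightweight way to hold a team to these answers is to record them in a structure that reports what is still undecided. The sketch below is an assumption about how that might look; the class name, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowPolicy:
    """Written answers to the four pre-deployment questions."""
    unacceptable_errors: list[str] = field(default_factory=list)
    must_cite: list[str] = field(default_factory=list)
    refuse_when: list[str] = field(default_factory=list)
    final_approver: str = ""

    def missing_decisions(self) -> list[str]:
        """Name every question that has not been answered yet."""
        answers = {
            "unacceptable_errors": self.unacceptable_errors,
            "must_cite": self.must_cite,
            "refuse_when": self.refuse_when,
            "final_approver": self.final_approver,
        }
        return [name for name, value in answers.items() if not value]

# Sample values are placeholders, not recommendations.
policy_draft = WorkflowPolicy(
    unacceptable_errors=["wrong deadline"],
    must_cite=["dates", "policy statements"],
)
gaps = policy_draft.missing_decisions()
```

A workflow with a non-empty `gaps` list is not ready to deploy; in this sample, the refusal conditions and the accountable approver are still open.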
The Path Forward
Research is actively improving alignment methods and hallucination mitigation through better reward modeling, grounding (RAG), and alternative alignment approaches like constitutional-style constraints. Meanwhile, the most effective near-term strategy for BYU-Idaho is process design: make truth cheap and mistakes expensive through grounding, structured outputs, and verification loops.
Ultimately, the goal is not to fear these tools, but to use them with the same information-literacy discipline we already apply to web sources: verify, cite, and keep humans in the loop when consequences are real.
Key Sources
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT)
- Sharma et al. (Anthropic), "Towards Understanding Sycophancy in Language Models"
- Li et al., "The Dawn After the Dark..." (HaluEval 2.0), ACL 2024
- Vectara hallucination leaderboard methodology and results (summarization faithfulness)