Teaching with Lies

Curriculum DPO on Synthetic Negatives for Hallucination Detection

Shrey Pandit*†, Ashwin Vinod, Liu Leqi, Ying Ding

The University of Texas at Austin

*Corresponding author | Equal contribution

[Figure: Curriculum DPO Overview]

Abstract

Aligning large language models (LLMs) to accurately detect hallucinations remains a significant challenge due to the sophisticated nature of hallucinated text. Recognizing that hallucinated samples typically exhibit higher deceptive quality than traditional negative samples, we use these carefully engineered hallucinations as negative examples in the DPO alignment procedure. Our method incorporates a curriculum learning strategy, gradually transitioning training from easier samples, identified by the greatest reduction in probability scores from independent fact-checking models, to progressively harder ones. This structured difficulty scaling ensures stable and incremental learning. Experimental evaluation demonstrates that our HaluCheck models, trained with the curriculum DPO approach and high-quality negative samples, significantly improve performance across various metrics, achieving gains of up to 24% on difficult benchmarks like MedHallu and HaluEval. Additionally, HaluCheck models demonstrate robustness in zero-shot settings, significantly outperforming larger state-of-the-art models across various benchmarks.

Introduction

Large language models (LLMs) have achieved impressive performance across numerous NLP tasks, yet their deployment is limited by a tendency to produce fluent but factually incorrect "hallucinations." Such errors erode trust and carry serious risks in high-stakes application domains such as healthcare, software development, and law. Although various detection and mitigation strategies, often based on external fact-checkers or simplistic negative samples, have been proposed, they struggle to identify sophisticated, plausibly crafted falsehoods.

To address these challenges, we introduce a novel alignment strategy that leverages Direct Preference Optimization (DPO) enhanced with a curriculum learning approach tailored specifically for hallucination detection. Our approach incorporates high-quality hallucinated samples as negatives in the alignment process, rather than the low-quality negatives typically drawn from failed generations.

[Figure: DPO Pipeline]

Methodology

Our curriculum-based DPO framework progressively selects hallucinated samples across increasing difficulty ranges, as scored by fact-verification models, to strengthen alignment training. The pipeline follows these key steps:

  1. Score Sample Difficulty: Use MiniCheck grounded-factuality scores to evaluate how well each hallucinated output is supported by its context.
  2. Build Preference Pairs: Pair gold references (chosen) with top-ranked hallucinations (rejected) to form preference pairs for DPO.
  3. Curriculum Learning: Train on easier samples (high grounding scores) first, then progressively harder ones (low grounding scores).
  4. DPO Optimization: Optimize the DPO objective (written out below) with high-quality vetted negatives rather than arbitrary failures.
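
For reference, step 4 optimizes the standard DPO objective of Rafailov et al.; this is the generic formulation rather than notation taken from the paper itself. The gold reference y_w is preferred over the curated hallucination y_l given context x:

  \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Here \pi_\theta is the policy being aligned, \pi_{\mathrm{ref}} a frozen reference model, \beta a temperature controlling deviation from the reference, and \sigma the logistic function. The curriculum only changes which (x, y_w, y_l) triples populate \mathcal{D} at each stage; the objective itself is unchanged.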

This structured difficulty scaling ensures stable and incremental learning, enabling the model to develop robust decision boundaries for hallucination detection.
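
A minimal Python sketch of this pipeline, under stated assumptions: grounding_score is a hypothetical wrapper around a MiniCheck-style checker (not the actual MiniCheck API), the example fields (prompt, gold_answer, hallucination) are illustrative, and dpo_loss is a hand-written version of the objective above operating on precomputed sequence log-probabilities, not the authors' training code.

  import torch.nn.functional as F

  # 1) Score sample difficulty.
  # grounding_score(context, claim) -> float in [0, 1] stands in for a
  # MiniCheck-style grounded-factuality checker (hypothetical wrapper).
  def build_preference_pairs(examples, grounding_score):
      pairs = []
      for ex in examples:  # each ex: {"prompt", "gold_answer", "hallucination"}
          pairs.append({
              "prompt": ex["prompt"],
              "chosen": ex["gold_answer"],       # gold reference
              "rejected": ex["hallucination"],   # engineered hallucination
              "grounding": grounding_score(ex["prompt"], ex["hallucination"]),
          })
      return pairs

  # 2) Curriculum ordering: per step 3 above, higher grounding scores are
  # treated as easier, so training sees them first.
  def curriculum_stages(pairs, n_stages=3):
      ordered = sorted(pairs, key=lambda p: p["grounding"], reverse=True)
      size = (len(ordered) + n_stages - 1) // n_stages
      return [ordered[i:i + size] for i in range(0, len(ordered), size)]

  # 3) DPO loss on a batch of preference pairs, given summed sequence
  # log-probabilities under the policy and the frozen reference model.
  def dpo_loss(policy_chosen_logp, policy_rejected_logp,
               ref_chosen_logp, ref_rejected_logp, beta=0.1):
      margin = ((policy_chosen_logp - ref_chosen_logp)
                - (policy_rejected_logp - ref_rejected_logp))
      return -F.logsigmoid(beta * margin).mean()

  # 4) Train stage by stage (easy -> hard); make_batches and
  # sequence_logprobs are placeholders for standard batching and
  # log-prob computation.
  # for stage in curriculum_stages(build_preference_pairs(data, grounding_score)):
  #     for batch in make_batches(stage):
  #         loss = dpo_loss(*sequence_logprobs(policy, ref_model, batch))
  #         loss.backward(); optimizer.step(); optimizer.zero_grad()

The beta value and the three-stage split are illustrative defaults; how the optimizer is carried over between stages follows the training recipe and is not pinned down by this sketch.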

Results

Our HaluCheck models demonstrate significant improvements over baseline models across multiple benchmarks. Below we present comprehensive evaluation results.

HaluCheck vs Baseline Performance

Model             Avg F1   MedHallu F1   MedHallu Prec.   MedHallu Acc.   HaluEval F1   HaluEval Prec.   HaluEval Acc.
Qwen-2.5 1.5B     0.464    0.227         0.642            0.525           0.701         0.568            0.610
Llama-3.2 1B      0.237    0.108         0.406            0.494           0.366         0.450            0.466
Llama-3.2 3B      0.612    0.499         0.696            0.566           0.726         0.743            0.732
Llama-3.1 8B      0.571    0.522         0.791            0.608           0.620         0.903            0.711
GPT-4o            0.799    0.737         0.723            0.772           0.862         0.896            0.867
HaluCheck 1B      0.637    0.664         0.511            0.527           0.611         0.481            0.468
HaluCheck 3B      0.756    0.759         0.845            0.782           0.753         0.857            0.767

Zero-shot Evaluation

Model            DROP    CovidQA   PubMedQA   Average
Llama 3.2 3B     52.50   56.10     55.20      54.60
HaluCheck 3B     57.30   62.50     57.70      59.16
GPT-3.5-Turbo    57.20   56.70     62.80      58.90

HaluCheck 3B demonstrates strong zero-shot performance, outperforming both its base model and GPT-3.5-Turbo on average across three unseen benchmarks.

Key Contributions

Curriculum Learning

A novel curriculum-based sampling strategy that progressively selects hallucinated samples across increasing difficulty ranges.

HaluCheck Models

A suite of 1B and 3B parameter models aligned with our curriculum DPO procedure that outperform much larger state-of-the-art LLMs on hallucination-detection benchmarks.

Strong Transferability

Demonstrated robustness across multiple benchmarks and domains, including zero-shot evaluation.

Conclusion

We present HaluCheck, a curriculum-guided Direct Preference Optimization (DPO) framework for training LLMs to detect hallucinations reliably. A key contribution lies in replacing generic, model-generated failures with carefully curated, difficulty-ranked hallucinated samples as negative preferences during DPO alignment.

This structured curriculum yields consistent gains, with HaluCheck 3B achieving up to 24% relative improvement in F1 scores while remaining competitive with far larger models such as GPT-4o. The strong zero-shot performance further validates that difficulty-aware negative sampling markedly strengthens the robustness of smaller language models.

Our results demonstrate that thoughtful curriculum design and high-quality negative examples can enable smaller models to achieve state-of-the-art performance in hallucination detection, opening new avenues for efficient and reliable LLM deployment in critical applications.


BibTeX Citation

@article{pandit2025teaching,
  title={Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection}, 
  author={Shrey Pandit and Ashwin Vinod and Liu Leqi and Ying Ding},
  journal={arXiv preprint arXiv:2505.17558},
  year={2025}
}