
How To Evaluate LLM Hallucinations and Faithfulness

Jul 2, 2025 · 8 Min Read
Written by Varsha G

Large language models are widely used, so it is important to make sure the answers they generate are accurate and grounded in correct information. Evaluating these aspects helps developers and researchers understand how reliable an LLM is, especially in critical areas like healthcare, law, and education.

The main goal is to avoid wrong answers and make sure the model returns correct, fact-based information. In this blog, let's look at faithfulness and hallucinations in detail and see how to evaluate them in practice.

What is Faithfulness in LLMs?

Faithfulness means the LLM should not change the meaning of the information it was given. A faithful answer stays true to the source and does not add, remove, or change important details.

For example, suppose a user asks, “Who led the Salt March and why?” and the source document says, “Mahatma Gandhi led the Salt March in 1930 to protest against British rule in India.” If the model answers, “Jawaharlal Nehru led the Salt March in 1942 to protest high taxes,” it has changed the facts.

This is an unfaithful answer: the model was given the correct information but produced a response that does not stick to the source.

A faithful response = No made-up information + directly backed by the source.

What are LLM Hallucinations?

Hallucinations happen when a large language model gives information that sounds correct but is not actually true or grounded in real facts. For example, if you ask something like “Explain LangChain”, the LLM (gpt-3.5-turbo) might reply, “LangChain is a company that provides pre-trained translation models for multilingual chatbots.”

The model’s answer sounds believable, but it is not true. Since gpt-3.5-turbo was trained before the release of LangChain, it cannot describe the tool accurately. LangChain is not a company that builds translation models; it is an open-source framework for developing LLM-powered apps. This is a hallucinated and unfaithful response because it changes the meaning and adds incorrect details that were never in the source.

Hallucination is a big problem, especially when people use LLMs for important tasks like medical advice or legal help. That's why it is important to detect and reduce hallucinations.

2 Types of Hallucinations in LLMs

  1. Factuality Hallucinations
  2. Faithfulness Hallucinations

What are Factuality Hallucinations in LLMs?

A factuality hallucination happens when the LLM gives information that sounds true but is factually wrong. For example, you ask, “Who is the president of India in 2024?” and the model replies, “Amit Shah is the president of India.”

The sentence sounds correct, but it is factually wrong: the president of India in 2024 is Droupadi Murmu. This is a factual hallucination. These kinds of mistakes are risky, especially in areas like news, education, and healthcare. That’s why checking for factual accuracy is very important when using AI.

Two kinds of factuality hallucinations

Factual Inconsistency:

Factual inconsistency means the model gives answers that don't match the facts in the original information. It may change, add, or remove important details. For example, if you ask the model, “When was the COVID-19 vaccine first rolled out?” and the LLM replies, “The COVID-19 vaccine was first rolled out in 2021,” this is a factual inconsistency because the actual year is 2020; the model changed the year.

Factual Fabrication:

Factual fabrication means the model makes up information that is not in the original source and is not true. It creates facts that were never mentioned.

For example, if the source says, “Unicorns are mythical creatures often described as white horses with a single horn on their forehead,” and you ask the model, “Where do unicorns live?” but it replies, “Unicorns live in the forests of Scotland and are often seen by travelers,” this is factual fabrication.

The model added a fake detail (a real-world habitat for unicorns, even though the source describes them as mythical) that was not in the original text. This kind of error is risky because the answer sounds real but is actually made up.

What are Faithfulness Hallucinations in LLMs?

Faithfulness Hallucinations happen when the model doesn’t stick to the input or instruction. 

Three kinds of faithfulness hallucinations

  1. Instruction Inconsistency → The model ignores the user's instructions.

Example: Instruction: “Translate this question to Spanish.” The model gives the answer in English instead.

  2. Context Inconsistency → The model says something that doesn’t match the provided information.

Example: If the context says, “Ananya Sharma was born in Chennai, India,” and the user asks, “Where was Ananya Sharma born?” but the model replies, “Ananya Sharma was born in Mumbai,” this is context inconsistency.

The model had the correct information, but gave an answer that contradicts it. Even though the question is simple and the answer sounds real, it doesn’t match the provided details, which makes it incorrect.

  3. Logical Inconsistency → The model starts right, but makes a logical or calculation mistake. Example: In a math problem, it begins solving correctly, but messes up the final step.

Why Do We Need To Evaluate Hallucinations And Faithfulness? 

It is important to evaluate hallucinations and faithfulness to make sure that large language models (LLMs) give correct and trustworthy answers. Sometimes, LLMs can sound confident but give wrong or made-up information (hallucinations), or they may change the original meaning (lack of faithfulness). 

This can cause problems, especially in areas like healthcare, law, or education, where accurate information is very important. By checking how often LLMs make these mistakes, researchers and developers can improve their models to be more reliable, helpful, and safe to use in real-life situations.

How To Evaluate Hallucinations And Faithfulness In 7 Steps?

Human Evaluation – A person reads the LLM answer and checks if it’s correct and matches the source. This is the most accurate method.

Automatic Evaluation – Tools or models check whether the LLM answer is supported by the source, using techniques like the ones below (a small NLI sketch follows this list):

  • Similarity checks (comparing answer and source)
  • Fact-checking models (detecting unsupported claims)
  • NLI (Natural Language Inference) (seeing if the answer logically follows the source)
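As a small illustration of the NLI idea, here is a minimal sketch that checks whether a source sentence entails a model's answer using a cross-encoder from the sentence-transformers library. The model name, label order, and example sentences are assumptions for illustration only; they are not part of the app built later in this post.

from sentence_transformers import CrossEncoder

# NLI cross-encoder (assumed model choice); it scores (premise, hypothesis) pairs
nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")
labels = ["contradiction", "entailment", "neutral"]  # label order reported for this model family

source = "Mahatma Gandhi led the Salt March in 1930 to protest against British rule in India."
answer = "Jawaharlal Nehru led the Salt March in 1942 to protest high taxes."

# A faithful answer should be entailed by the source; a contradiction signals a hallucination
scores = nli_model.predict([(source, answer)])
print(labels[scores[0].argmax()])  # expected: "contradiction" for this unfaithful answer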

Code

This code creates a simple web app where you upload a PDF and ask a question. It reads the PDF, finds the sentence most relevant to your question, and asks a GPT model to answer based on that sentence. Then it uses DeepEval to check how good the answer is.

DeepEval tells you whether the answer contains any made-up information (hallucination) and whether it stays true to the original text (faithfulness). It returns scores and reasons so you can understand how correct and reliable the answer is.


 Step 1: Import Dependencies

import gradio as gr
from sentence_transformers import SentenceTransformer
import torch
import fitz
import os
from openai import OpenAI
from deepeval.metrics import HallucinationMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from google.colab import userdata

These libraries help us:

  • gradio builds the simple web interface for uploading a PDF and asking a question.
  • SentenceTransformer (with torch) converts text into embeddings so the question can be compared with sentences from the PDF.
  • fitz (from PyMuPDF) reads PDF files and extracts text from them.
  • OpenAI sends the prompt to a GPT model and returns its answer; userdata reads the API key stored in Colab secrets.
  • HallucinationMetric and FaithfulnessMetric are evaluation metrics from the DeepEval library. They check whether a model’s output is hallucinated (made up) or faithful (true to the source).
  • LLMTestCase wraps the model’s input, output, and context into a test case that can be passed to the evaluation metrics.

Step 2: Environment setup and loading the Sentence Embedding Model

# Set OpenAI key from Colab secrets
os.environ["OPENAI_API_KEY"] = userdata.get("OPEN_AI_API_KEY")


# Init OpenAI client
client = OpenAI()

model = SentenceTransformer('all-MiniLM-L6-v2')

Here we store the OpenAI key from Colab secrets, create the OpenAI client, and load the all-MiniLM-L6-v2 sentence embedding model, which converts text (like questions or sentences) into numerical vectors for comparison.
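As a quick illustration of what this embedding model does (the question and candidate sentences below are made up for the example), you can encode a question and a few sentences and compare them with cosine similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

question = "Who led the Salt March?"
candidates = [
    "Mahatma Gandhi led the Salt March in 1930 to protest against British rule in India.",
    "Unicorns are mythical creatures with a single horn on their forehead."
]

# Encode the question and the candidate sentences into vectors, then compare them
question_emb = model.encode(question, convert_to_tensor=True)
candidate_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(question_emb, candidate_embs)  # shape (1, 2)

# The related sentence should score much higher than the unrelated one
print(scores)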

Step 3: Initialize Evaluation Metrics

hallucination_metric = HallucinationMetric(threshold=0.5)
faithfulness_metric = FaithfulnessMetric()

We create two metrics (a quick standalone example follows this list):

  • hallucination_metric: Checks if the response contains any made-up facts.
  • faithfulness_metric: Checks if the response follows the context properly.
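
Before wiring these metrics into the PDF app, here is a minimal standalone sketch of how they are used, reusing the Salt March example from earlier. It assumes OPENAI_API_KEY is already set, since DeepEval uses an LLM as the judge behind these metrics.

context = ["Mahatma Gandhi led the Salt March in 1930 to protest against British rule in India."]

test_case = LLMTestCase(
    input="Who led the Salt March and why?",
    actual_output="Jawaharlal Nehru led the Salt March in 1942 to protest high taxes.",
    context=context,            # used by HallucinationMetric
    retrieval_context=context   # used by FaithfulnessMetric
)

hallucination_metric.measure(test_case)
faithfulness_metric.measure(test_case)

# An unfaithful answer like this one should be flagged by both metrics
print(hallucination_metric.score, hallucination_metric.reason)
print(faithfulness_metric.score, faithfulness_metric.reason)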

Step 4: Extract Text from PDF Function

def extract_text_from_pdf(pdf_file):
    doc = fitz.open(pdf_file.name)  # open the uploaded PDF using its file path
    text = ""
    for page in doc:
        text += page.get_text()
    return text

Open the uploaded PDF file using its filename. Loop through all the pages in the PDF and collect all the text into one string. Then return the complete text.

Step 5: Get GPT response

def get_llm_response(context, question):
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o-mini", 
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return response.choices[0].message.content.strip()
Suggested Reads- What are Temperature, Top_p, and Top_k in AI?

This function, get_llm_response, is used to ask a question to the LLM (like GPT-4o-mini) using a given context. It first creates a prompt by combining the context (a short part of the PDF related to the question) and the question itself. 

Then it sends this prompt to the LLM using the new chat.completions.create method. The model reads the context, understands the question, and generates an answer. Finally, the function returns that answer as a clean string. 

This function is helpful when you want the LLM to answer based on specific information, like from a PDF file.
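For example (with a hand-written context string just for illustration), calling the function could look like this:

# Hypothetical usage with a made-up context; in the app, the context comes from the PDF
context = "Mahatma Gandhi led the Salt March in 1930 to protest against British rule in India."
print(get_llm_response(context, "Who led the Salt March and why?"))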

 Step 6: Main Logic for PDF QA and Evaluation

def process_pdf_and_question(pdf_file, question):
    try:
        text = extract_text_from_pdf(pdf_file)
        reference_texts = [sent.strip() for sent in text.split(".") if sent.strip()]


        if not reference_texts:
            return "", "", "", "No text extracted from PDF.", ""


        question_embedding = model.encode(question, convert_to_tensor=True)


        best_score = -1.0
        best_context = ""
        for ref_text in reference_texts:
            ref_embedding = model.encode(ref_text, convert_to_tensor=True)
            similarity = torch.nn.functional.cosine_similarity(
                question_embedding.unsqueeze(0), ref_embedding.unsqueeze(0)
            ).item()
            if similarity > best_score:
                best_score = similarity
                best_context = ref_text


        model_output = get_llm_response(best_context, question)


        test_case = LLMTestCase(
            input=question,
            actual_output=model_output,
            context=[best_context],
            retrieval_context=[best_context]
        )
        hallucination_metric.measure(test_case)
        faithfulness_metric.measure(test_case)


        return (
            question,
            best_context,
            model_output,
            f"Hallucination Score: {hallucination_metric.score:.2f}\nReason: {hallucination_metric.reason}",
            f"Faithfulness Score: {faithfulness_metric.score:.2f}\nReason: {faithfulness_metric.reason}"
        )
    except Exception as e:
        return "", "", "", f"❌ Error: {str(e)}", ""

This function runs when a user uploads a PDF and asks a question. It extracts all the text from the PDF and splits it into sentences. If no text can be extracted, it returns an error message. question_embedding converts the user's question into an embedding so it can be compared with each sentence.

best_score = -1
best_context = ""

Initialise variables to track the best match.

for ref_text in reference_texts:
    ref_embedding = model.encode(ref_text, convert_to_tensor=True)
    similarity = torch.nn.functional.cosine_similarity(
        question_embedding.unsqueeze(0), ref_embedding.unsqueeze(0)
    ).item()

For each sentence from the PDF:

  • Convert it into an embedding
  • Compare it with the question using cosine similarity
  • Higher similarity = better match

if similarity > best_score:
    best_score = similarity
    best_context = ref_text

If this sentence is the best match so far, update the best_score and best_context.

model_output = get_llm_response(best_context, question)

This line calls the LLM to get an answer. It sends the best context (the most relevant sentence from the PDF) and the question to the get_llm_response function, which returns the model's response based on that context.

Suggested Reads- Unlocking LLM Potential Through Function Calling

test_case = LLMTestCase(
    input=question,
    actual_output=model_output,
    context=[best_context],
    retrieval_context=[best_context]
)

 Wrap the input question, response, and matched context into a test case object.

hallucination_metric.measure(test_case)
faithfulness_metric.measure(test_case)

Run the evaluation to generate:

  • Hallucination score (is anything made up?)
  • Faithfulness score (does it follow the context?)

return (
    question,
    best_context,
    model_output,
    f"Hallucination Score: {hallucination_metric.score:.2f}\nReason: {hallucination_metric.reason}",
    f"Faithfulness Score: {faithfulness_metric.score:.2f}\nReason: {faithfulness_metric.reason}"
)

Send back everything:

  • Original question
  • Best-matched context
  • LLM response
  • Hallucination & Faithfulness results (score and reason for each)

 Step 7: Gradio Interface Setup

demo = gr.Interface(
    fn=process_pdf_and_question,
    inputs=[
        gr.File(label="Upload PDF", file_types=[".pdf"]),
        gr.Textbox(label="Ask a Question")
    ],
    outputs=[
        gr.Textbox(label="Your Question"),
        gr.Textbox(label="Best Matched Context"),
        gr.Textbox(label="LLM Output (OpenAI GPT)"),
        gr.Textbox(label="Hallucination Evaluation"),
        gr.Textbox(label="Faithfulness Evaluation")
    ],
    title="PDF QA + Hallucination & Faithfulness Checker (OpenAI GPT)",
    description="Upload a PDF and ask a question. The system finds the best context, gets a GPT answer, and evaluates hallucination & faithfulness using DeepEval."
)
demo.launch(share=True)

Output of the LLM Hallucinations and Faithfulness checker

Conclusion 

The DeepEval tool helps users check how trustworthy an LLM's response is by comparing it against content from a PDF. With DeepEval, it becomes easier to spot when the model makes up facts (hallucinations) or changes the original meaning (faithfulness issues).

This kind of evaluation is important to build more accurate, safe, and reliable AI systems, especially when dealing with real-world documents and critical information. 

Happy learning!

Varsha G

I'm an AI/ML Intern, passionate about building real-world applications using large language models, voice AI, and data privacy technologies.
