Absortio

Email → Summary → Bookmark → Email

GitHub - KRLabsOrg/LettuceDetect: LettuceDetect is a hallucination detection framework for RAG applications.

Extract

LettuceDetect is a hallucination detection framework for RAG applications. - KRLabsOrg/LettuceDetect

Summary

Main Summary

LettuceDetect emerges as a fundamental and highly efficient tool for combating hallucinations in Retrieval-Augmented Generation (RAG) systems. This detector stands out for its lightweight approach and its ability to identify, with token-level precision, the portions of an answer that lack support in the provided context. Trained and evaluated on the rigorous RAGTruth dataset and powered by ModernBERT for extended context processing (up to 4K tokens), LettuceDetect addresses two critical limitations of current models: the context-window constraints of traditional encoder-based methods and the computational inefficiency inherent in LLM-based approaches. Its architecture, inspired by the encoder-based model of the Luna paper, delivers superior performance, outperforming all encoder-based models and prompt-based methods on the RAGTruth dataset while remaining significantly faster and more compact. It even rivals large-scale fine-tuned LLMs, establishing itself as a robust and practical solution for improving the reliability of RAG outputs.

Key Elements

  • Advanced Methodology and Operational Efficiency: LettuceDetect draws on the Luna paper for its encoder-based architecture and uses ModernBERT to extend its context window to 4K tokens, overcoming the limitations of traditional models. This combination enables token-level hallucination detection and optimized performance, achieving faster inference with a smaller model size, which makes it well suited for production environments.
  • Exceptional Performance on RAGTruth: On the demanding RAGTruth dataset, the lettucedetect-large-v1 model reaches an impressive example-level F1 score of 79.22%. This result clearly outperforms prompt-based methods such as GPT-4 (63.4%) and encoder-based models such as Luna (65.4%), and even fine-tuned LLMs such as LLAMA-2-13B (78.7%). At the span level it achieves the best results across all data categories, consolidating its position as one of the most capable tools in its class.
  • Developer-Oriented Features and Easy Integration: The project encourages adoption and collaboration by releasing its code and models under the MIT license. It offers straightforward integration through a Python API installable via pip, along with pretrained models available on Hugging Face (such as lettucedetect-base and lettucedetect-large), allowing developers to add it to their RAG systems in a few lines of code.
  • Strategic Solution for RAG Challenges: LettuceDetect directly tackles the critical challenges of RAG systems: the limited context window of traditional models and the computational inefficiency of LLM-based approaches. By offering a solution that is both high-performing and computationally efficient, it enables RAG systems to produce more reliable, better-grounded answers and reduces the spread of incorrect information.

Analysis and Implications

LettuceDetect represents a significant step forward in improving the trustworthiness and reliability of RAG systems, with direct implications for the adoption of these technologies in critical applications. Its ability to identify hallucinations with high precision and efficiency will allow organizations to deploy RAG-based solutions with greater confidence.

Content

LettuceDetect 🥬🔍

Because even AI needs a reality check! 🥬

LettuceDetect is a lightweight and efficient tool for detecting hallucinations in Retrieval-Augmented Generation (RAG) systems. It identifies unsupported parts of an answer by comparing it to the provided context. The tool is trained and evaluated on the RAGTruth dataset and leverages ModernBERT for long-context processing, making it ideal for tasks requiring extensive context windows.

Our models are inspired by the Luna paper, which describes an encoder-based model, and use a similar token-level approach.


Highlights

  • LettuceDetect addresses two critical limitations of existing hallucination detection models:
    • Context window constraints of traditional encoder-based methods
    • Computational inefficiency of LLM-based approaches
  • Our models currently outperform all other encoder-based and prompt-based models on the RAGTruth dataset while being significantly faster and smaller
  • They achieve higher scores than some fine-tuned LLMs, e.g. LLAMA-2-13B presented in RAGTruth, coming up just short of the LLM fine-tuned in the RAG-HAT paper
  • We release the code, the model and the tool under the MIT license

Get going

Features

  • Token-level precision: detect exact hallucinated spans
  • 🚀 Optimized for inference: smaller model size and faster inference
  • 🧠 4K context window via ModernBERT
  • ⚖️ MIT-licensed models & code
  • 🤖 HF Integration: one-line model loading
  • 📦 Easy-to-use Python API: installable from pip and integrated into your RAG system with a few lines of code

Installation

Install from the repository:
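
A typical editable install from a clone of the KRLabsOrg/LettuceDetect repository would look like the following sketch; the exact commands are an assumption rather than a quote from the upstream README:

# assumed workflow: clone the repository and install it in editable mode
git clone https://github.com/KRLabsOrg/LettuceDetect.git
cd LettuceDetect
pip install -e .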

From pip:

pip install lettucedetect

Quick Start

Check out our models published on Hugging Face:

You can get started right away with just a few lines of code.

from lettucedetect.models.inference import HallucinationDetector

# For a transformer-based approach:
detector = HallucinationDetector(
    method="transformer", model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"
)

contexts = ["France is a country in Europe. The capital of France is Paris. The population of France is 67 million.",]
question = "What is the capital of France? What is the population of France?"
answer = "The capital of France is Paris. The population of France is 69 million."

# Get span-level predictions indicating which parts of the answer are considered hallucinated.
predictions = detector.predict(context=contexts, question=question, answer=answer, output_format="spans")
print("Predictions:", predictions)

# Predictions: [{'start': 31, 'end': 71, 'confidence': 0.9944414496421814, 'text': ' The population of France is 69 million.'}]

Performance

Example level results

We evaluate our model on the test set of the RAGTruth dataset. Our large model, lettucedetect-large-v1, achieves an overall F1 score of 79.22%, outperforming prompt-based methods like GPT-4 (63.4%) and encoder-based models like Luna (65.4%). It also surpasses the fine-tuned LLAMA-2-13B (78.7%) (presented in RAGTruth) and is competitive with the SOTA fine-tuned LLAMA-3-8B (83.9%) (presented in the RAG-HAT paper). Overall, lettucedetect-large-v1 and lettucedetect-base-v1 are highly performant models while remaining efficient in inference settings.

The results on the example-level can be seen in the table below.

Example-level Results

Span-level results

At the span level, our model achieves the best scores across all data types, significantly outperforming previous models. The results can be seen in the table below. Note that we do not compare to models like RAG-HAT here, since they do not report a span-level evaluation.

Span-level Results

How does it work?

The model is a token-level classifier that predicts whether each token is hallucinated. It is trained to identify the hallucinated tokens in the answer, given the context and the question.

flowchart LR
    subgraph Inputs
        Context["**Context**: France is a country in Europe. Population is 67 million."]
        Question["**Question**: What is the capital? What is the population?"]
        Answer["**Answer**: The capital of France is Paris. The population is 69 million."]
    end

    Model["**LettuceDetect**: Token Classification"]
    Tokens["**Token Probabilities**: <br> ... <br> The [0.01] <br> population [0.02] <br> is [0.01] <br> 69 [0.95] <br> million [0.95]"]

    Context --> Model
    Question --> Model
    Answer --> Model
    Model --> Tokens
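
To make the token-to-span step concrete, the sketch below groups consecutive tokens whose hallucination probability exceeds a threshold into character-level spans. The threshold, the helper function, and the assumption that each token carries character offsets are illustrative; this is not LettuceDetect's internal implementation.

# Illustrative sketch: merge consecutive high-probability tokens into spans.
# Assumes each token dict carries character offsets into the answer string.
def tokens_to_spans(tokens, answer, threshold=0.5):
    spans, current = [], None
    for tok in tokens:
        if tok["prob"] >= threshold:
            if current is None:
                current = {"start": tok["start"], "end": tok["end"], "confidence": tok["prob"]}
            else:
                current["end"] = tok["end"]
                current["confidence"] = max(current["confidence"], tok["prob"])
        elif current is not None:
            current["text"] = answer[current["start"]:current["end"]]
            spans.append(current)
            current = None
    if current is not None:
        current["text"] = answer[current["start"]:current["end"]]
        spans.append(current)
    return spans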

Loading
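
The pretrained checkpoints can also be loaded directly through the Hugging Face transformers library. A minimal sketch, assuming the published model is a standard token-classification checkpoint as the diagram above suggests:

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumption: the checkpoint is a standard token-classification model on the Hub.
model_name = "KRLabsOrg/lettucedect-base-modernbert-en-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)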

Training a Model

First, download the RAGTruth dataset from here and place it under the data/ragtruth directory. Then run

python lettucedetect/preprocess/preprocess_ragtruth.py --input_dir data/ragtruth --output_dir data/ragtruth

This will create a data/ragtruth/ragtruth_data.json file which contains the processed data.

Then you can train the model with the following command.

python scripts/train.py \
    --ragtruth-path data/ragtruth/ragtruth_data.json \
    --model-name answerdotai/ModernBERT-base \
    --output-dir output/hallucination_detector \
    --batch-size 4 \
    --epochs 6 \
    --learning-rate 1e-5 

We trained our models for 6 epochs with a batch size of 8 on a single A100 GPU.

Evaluation

You can evaluate the models at each level (example, token, and span) and on each data type.

python scripts/evaluate.py \
    --model_path outputs/hallucination_detector \
    --data_path data/ragtruth/ragtruth_data.json \
    --evaluation_type example_level

Model Output Format

The model can output predictions in two formats:

Span Format

[{
    'text': str,        # The hallucinated text
    'start': int,       # Start position in answer
    'end': int,         # End position in answer
    'confidence': float # Model's confidence (0-1)
}]
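
The start and end offsets index into the answer string, so flagged text can be recovered or highlighted directly. A small illustrative snippet, reusing the predictions and answer variables from the Quick Start example:

# Each span's offsets slice the original answer back out of the string.
for span in predictions:
    assert answer[span["start"]:span["end"]] == span["text"]
    print(f"Unsupported ({span['confidence']:.2f}):{span['text']}")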

Token Format

[{
    'token': str,       # The token
    'pred': int,        # 0: supported, 1: hallucinated
    'prob': float       # Model's confidence (0-1)
}]
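
Token-level output can presumably be requested the same way as spans by changing the output_format argument of predict; the exact value "tokens" below is an assumption based on the format name:

# Assumed usage: ask the detector for per-token predictions instead of spans.
token_predictions = detector.predict(
    context=contexts, question=question, answer=answer, output_format="tokens"
)
for item in token_predictions:
    if item["pred"] == 1:  # 1 marks a token flagged as hallucinated
        print(item["token"], item["prob"])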

Streamlit Demo

Check out the Streamlit demo to see the model in action.

Install streamlit:
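
A plain pip install of Streamlit should be enough (an assumption; the demo script itself ships with the repository):

pip install streamlit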

Run the demo:

streamlit run demo/streamlit_demo.py

Use the Web API

LettuceDetect comes with its own web API and Python client library. To use it, make sure to install the package with the optional API dependencies:

pip install lettucedetect[api]

Start the API server with the scripts/start_api.py script:

python scripts/start_api.py dev  # use "prod" for production environments

Usage:

usage: start_api.py [-h] [--model MODEL] [--method {transformer}] {prod,dev}

Start lettucedetect Web API.

positional arguments:
  {prod,dev}            Choose "dev" for development or "prod" for production
                        environments. The serve script uses "fastapi dev" for "dev" or
                        "fastapi run" for "prod" to start the web server. Additionally
                        when choosing the "dev" mode, python modules can be directly
                        imported from the repository without installing the package.

options:
  -h, --help            show this help message and exit
  --model MODEL         Path or huggingface URL to the model. The default value is
                        "KRLabsOrg/lettucedect-base-modernbert-en-v1".
  --method {transformer}
                        Hallucination detection method. The default value is
                        "transformer".

Example using the python client library:

from lettucedetect_api.client import LettuceClient

contexts = [
    "France is a country in Europe. "
    "The capital of France is Paris. "
    "The population of France is 67 million.",
]
question = "What is the capital of France? What is the population of France?"
answer = "The capital of France is Paris. The population of France is 69 million."

client = LettuceClient("http://127.0.0.1:8000")
response = client.detect_spans(contexts, question, answer)
print(response.predictions)

# [SpanDetectionItem(start=31, end=71, text=' The population of France is 69 million.', hallucination_score=0.989198625087738)]

See demo/detection_api.ipynb for more examples. For async support use the LettuceClientAsync class instead.
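
A sketch of the async variant, assuming LettuceClientAsync mirrors the synchronous client and exposes detect_spans as an awaitable with the same arguments (contexts, question, and answer as defined above):

import asyncio

from lettucedetect_api.client import LettuceClientAsync

async def main():
    # Assumption: same signature as LettuceClient.detect_spans, but awaitable.
    client = LettuceClientAsync("http://127.0.0.1:8000")
    response = await client.detect_spans(contexts, question, answer)
    print(response.predictions)

asyncio.run(main())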

License

MIT License - see LICENSE file for details.

Citation

Please cite the following paper if you use LettuceDetect in your work:

@misc{Kovacs:2025,
      title={LettuceDetect: A Hallucination Detection Framework for RAG Applications}, 
      author={Ádám Kovács and Gábor Recski},
      year={2025},
      eprint={2502.17125},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.17125}, 
}

Source: GitHub